semanticate: Work on the XML elements tree

gvtulder commented 2 months ago

Rough example (see #752) of how the semanticate rules could be applied to the content only by walking recursively through the XML tree. Not sure if this is a good idea, so this is a fairly quick proof-of-concept.

The tricky bit is that the regular expressions might introduce new XML elements. In this version, this works by applying the regex to text string (which doesn't contain any tags), parsing the regex output to create any new XML elements, and finally inserting those in the document at the right place.

acabal commented 2 months ago

I understand the reasoning for this PR, but I think regex is acceptable for this use case 95% of the time. It's much simpler to understand and - importantly - maintain, and with full XML parsing there will likely be complex edge cases that we will have to custom code around. Also any errors introduced by a regex are easily corrected using git diff - it never results in big catastrophes.

If you want to play around with it further go ahead, but I think you'll find that this will become very unwieldy very fast. If you're able to put together a complete solution that is 100% correct vs. our current implementation's 95% correct, and it doesn't look like a maintenance hellscape, then we can consider it.

alerque commented 2 months ago

Sure 95% of the time a regex will get you there, but your existing regexes are needing to be stuffed with lookaheads and other features to compensate for the lack of context. Flatly XML cannot be parsed with regexes. On the other hand walking XML trees using something built for the job is pretty straight forward and should be much easier to maintain and contribute to in the long run.

acabal commented 2 months ago

I understand. We do not need lookaheads in the general case (or, strictly speaking, in this case). That is because the occasions in which the regexes without lookaheads are wrong, given our narrow use cases, are rare, and undoing the mistake one of these regex makes is simple - an error is never catastrophic. Remember that these tools are not general purpose XML processors, they are for the very narrow use case of ebooks in the English language that are made with our style guide.

On the other hand, parsing an entire syntax tree and dealing with the hundreds of edge cases of the various English language constructs we must examine is extremely complex, error prone, and most importantly, difficult to maintain. It will also likely be quite slow because the tree must be walked and updated for almost every check requiring even the most basic of logic.

I understand the desire for intellectual purity and formal correctness, but using regex is easier to develop and maintain as a practical matter. Of course I'm happy to be proved wrong!

gvtulder commented 2 months ago

I agree with the point that regexes are simpler, and I don't know where the sweet spot lies. However, a more XML-based approach could still use a lot of regexes, the main difference is choosing where they are applied.

In my current implementation, the regex rules are very similar to what they are in the current implementation. I copy-pasted them from the old function and made only some minor changes:

https://github.com/standardebooks/tools/blob/b4e9357de64707d543d42d0fd8bb0d0a53f5fc2e/se/formatting.py#L176-L180

So the representation of the rules doesn't have to change much. The difference is that these regexes are no longer applied to the entire raw XML document at once, but only to one bit of text at a time. This avoids changes to attribute values and makes it easier to say things like: "do not apply these rules inside <abbr> tags".

One thing that does become a bit more difficult is when the regexes look at XML tags. For example, when deciding to add class="eoc" to <abbr> tags at the end of a paragraph <p>: (original regexes in the comments)

https://github.com/standardebooks/tools/blob/b4e9357de64707d543d42d0fd8bb0d0a53f5fc2e/se/formatting.py#L228-L239

Or a simpler example, checking we're inside a <p> and using the $ marker in the regex:

https://github.com/standardebooks/tools/blob/b4e9357de64707d543d42d0fd8bb0d0a53f5fc2e/se/formatting.py#L192-L194

Secondly, it also becomes more difficult to apply regexes that match text on both sides of a tag, but there currently aren't any rules that need this.

I'm not convinced this is necessarily better, but it's good to know that it's not a binary choice between regex and syntax tree parsing.

standardebooks / tools

semanticate: Work on the XML elements tree #753