Open tabatkins opened 8 years ago
So high-level design:
Step 1 should be searching over the possible-starts for all the turned-on shorthands at the same time; I cannot execute a full-document walk for every single one.
As I note in the OP, I don't need to do any crazy descent-tracking; I just look at a single element's text nodes at a time. When something gets recognized, it consumes some chunk of the element's text nodes and child elements, then I continue on after that point looking for more. Then when the element's text is fully consumed, I do a fresh walk over child elements (because the list might have changed) and recurse.
So I think for the simultaneous design, I need to put together a parse-trie, so I can quickly and easily tell whether a character might start something. I need to build it dynamically based on the Markup Shorthands options, but it won't change over the course of a document.
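A rough sketch of what such a parse-trie could look like, built once from whichever opening strings are turned on. The names here (`TrieNode`, `build_trie`, `openers_at`) and the `OPENERS` table are made up for illustration and are not Bikeshed's actual code or its real option names.

```python
from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Names of shorthands whose opening string ends at this node.
        self.shorthands: List[str] = []


def build_trie(openers: Dict[str, str]) -> TrieNode:
    """Build a trie from {shorthand name: opening string}."""
    root = TrieNode()
    for name, opener in openers.items():
        node = root
        for ch in opener:
            node = node.children.setdefault(ch, TrieNode())
        node.shorthands.append(name)
    return root


def openers_at(root: TrieNode, text: str, pos: int) -> List[str]:
    """Return every shorthand whose opening string starts at text[pos]."""
    found: List[str] = []
    node = root
    for ch in text[pos:]:
        node = node.children.get(ch)
        if node is None:
            break
        found.extend(node.shorthands)
    return found


# Hypothetical table of enabled shorthands and their opening strings.
OPENERS = {"biblio": "[[", "section": "[[#", "dfn": "[=", "var": "|"}
trie = build_trie(OPENERS)
```

A character can only start something if it is a key of `trie.children`, so most characters are rejected with a single dict lookup. Where one opener is a prefix of another, `openers_at` reports both and leaves the choice to the caller.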
I'll need to separate out the current shorthands into an opening string, a closing string, and an internal regex.
The regex will match against the combined text content of the text and elements between the opening and closing string. I can then use the character offsets to properly chop them up and extract mixed text/element content when necessary.
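To make the offset bookkeeping concrete, here is a minimal sketch under a simplified node model: a child is either a `str` (text node) or an `Elem`. Both `Elem` and the helpers `combine` and `slice_nodes` are assumptions for illustration, not Bikeshed's DOM API. It records which span of the combined text each node covers, then uses a regex group's offsets to cut the mixed content back out.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Elem:
    tag: str
    children: List["Node"] = field(default_factory=list)


Node = Union[str, Elem]


def text_content(node: Node) -> str:
    if isinstance(node, str):
        return node
    return "".join(text_content(c) for c in node.children)


def combine(nodes: List[Node]) -> Tuple[str, List[Tuple[int, int]]]:
    """Return the combined text plus the (start, end) span of each node in it."""
    parts, spans, pos = [], [], 0
    for n in nodes:
        t = text_content(n)
        parts.append(t)
        spans.append((pos, pos + len(t)))
        pos += len(t)
    return "".join(parts), spans


def slice_nodes(nodes: List[Node], spans, start: int, end: int) -> List[Node]:
    """Extract the mixed content covering [start, end) of the combined text.

    Text nodes are cut at the boundary; an element is kept whole if it lies
    fully inside the range, and a range that cuts through one is rejected.
    """
    out: List[Node] = []
    for n, (s, e) in zip(nodes, spans):
        if e <= start or s >= end:
            continue
        if isinstance(n, str):
            out.append(n[max(start, s) - s : min(end, e) - s])
        elif start <= s and e <= end:
            out.append(n)
        else:
            raise ValueError("range cuts through an element")
    return out


# Body found between an opening and closing string (hypothetical example).
body = ["foo ", Elem("em", ["bar"]), " baz|qux"]
combined, spans = combine(body)
m = re.fullmatch(r"(?P<link>[^|]+)\|(?P<text>.+)", combined)
link_nodes = slice_nodes(body, spans, *m.span("link"))
text_nodes = slice_nodes(body, spans, *m.span("text"))
```

Here `link_nodes` keeps the `<em>` intact between two text fragments, while `text_nodes` is just the trailing text. Raising on a range that splits an element is one possible policy; failing the match instead would work just as well.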
More detailed plan:
All shorthands consist of a literal-text segment (which must live entirely in one text node of the HTML) and one or more alternating body segments (which are mixed text and markup) and literal segments (which must live in a single text node with the same parent as the start, but might be a different text node after some child elements).
Implement shorthand-recognizers as classes:
A respond() method which takes the result of the last processing: a match result and possibly a DOM sequence (if the last processing was for a body segment).
respond() must return one of four results:
On the Bikeshed side, I first collect all the shorthand recognizers that are turned on, and grab their starting regexes.
Then do a tree-walk. On each element, starting with its first text node, execute all the regexes, and collect the one (or more, if tied) that match earliest in the text node.
For each matching regex, begin the match process.
While a match is working, call .respond() on the matcher with the regex result, do whatever processing it asks for next, call .respond() again with that result, and continue.
When you've exhausted all the text in an element, recurse into its children (which it may have more of now, due to successful matchers).
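The "collect the one (or more, if tied) that match earliest" step could be sketched like this. The function name `earliest_starts` and the sample regexes are hypothetical; the respond() protocol itself is left out because its result types aren't spelled out here.

```python
import re
from typing import Dict, List, Tuple


def earliest_starts(
    text: str,
    start_regexes: Dict[str, "re.Pattern[str]"],
    pos: int = 0,
) -> List[Tuple[str, "re.Match[str]"]]:
    """Run every starting regex from `pos` and keep the earliest match(es).

    Returns a list of (recognizer name, match); more than one entry means
    several recognizers tied for the earliest start position.
    """
    best: List[Tuple[str, "re.Match[str]"]] = []
    best_pos = None
    for name, rx in start_regexes.items():
        m = rx.search(text, pos)
        if m is None:
            continue
        if best_pos is None or m.start() < best_pos:
            best, best_pos = [(name, m)], m.start()
        elif m.start() == best_pos:
            best.append((name, m))
    return best


# Hypothetical starting regexes for three enabled recognizers.
STARTS = {
    "biblio": re.compile(r"\[\["),
    "section": re.compile(r"\[\[#"),
    "var": re.compile(r"\|"),
}
```

For the text `"a |x| and [[#foo]]"` this returns only the `var` match at offset 2; searching again from offset 5 returns `biblio` and `section` tied at offset 10, and each tied recognizer would then get its chance to run the match process.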
SUCCESS
5ffb7eb1a4b4260a0f0ab8d70a107693f9b8f827 currently acts the same way as it did before, but it has inactive code that, when turned on, correctly handles nesting in biblio shorthands.
Converting the rest of the shorthands is just a mechanical issue now, then I'll flip on the new code.
Right now I do text replacements by iterating over the strings in the document, looking for things to replace in each individual string one at a time. This sucks - it means that you can't put an element inside of a text shorthand, and nesting text shorthands only works if your nesting happens to match the arbitrary order I do the processing in (so the outer one is recognized first, then I try to match on the inner text and find the inner one). It's also based on some pretty simplistic regexes, while some features (like Markdown emphasis) have more complicated rules.
I think instead I need to do character-by-character analysis of an element's top-level text, looking for the start of a replacement. When I find one, it can try to find its end, either in the same text node or in later ones at the same level. If it fails, we back out and continue looking; if it succeeds, we create a new element accordingly and continue searching within the leftover text. Then, descend into children (some of which may have been created by this process).
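Roughly, for a single shorthand with a fixed opener and closer, that idea might look like the sketch below. It assumes a toy node model where a child is either a `str` or an `Elem`; `Elem`, `find_close`, and `wrap_shorthand` are illustrative names rather than existing Bikeshed functions, and real shorthands need more than plain string search.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class Elem:
    tag: str
    children: List["Node"] = field(default_factory=list)


Node = Union[str, Elem]


def find_close(children: List[Node], i: int, pos: int, closer: str) -> Optional[Tuple[int, int]]:
    """Look for `closer` in text node i from `pos`, then in later sibling text nodes."""
    for j in range(i, len(children)):
        node = children[j]
        if isinstance(node, str):
            k = node.find(closer, pos if j == i else 0)
            if k != -1:
                return j, k
    return None


def wrap_shorthand(parent: Elem, opener: str, closer: str, tag: str) -> None:
    """Wrap each opener...closer run in parent's top-level text, then descend."""
    i, pos = 0, 0
    while i < len(parent.children):
        node = parent.children[i]
        start = node.find(opener, pos) if isinstance(node, str) else -1
        if start == -1:
            i, pos = i + 1, 0  # nothing (more) starts in this node
            continue
        body_start = start + len(opener)
        end = find_close(parent.children, i, body_start, closer)
        if end is None:
            pos = body_start  # back out: opener stays plain text, keep looking
            continue
        j, k = end
        last = parent.children[j]
        if j == i:
            body = [node[body_start:k]]
        else:
            body = [node[body_start:]] + parent.children[i + 1 : j] + [last[:k]]
        new = Elem(tag, [b for b in body if b != ""])
        before, after = node[:start], last[k + len(closer) :]
        parent.children[i : j + 1] = [n for n in (before, new, after) if n != ""]
        # Resume in the leftover text that follows the new element.
        i, pos = i + (1 if before else 0) + 1, 0
    # Descend into children, including elements created above.
    for child in parent.children:
        if isinstance(child, Elem):
            wrap_shorthand(child, opener, closer, tag)
```

Given a paragraph whose children are `"see [[foo "`, an `<em>`, and `" baz]] and [[unclosed"`, this wraps `foo <em>…</em> baz` in one new element and leaves the unclosed `[[` as plain text.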