Switch inline text replacements to a better method

tabatkins commented 8 years ago

Right now I do text replacements by iterating over the strings in the document, looking for things to replace in each individual string one at a time. This sucks - it means that you can't put an element inside of a text shorthand, and nesting text shorthands only works if your nesting happens to match the arbitrary order I do the processing in (so the outer one is recognized first, then I try to match on the inner text and find the inner one). It's also based on some pretty simplistic regexes, while some features (like Markdown emphasis) have more complicated rules.

I think instead I need to do character-by-character analysis of an element's top-level text, looking for the start of a replacement. When I find one, I try to find its end, either in the same text node or in later ones at the same level. If that fails, I back out and continue looking; if it succeeds, I create a new element accordingly and continue searching within the leftover text. Then I descend into children (some of which may have been created by this process).

tabatkins commented 6 years ago

So high-level design:

1. I need to find the possible start of a shorthand in something's text content.
2. Then, searching in the same element's text nodes only, look for the possible end of the shorthand.
3. Then, collect all the text between the possible start and possible end, and verify that it's a valid internal syntax.
4. If it's valid, split it apart into chunks as appropriate. Maintain the markup you see in markup-allowing chunks (like the display text of most shorthands).
5. Then convert it to an element.

Step 1 should be searching over the possible-starts for all the turned-on shorthands at the same time; I cannot execute a full-document walk for every single one.

As I note in the OP, I don't need to do any crazy descent-tracking; I just look at a single element's text nodes at a time. When something gets recognized, it consumes some chunk of the element's text nodes and child elements, then I continue on after that point looking for more. Then when the element's text is fully consumed, I do a fresh walk over child elements (because the list might have changed) and recurse.

So I think for the simultaneous-design, I need to put together a parse-trie, so I can quickly and easily tell whether a character might start something. Need to build it dynamically based on the Markup Shorthands options, but it won't change over the course of a document.
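
A minimal sketch of what such a parse-trie could look like, assuming each enabled shorthand contributes one fixed opening string. The names `TrieNode`, `build_trie`, and `candidates_at` are hypothetical, and the opener table in the usage note is only illustrative, not Bikeshed's actual configuration.

```python
from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Names of shorthands whose opening string ends at this node.
        self.shorthands: List[str] = []


def build_trie(openers: Dict[str, str]) -> TrieNode:
    """Build a trie from {shorthand name: opening string}."""
    root = TrieNode()
    for name, opener in openers.items():
        node = root
        for ch in opener:
            node = node.children.setdefault(ch, TrieNode())
        node.shorthands.append(name)
    return root


def candidates_at(root: TrieNode, text: str, pos: int) -> List[str]:
    """Return every shorthand whose opening string matches text at pos."""
    found: List[str] = []
    node = root
    for ch in text[pos:]:
        nxt = node.children.get(ch)
        if nxt is None:
            break  # no opener continues with this character
        node = nxt
        found.extend(node.shorthands)
    return found
```

With `build_trie({"biblio": "[[", "section": "[[#", "idl": "{{"})`, a single lookup at one character position reports every shorthand that might start there, so the trie only needs to be built once per document from whichever shorthands are turned on.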

I'll need to separate out the current shorthands into an opening string, a closing string, and an internal regex.

The regex will match against the combined text content of the text and elements between the opening and closing string. I can then use the character offsets to properly chop them up and extract mixed text/element content when necessary.
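
As a rough illustration of the offset idea, the sketch below flattens a run of sibling nodes into one string, records where each node sits in that string, and then uses a regex group's span to cut the mixed content back out. `Elem` is a stand-in for a real DOM element, and `combined_text` and `slice_nodes` are hypothetical helper names, not existing Bikeshed functions.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Elem:
    tag: str
    text: str  # flattened text content of the element


Node = Union[str, Elem]


def text_of(node: Node) -> str:
    return node if isinstance(node, str) else node.text


def combined_text(nodes: List[Node]) -> Tuple[str, List[Tuple[int, int]]]:
    """Join the nodes' text and record each node's (start, end) offsets."""
    spans = []
    pos = 0
    for node in nodes:
        length = len(text_of(node))
        spans.append((pos, pos + length))
        pos += length
    return "".join(text_of(n) for n in nodes), spans


def slice_nodes(nodes: List[Node], start: int, end: int) -> List[Node]:
    """Return the mixed content covering [start, end) of the combined text.

    Text nodes are cut at the boundaries; an element must lie entirely
    inside the range, otherwise ValueError is raised.
    """
    _, spans = combined_text(nodes)
    out: List[Node] = []
    for node, (s, e) in zip(nodes, spans):
        if e <= start or s >= end:
            continue  # node lies entirely outside the requested range
        if isinstance(node, str):
            piece = node[max(start, s) - s : min(end, e) - s]
            if piece:
                out.append(piece)
        elif start <= s and e <= end:
            out.append(node)
        else:
            raise ValueError("offset falls inside an element")
    return out
```

For example, matching the made-up internal syntax `(?P<key>[A-Z]+)\|(?P<display>.+)` against the nodes `["FOO|see ", Elem("em", "this"), " spec"]` and passing `m.span("display")` to `slice_nodes` gives back the display content with the `em` element preserved.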

tabatkins commented 4 years ago

More detailed plan:

All shorthands consist of a starting literal-text segment (which must live entirely in one text node of the HTML), followed by zero or more alternating body segments (which are mixed text and markup) and literal segments (which must live in a single text node with the same parent as the start, but might be a different text node after some child elements).

Implement shorthand-recognizers as classes:

- each class advertises its starting literal-text segment as a regex hanging off the class
- it has a respond() method which takes the result of the last processing: a match result and possibly a DOM sequence (if the last processing was for a body segment)
- if it has at least one body segment, it tracks what "phase" it's in by itself
- respond() must return one of four results:
  - "ah, I realize I have more literal text to recognize", with a regex for the continued literal segment
  - "next segment is body text", with a regex for the next literal segment (possibly the ending segment, for most)
  - "I'm done", with a DOM sequence representing the completion
  - "Whoops I don't match after all"
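
One way the recognizer protocol above might be sketched. Everything here is an assumption made for illustration: the result classes `MoreLiteral`, `Body`, `Done`, and `Fail`, the class name `PipeLinkRecognizer`, the made-up `[[KEY]]` / `[[KEY|display]]` syntax, and the use of plain tuples in place of real DOM nodes.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class MoreLiteral:
    regex: re.Pattern  # continued literal segment, same text node


@dataclass
class Body:
    next_literal: re.Pattern  # literal segment that ends the body


@dataclass
class Done:
    nodes: List[object]  # sequence that replaces the consumed content


@dataclass
class Fail:
    pass


Response = Union[MoreLiteral, Body, Done, Fail]


class PipeLinkRecognizer:
    """Hypothetical shorthand: [[KEY]] or [[KEY|display markup]]."""

    # Starting literal segment, advertised on the class.
    start = re.compile(r"\[\[(?P<key>[\w-]+)")

    def __init__(self) -> None:
        self.phase = 0
        self.key = ""

    def respond(self, match: re.Match, body: Optional[list] = None) -> Response:
        if self.phase == 0:
            # Saw "[[KEY"; more literal text must follow immediately.
            self.key = match.group("key")
            self.phase = 1
            return MoreLiteral(re.compile(r"\]\]|\|"))
        if self.phase == 1:
            if match.group(0) == "]]":
                return Done([("a", self.key, [self.key])])
            # Saw "|"; everything up to the next "]]" is body content.
            self.phase = 2
            return Body(re.compile(r"\]\]"))
        if not body:
            return Fail()  # "[[KEY|]]" has no display content
        return Done([("a", self.key, list(body))])
```

The recognizer keeps its own phase counter, so the driver only has to look at the type of each returned value to decide what to search for next.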

On the Bikeshed side, I first collect all the shorthand recognizers that are turned on, and grab their starting regexes.

Then do a tree-walk. On each element, starting with its first text node, execute all the regexes, and collect the one (or more, if tied) that match earliest in the text node.

For each matching regex, begin the match process.

- if the match ends up failing, try the next one that tied
- if there are no more tied matches, throw away the text preceding and including the first character of the match, and start matching again.
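
The "earliest match, with ties" selection and the retry rule could be sketched as below, for a single text node. `earliest_matches` and `scan` are hypothetical names, and `try_match` stands in for the whole match process: it is assumed to return the end offset of a successful match, or `None` on failure.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple


def earliest_matches(
    starts: Dict[str, re.Pattern], text: str, pos: int = 0
) -> List[Tuple[str, re.Match]]:
    """Run every start regex from pos; return those tied for earliest start."""
    best: List[Tuple[str, re.Match]] = []
    best_start: Optional[int] = None
    for name, regex in starts.items():
        m = regex.search(text, pos)
        if m is None:
            continue
        if best_start is None or m.start() < best_start:
            best, best_start = [(name, m)], m.start()
        elif m.start() == best_start:
            best.append((name, m))
    return best


def scan(
    starts: Dict[str, re.Pattern],
    text: str,
    try_match: Callable[[str, re.Match], Optional[int]],
) -> List[Tuple[str, int, int]]:
    """Return (name, start, end) for each shorthand recognized in text."""
    hits: List[Tuple[str, int, int]] = []
    pos = 0
    while pos < len(text):
        tied = earliest_matches(starts, text, pos)
        if not tied:
            break
        for name, m in tied:
            end = try_match(name, m)
            if end is not None:
                hits.append((name, m.start(), end))
                # Resume after the match; always advance at least one char.
                pos = max(end, m.start() + 1)
                break
        else:
            # Every tied candidate failed: skip past the first character
            # of the attempted match and start looking again.
            pos = tied[0][1].start() + 1
    return hits
```

Candidates that tie are tried in the order the start regexes were registered, which is an arbitrary choice in this sketch.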

While a match is working:

- call .respond() on the matcher with the regex result.
- if it returns another literal segment, continue matching from the end of the previous match. Fail if it doesn't match anything in that text node.
- if it says it has a body segment, start trying to match its next regex in this text node, starting from the end of the previous match. If you reach the end of the text node, skip to the next text node sibling and continue. If you reach the end of the parent element, fail the match.
- otherwise you've found the next literal segment. Collect all the text and markup you've skipped over, then hand it and the regex result to .respond() and continue.
- when the match is finished, replace the text and markup you've consumed up to that point with the result. Then restart matching in the text node following it.
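
The body-segment step, searching for the next literal across sibling text nodes while collecting what was skipped, might look roughly like this. It assumes an element's children are available as a flat list of strings (text nodes) and `Elem` objects (child elements); `Elem` and `find_closing` are hypothetical names used only for this sketch.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class Elem:
    tag: str
    text: str  # flattened text content; never searched for literals


Node = Union[str, Elem]


def find_closing(
    siblings: List[Node], index: int, offset: int, closing: re.Pattern
) -> Optional[Tuple[List[Node], re.Match, int]]:
    """Look for `closing` starting at siblings[index][offset:].

    Later text-node siblings are searched in order; child elements are
    skipped over whole. Returns (body, match, index of the text node that
    holds the match), or None if the parent runs out first.
    """
    body: List[Node] = []
    i, start = index, offset
    while i < len(siblings):
        node = siblings[i]
        if isinstance(node, str):
            m = closing.search(node, start)
            if m is not None:
                if m.start() > start:
                    body.append(node[start:m.start()])
                return body, m, i
            if start < len(node):
                body.append(node[start:])  # rest of this text node is body
        else:
            body.append(node)  # markup inside the body is kept as-is
        i += 1
        start = 0
    return None  # reached the end of the parent element: fail the match
```

The returned body and match are what would be handed to `.respond()` next, and the returned index plus `match.end()` say where matching resumes afterwards.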

When you've exhausted all the text in an element, recurse into its children (which it may have more of now, due to successful matchers).

tabatkins commented 4 years ago

SUCCESS

5ffb7eb1a4b4260a0f0ab8d70a107693f9b8f827 currently acts the same way as it did before, but it has inactive code that, when turned on, correctly handles nesting in biblio shorthands.

Converting the rest of the shorthands is just a mechanical issue now, then I'll flip on the new code.

speced / bikeshed

Switch inline text replacements to a better method #832