speced / bikeshed

:bike: A preprocessor for anyone writing specifications that converts source files into actual specs.
https://speced.github.io/bikeshed
Creative Commons Zero v1.0 Universal

Switch inline text replacements to a better method #832

Open tabatkins opened 8 years ago

tabatkins commented 8 years ago

Right now I do text replacements by iterating over the strings in the document, looking for things to replace in each individual string one at a time. This sucks - it means that you can't put an element inside of a text shorthand, and nesting text shorthands only works if your nesting happens to match the arbitrary order I do the processing in (so the outer one is recognized first, then I try to match on the inner text and find the inner one). It's also based on some pretty simplistic regexes, while some features (like Markdown emphasis) have more complicated rules.

I think instead I need to do character-by-character analysis of an element's top-level text, looking for the start of a replacement. When I find one, it can try to find its end, either in the same text node or in later ones at the same level. If it fails, we back out and continue looking; if it succeeds, we create a new element accordingly and continue searching within the leftover text. Then, descend into children (some of which may have been created by this process).

tabatkins commented 6 years ago

So high-level design:

  1. I need to find the possible start of a shorthand in something's text content.
  2. Then, searching in the same element's text nodes only, look for the possible end of the shorthand.
  3. Then, collect all the text between the possible start and possible end, and verify that it's a valid internal syntax.
  4. If it's valid, split it apart into chunks as appropriate. Maintain the markup you see in markup-allowing chunks (like the display text of most shorthands).
  5. Then convert it to an element.

Step 1 should be searching over the possible-starts for all the turned-on shorthands at the same time; I cannot execute a full-document walk for every single one.

As I note in the OP, I don't need to do any crazy descent-tracking; I just look at a single element's text nodes at a time. When something gets recognized, it consumes some chunk of the element's text nodes and child elements, then I continue on after that point looking for more. Then, when the element's text is fully consumed, I do a fresh walk over the child elements (because the list might have changed) and recurse.
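A rough sketch of that single-element walk, for illustration only (this is not Bikeshed's code). It assumes a toy tree where an element's children are plain strings (text nodes) or `El` objects, and it handles just one shorthand with a fixed opener and closer; `El`, `find_closer`, and `convert` are made-up names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class El:
    """Toy element: children are text nodes (str) or nested El objects."""
    tag: str
    children: List[Union[str, "El"]] = field(default_factory=list)


def find_closer(children, i: int, offset: int, closer: str) -> Optional[Tuple[int, int]]:
    """Search text nodes children[i:] for `closer`, starting at `offset` in children[i].

    Child elements are skipped, so the closer must sit at the same level as the opener.
    """
    for j in range(i, len(children)):
        node = children[j]
        if isinstance(node, str):
            pos = node.find(closer, offset if j == i else 0)
            if pos != -1:
                return j, pos
    return None


def convert(el: El, opener: str, closer: str, new_tag: str) -> None:
    i = 0
    while i < len(el.children):
        node = el.children[i]
        hit = None
        if isinstance(node, str):
            start = node.find(opener)
            if start != -1:
                hit = find_closer(el.children, i, start + len(opener), closer)
        if hit is None:
            i += 1  # no complete shorthand starts in this node; keep looking
            continue
        j, end = hit
        before = node[:start]
        if j == i:
            body = [node[start + len(opener):end]]
        else:
            # The shorthand swallows the intervening child elements whole.
            body = [node[start + len(opener):], *el.children[i + 1:j], el.children[j][:end]]
        after = el.children[j][end + len(closer):]
        new_el = El(new_tag, [c for c in body if c != ""])
        repl = ([before] if before else []) + [new_el, after]
        el.children[i:j + 1] = repl
        i += len(repl) - 1  # resume scanning in the leftover text
    el.children = [c for c in el.children if c != ""]
    # Fresh walk over the children, which may now include new elements.
    for child in el.children:
        if isinstance(child, El):
            convert(child, opener, closer, new_tag)
```

Running `convert(p, "[[", "]]", "a")` on a paragraph whose children are `"See [[FOO "`, an `em` element, and `" baz]] and [[QUX]]."` wraps both shorthands in `a` elements, with the `em` kept intact inside the first one.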

So I think for the simultaneous-design, I need to put together a parse-trie, so I can quickly and easily tell whether a character might start something. Need to build it dynamically based on the Markup Shorthands options, but it won't change over the course of a document.
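One way such a trie could look, as a minimal sketch: it is built once from whichever openers are enabled, then queried at each character position. The class and function names and the opener table in the usage note are illustrative assumptions, not Bikeshed's real configuration.

```python
from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.names: List[str] = []  # shorthands whose opener ends at this node


def build_trie(openers: Dict[str, str]) -> TrieNode:
    """Build a trie from a {shorthand name: opening string} mapping."""
    root = TrieNode()
    for name, opener in openers.items():
        node = root
        for ch in opener:
            node = node.children.setdefault(ch, TrieNode())
        node.names.append(name)
    return root


def openers_at(root: TrieNode, text: str, i: int) -> List[str]:
    """Names of every shorthand whose opener matches `text` starting at index i."""
    found: List[str] = []
    node = root
    while i < len(text) and text[i] in node.children:
        node = node.children[text[i]]
        found.extend(node.names)
        i += 1
    return found
```

With a hypothetical table such as `{"biblio": "[[", "dfn": "[=", "idl": "{{"}`, a character that is not a key of the root's `children` can be skipped immediately, which is the cheap "might this start something?" check.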

I'll need to separate out the current shorthands into an opening string, a closing string, and an internal regex.

The regex will match against the combined text content of the text and elements between the opening and closing string. I can then use the character offsets to properly chop them up and extract mixed text/element content when necessary.
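The offset bookkeeping could be sketched like this. It assumes a toy model where the content between opener and closer is a list of strings and `Elem` objects, and that a regex group never starts or ends in the middle of an element; `Elem`, `flatten`, and `slice_nodes` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Elem:
    tag: str
    text: str  # the element's flattened text content


Node = Union[str, Elem]
Span = Tuple[int, int, Node]


def flatten(nodes: List[Node]) -> Tuple[str, List[Span]]:
    """Join all text content, remembering which offsets each node covers."""
    parts: List[str] = []
    spans: List[Span] = []
    pos = 0
    for node in nodes:
        s = node if isinstance(node, str) else node.text
        parts.append(s)
        spans.append((pos, pos + len(s), node))
        pos += len(s)
    return "".join(parts), spans


def slice_nodes(spans: List[Span], start: int, end: int) -> List[Node]:
    """Recover the mixed text/element content covering offsets [start, end)."""
    out: List[Node] = []
    for s, e, node in spans:
        if e <= start or s >= end:
            continue  # node lies entirely outside the requested range
        if isinstance(node, str):
            piece = node[max(start, s) - s:min(end, e) - s]
            if piece:
                out.append(piece)
        else:
            out.append(node)  # elements are kept whole (assumption stated above)
    return out
```

For example, with a made-up `key|display` body whose nodes are `"FOO|see "`, an `em` element containing `this`, and `" thing"`, matching `(?P<key>[\w-]+)\|(?P<display>.+)` against the flattened string and passing `m.span("display")` to `slice_nodes` returns `"see "`, the `em` element, and `" thing"`, so the markup in the display text survives.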

tabatkins commented 4 years ago

More detailed plan:


All shorthands consist of an opening literal-text segment (which must live entirely in one text node of the HTML), followed by one or more alternating body segments (which are mixed text and markup) and literal segments (each of which must live in a single text node with the same parent as the start, though it might be a different text node after some child elements).

Implement shorthand-recognizers as classes:


On the Bikeshed side, I first collect all the shorthand recognizers that are turned on, and grab their starting regexes.

Then do a tree-walk. On each element, starting with its first text node, execute all the regexes, and collect the one (or more, if tied) that match earliest in the text node.
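The "earliest match, keeping ties" step might look something like the following sketch. The function name and the `starts` mapping of recognizer names to compiled starting regexes are assumptions made for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple


def earliest_matches(
    text: str, starts: Dict[str, re.Pattern], pos: int = 0
) -> List[Tuple[str, re.Match]]:
    """Run every starting regex and keep the match(es) that begin earliest."""
    best: List[Tuple[str, re.Match]] = []
    best_start: Optional[int] = None
    for name, rx in starts.items():
        m = rx.search(text, pos)
        if m is None:
            continue
        if best_start is None or m.start() < best_start:
            # Strictly earlier than anything seen so far: start a new list.
            best_start = m.start()
            best = [(name, m)]
        elif m.start() == best_start:
            # Tied for earliest: keep it as another candidate.
            best.append((name, m))
    return best
```

Ties come back in the iteration order of `starts`, so the caller decides which candidate to try first. Passing `pos` lets the scan resume after a consumed chunk without slicing the string.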

For each matching regex, begin the match process.

While a match is working:

When you've exhausted all the text in an element, recurse into its children (which it may have more of now, due to successful matchers).

tabatkins commented 4 years ago

SUCCESS

5ffb7eb1a4b4260a0f0ab8d70a107693f9b8f827 currently acts the same way as it did before, but it has inactive code that, when turned on, correctly handles nesting in biblio shorthands.

Converting the rest of the shorthands is just a mechanical issue now, then I'll flip on the new code.