Alternative approach to nested parsing?

ashtonsix commented 2 years ago

Hi, I noticed this HTML snippet:

```html
<script>var x = '</script>'; var y = 'blah';</script>
```

Produces this parse tree:

Actual Parse Tree

```txt
Document(
  Element(
    OpenTag(
      StartTag("<")
      TagName("script")
      EndTag(">")
    )
    ScriptText("var x = '")
    CloseTag(
      StartCloseTag("</")
      TagName("script")
      EndTag(">")
    )
  )
  Text("'; var y = 'blah';")
  MismatchedCloseTag(
    StartCloseTag("</")
    TagName("script")
    EndTag(">")
  )
)
```

When I would have instead expected a tree like this:

Expected Parse Tree

```txt
Document(
  Element(
    OpenTag(
      StartTag("<")
      TagName("script")
      EndTag(">")
    )
    ScriptText("var x = '</script>'; var y = 'blah';")
    CloseTag(
      StartCloseTag("</")
      TagName("script")
      EndTag(">")
    )
  )
)
```

What about starting a new inner parse inside tokens.js:contentTokenizer, and then shifting a content token to the outer HTML parser when the inner parser cannot shift anything AND the outer parser can shift? We would then save the result of the inner parse and mount it once the outer parse completes.
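
To make the idea concrete, here is a rough toy sketch. It is not lezer's API: `innerToken` and `scanScriptContent` are hypothetical names, the "inner parser" is just a scanner that understands quoted strings, and checking for the close tag only at inner token boundaries is a simplification of the "inner parser cannot shift" condition.

```javascript
// Toy sketch only; every name here is hypothetical and unrelated to lezer's API.

// Inner scanner: consume one JavaScript-like token starting at `pos` and
// return the position just after it. Only quoted strings are treated as
// multi-character tokens; everything else is consumed one character at a time.
function innerToken(src, pos) {
  const ch = src[pos];
  if (ch === "'" || ch === '"') {
    let i = pos + 1;
    while (i < src.length && src[i] !== ch) {
      i += src[i] === "\\" ? 2 : 1; // skip escaped characters
    }
    return Math.min(i + 1, src.length); // include the closing quote
  }
  return pos + 1;
}

// Outer scanner: the close tag is only recognized at an inner token boundary,
// so a "</script>" inside a string literal does not end the content.
function scanScriptContent(src, from) {
  const closeTag = /<\/script(?=[\s/>])/iy; // sticky: match exactly at lastIndex
  let pos = from;
  while (pos < src.length) {
    closeTag.lastIndex = pos;
    if (closeTag.test(src)) break; // outer parser can take over here
    pos = innerToken(src, pos);
  }
  return { from, to: pos }; // range on which the inner tree would be mounted later
}

const nestedInput = "<script>var x = '</script>'; var y = 'blah';</script>";
const contentRange = scanScriptContent(nestedInput, "<script>".length);
const nestedScriptText = nestedInput.slice(contentRange.from, contentRange.to);
console.log(nestedScriptText);
```

With this scanner the whole script body ends up in one content range, which is what the expected tree above shows.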

Could this be a good way to approach mixed-language parsing in general?

I have some ideas around context-sensitive languages and modular parsers I think would be neat to explore with this approach, but am not 100% sure they have legs yet.

marijnh commented 2 years ago

> When I would have instead expected a tree like this:

Have you tried this in a browser? Because I'm pretty sure the way browsers parse documents like this corresponds to what you are labeling the 'bad' parse tree.
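
For illustration, a minimal sketch of that behavior. This is a simplification, not browser or lezer code: `findScriptEnd` is a made-up helper, and it ignores the comment-like escape states the HTML tokenizer also has. The point is that the scan for the end of the script text knows nothing about JavaScript string literals.

```javascript
// Simplified illustration; `findScriptEnd` is a hypothetical helper.
// The raw text of a script element ends at the first "</script" that is
// followed by whitespace, "/" or ">", matched case-insensitively.
function findScriptEnd(html, from) {
  const closeTag = /<\/script(?=[\s/>])/gi;
  closeTag.lastIndex = from; // start searching after the open tag
  const match = closeTag.exec(html);
  return match ? match.index : html.length;
}

const browserInput = "<script>var x = '</script>'; var y = 'blah';</script>";
const textStart = "<script>".length;
const textEnd = findScriptEnd(browserInput, textStart);

// Script content as this scan sees it, and what is left after the close tag.
const browserScriptText = browserInput.slice(textStart, textEnd);
const remainder = browserInput.slice(textEnd + "</script>".length);
console.log(JSON.stringify(browserScriptText)); // "var x = '"
console.log(JSON.stringify(remainder));         // "'; var y = 'blah';</script>"
```

The remainder is ordinary text followed by a stray close tag, matching the `Text` and `MismatchedCloseTag` nodes in the actual parse tree.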

ashtonsix commented 2 years ago

Ah, I just tried it in a browser and you're totally right. My apologies for taking your time with this.

lezer-parser / html

Alternative approach to nested parsing? #2