tree-sitter / tree-sitter-html

HTML grammar for Tree-sitter
MIT License
136 stars 72 forks source link

Add support for parsing entities (with tests) #50

Closed savetheclocktower closed 1 year ago

savetheclocktower commented 1 year ago

#45 is just what I needed, so I wrote some tests to get it across the finish line. While doing so, I noticed that hexadecimal entities weren't correctly parsed if they contained an A-F digit, so I wrote a clumsy version of isxdigit to get around tree-sitter#949. The implementation no longer needs an external rule.

This would close #10 at least partially — it depends on whether we think entities should also be recognized within attribute values. I think that would be useful, but it's a non-trivial addition, so I didn't want it to hold up the good work done in #45.

maxbrunsfeld commented 1 year ago

Thanks for the PR! Could you explain why entities need to be parsed via the external scanner, instead of just a regex in the grammar? It doesn't seem like they would require lookahead or state, but I might be missing something.

savetheclocktower commented 1 year ago

Could you explain why entities need to be parsed via the external scanner, instead of just a regex in the grammar?

I can't, because my goal was just to adapt the work of #45, but it's a fair question. Maybe @logancollins remembers the rationale. If I get time tomorrow I'll see if I can get the tests passing without the external rule.

savetheclocktower commented 1 year ago

@maxbrunsfeld, let me run this by you and see if it seems sound.

Instead of the external, here are two options that seem to be working:

entity: $ => /&(#(x[0-9a-fA-F]+|[0-9]+)|[A-Za-z]+);/,
entity: $ => seq(
  '&',
  choice(
    /[A-Za-z]+/,      // named entity
    /#[0-9]+/,        // decimal entity
    /#x[a-fA-F0-9]+/  // hexadecimal entity
  ),
  ';'
),

The second option seems more tree-sitter-esque, and I think it behaves a bit better in an error state. But in the following example…

<p>Lorem ipsum &#xA00f; dolor &#0z0; sit amet &nbsp;.</p>

…the invalidity of the middle entity seems to consume the text that comes after it:

fragment [0, 0] - [1, 0]
  element [0, 0] - [0, 57]
    start_tag [0, 0] - [0, 3]
      tag_name [0, 1] - [0, 2]
    text [0, 3] - [0, 14]
    entity [0, 15] - [0, 23]
    text [0, 24] - [0, 29]
    ERROR [0, 30] - [0, 45]
      ERROR [0, 30] - [0, 45]
    entity [0, 46] - [0, 52]
    text [0, 52] - [0, 53]
    end_tag [0, 53] - [0, 57]
      tag_name [0, 55] - [0, 56]

This happens with both the externals approach and my preferred approach above. Not a deal-breaker, but if there’s an easy way to get it to recover more quickly and minimize the size of the error node, I’d much prefer it.

I also haven’t found good ways to put reasonable limits on how long an entity can get (&#214; is valid; &#200000002314; surely isn’t) — quantifiers like /#[0-9]{1,4}/ don’t seem to have an effect in the regexes above. Again, if there’s an obvious fix I’m not seeing, then do let me know. Otherwise I’ll open a new PR without the external and it can get refined by future contributors after it lands.

maxbrunsfeld commented 1 year ago

I'm fine with either option, but with the second solution, you should put a token() around the whole rule, so that it is all parsed as one token, as opposed to the & and ; be parsed as separate tokens from the numeric code. I think regex quantifiers are implemented, so you should be able to say {1,4}, but maybe there's some bug in the handling of those that I'm not aware of.

savetheclocktower commented 1 year ago

OK, went with the single-regex option after all. Whatever I was seeing last night isn't happening today — the quantifiers worked just fine, too.

logancollins commented 1 year ago

Could you explain why entities need to be parsed via the external scanner, instead of just a regex in the grammar?

If I remember correctly, this was mostly because of how much of the HTML grammar was already implemented in the scanner. It seemed to be so for speed, so I assumed it was a good place to do it to keep that consistency. I don't really mind either way where it ends up implemented.

Also, we've since added the full hexadecimal support to our fork you mentioned, but it hadn't yet made it to our GitHub PR here. Either way, happy to see this!

maxbrunsfeld commented 1 year ago

Oh, thanks for the explanation @logancollins. It looks like as written, this regex supports hexadecimal entities. Are we still missing anything there?

logancollins commented 1 year ago

Yep! Sorry, I was referring to my original submission, which didn't. I think the only thing that might be missing from the regex implementation is support for X vs. x for hex, as IIRC the HTML spec supports both, unless the regex is case-insensitive here.

maxbrunsfeld commented 1 year ago

It looks like the regex includes the uppercase X; it's different than in @savetheclocktower's original comment.

logancollins commented 1 year ago

Aha! My mistake, yeah.

maxbrunsfeld commented 1 year ago

Thank you both for this PR!