Open savetheclocktower opened 1 year ago
lezer-parser-html does this already
<a t="a&b">a&b</a>
node 15 = Document: '<a t="a&b">a&b</a>\n'
  node 20 = Element: '<a t="a&b">a&b</a>'
    node 36 = OpenTag: '<a t="a&b">'
      node 6 = StartTag: '<'
      node 22 = TagName: 'a'
      node 23 = Attribute: 't="a&b"'
        node 24 = AttributeName: 't'
        node 25 = Is: '='
        node 26 = AttributeValue: '"a&b"'
          node 17 = EntityReference: '&'
      node 4 = EndTag: '>'
    node 16 = Text: 'a'
    node 17 = EntityReference: '&'
    node 16 = Text: 'b'
    node 37 = CloseTag: '</a>'
      node 11 = StartCloseTag: '</'
      node 22 = TagName: 'a'
      node 4 = EndTag: '>'
  node 16 = Text: '\n'
lezer-parser-html parses the `&` in `<a href="?a=1&b=2">` as `InvalidEntity`.
Ideally there should be two tokens, `EntityReference` and `EntityReferenceInAttributeValue`, so that in a semantic stage I can ignore only `EntityReferenceInAttributeValue`.
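As a rough illustration of that request, the sketch below shows how a semantic stage could skip entities that occur in attribute values if the parser emitted two distinct token types. The token stream is hand-written for `<a t="a&amp;b">a&amp;b</a>`, `EntityReferenceInAttributeValue` is only the name proposed above (not an existing node type), and `entities_to_process` is a hypothetical helper.

```python
# Hand-written, hypothetical token stream for: <a t="a&amp;b">a&amp;b</a>
# "EntityReferenceInAttributeValue" is the proposed name, not a real node type.
tokens = [
    ("AttributeValue", '"a&amp;b"'),
    ("EntityReferenceInAttributeValue", "&amp;"),
    ("Text", "a"),
    ("EntityReference", "&amp;"),
    ("Text", "b"),
]

# Token types the semantic stage chooses to ignore.
IGNORED = {"EntityReferenceInAttributeValue"}


def entities_to_process(tokens):
    """Return entity tokens, skipping the ones found in attribute values."""
    return [
        (kind, text)
        for kind, text in tokens
        if kind.startswith("EntityReference") and kind not in IGNORED
    ]


print(entities_to_process(tokens))
```

With a single shared `EntityReference` type, the same filter would have to inspect each token's parent instead of its type.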
Support for HTML entities was requested in #10, and was mostly addressed in #50, but I think it's reasonable to want entities to be recognized inside of attribute values as well.
This is a trickier request because `attribute_value` is currently a simple node that does not have any children and was not designed to be broken up by tokens with special meanings. An entity is roughly equivalent to an `escape_sequence` node in other tree-sitter parsers, but those parsers tend to represent a string's contents as a series of `string_content` and `escape_sequence` nodes.

So the most intuitive solution might be to introduce a
`string_content` node (or `attribute_value_content` or something), and make it so that `attribute_value`'s children are some combination of `string_content` and `entity` nodes. By and large I think it wouldn't disrupt existing consumers of `tree-sitter-html`.

The only exception I can think of is injections: since
`injection.include-children` is `false` by default, anyone injecting into `attribute_value` nodes would no longer see any content inside them until they change that setting.

Another option would be to do something like what
`tree-sitter-javascript` does for template strings: make it so that `attribute_value` can contain `entity` nodes, but don't represent the non-entity text content of `attribute_value` with any sort of node. In this scenario, injections into `attribute_value` would at least still see all the non-entity content of the value when `include-children` is `false`. This might be more surprising behavior because it runs contrary to how we handle entities in tag contents (`entity` nodes break up `text` nodes), but folks might feel it's less disruptive.
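
To make the injection difference between the two options concrete, here is a small model of what an injection would see when `include-children` is `false`. It assumes the injected text is the node's range minus the ranges of its children. It does not call tree-sitter, and the byte ranges are hand-written for the example value `?a=1&amp;b=2`. `visible_ranges` and `visible_text` are hypothetical helpers.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end) offsets into the source


def visible_ranges(node: Range, children: List[Range]) -> List[Range]:
    """Parts of `node` not covered by any child range."""
    start, end = node
    gaps = []
    cursor = start
    for child_start, child_end in sorted(children):
        if child_start > cursor:
            gaps.append((cursor, child_start))
        cursor = max(cursor, child_end)
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


def visible_text(source: str, node: Range, children: List[Range]) -> List[str]:
    """Text an injection would see with include-children set to false."""
    return [source[s:e] for s, e in visible_ranges(node, children)]


source = "?a=1&amp;b=2"            # contents of a hypothetical attribute_value
attribute_value = (0, len(source))

# Option 1: children are string_content, entity, string_content.
option_1_children = [(0, 4), (4, 9), (9, 12)]

# Option 2: the entity is the only child; plain text has no node of its own.
option_2_children = [(4, 9)]

print(visible_text(source, attribute_value, option_1_children))
print(visible_text(source, attribute_value, option_2_children))
```

Under this model the first option leaves nothing for the injection, because every byte belongs to some child. The second still exposes the text on either side of the entity.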