isaacs / sax-js

A sax style parser for JS
Other
1.09k stars 325 forks source link

fix: ignore unparsed entity logic for predefined xml entities #266

Closed SethFalco closed 3 months ago

SethFalco commented 8 months ago

There was a bug in the implementation for resolving entities. Predefined XML entities like <, &, and " should not be interpreted, as this breaks parsing of the document.

For example, see the following XML document (an SVG) and render in the browser:

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" [
<!ENTITY entity_reference "<text x='4' y='64'>entity reference</text>">
<!ENTITY escaped_entity_reference "&lt;text&gt;escaped entity reference&lt;/text&gt;">
]>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 300 128">
  <text x="4" y="32">
    &lt;text&gt;escaped literal&lt;/text&gt;
  </text>
  &entity_reference;
  <text x="4" y="96">
    &escaped_entity_reference;
  </text>
</svg>

We should resolve custom entities like entity_reference and escaped_entity_reference. However, not &amp, &lt;, or &#34;, otherwise they get interpreted by the parser instead.

Currently, when parsing this XML document in sax v1.3.0 we get this error:

Error: Invalid character in tag name
Line: 6
Column: 14
Char: &
    at error (/home/seth/Documents/repos/maintainer/svg/svgo/node_modules/sax/lib/sax.js:652:10)
    at strictFail (/home/seth/Documents/repos/maintainer/svg/svgo/node_modules/sax/lib/sax.js:678:7)
    at SAXParser.write (/home/seth/Documents/repos/maintainer/svg/svgo/node_modules/sax/lib/sax.js:1284:17)

Related


Just an aside, if you'd like some help with maintaining sax, I'd love to lend a hand. We've been using this in SVGO, and since joining as a maintainer I've been meaning to move away from our (unmaintained) fork back to upstream. No pressure, and I'll see if I can open other pull requests in time.

If you'd accept it, I think it would be very valuable to have a task in CI that attempts to parse the XML Conformance Test Suites just to ensure no errors are thrown. That also would've caught this bug. ^-^'

https://www.w3.org/XML/Test/

Update

Since opening this pull request, I threw together a script for regression testing.

- name: Regression
  run: |
    wget https://www.w3.org/XML/Test/xmlts20130923.tar.gz
    tar -tf xmlts20130923.tar.gz | grep -E '^xmlconf/xmltest/valid/sa/[0-9]+\.xml$' > filter.txt
    tar -xf xmlts20130923.tar.gz -T filter.txt
    FAILURES=0
    for FILE in $(grep -REL xmlconf/ -e '<!ENTITY\s+(\S+)\s+"')
    do
      if [ $(file -bi $FILE | sed -e "s/.* charset=//") == 'utf-16le' ]; then echo "$(iconv -f utf-16le -t utf-8 $FILE)" > $FILE; fi
      node -e "require('./lib/sax').parser(true).write(require('fs').readFileSync('$FILE', { encoding: 'utf8' })).close();" \
        || echo "$FILE failed" && ((FAILURES+=1))
      node -e "require('./lib/sax').parser(true, { unparsedEntities: true }).write(require('fs').readFileSync('$FILE', { encoding: 'utf8' })).close();" \
        || echo "$FILE failed with { unparsedEntities: true }" && ((FAILURES=+1))
    done
    if [ FAILURES != 0 ]; then exit 1; fi

This actually helped me identify a problem with my initial solution. I didn't account for the fact that &quot; isn't the only way to insert quotes. One can do &#34; instead, so now I'm comparing against Object.values(XML_ENTITIES) rather than checking if the entity string exists as a property of XML_ENTITIES.

SethFalco commented 5 months ago

@isaacs Sorry to ping, but I was wondering if you had time to review my three pull requests to this repo? I'd be happy to sponsor $100 via GitHub Sponsors if we can get all three reviewed and hopefully merged/released?

These would all help close issues in SVGO, which depends on sax-js.