nim-lang / Nim

parsexml.nim reads wild html incorrectly #23050

Open veksha opened 10 months ago

veksha commented 10 months ago

Description

I use htmlparser, which relies on parsexml under the hood. I need to parse this wild HTML: <a href="&" class="CCC">TTT</a>. In other words, the href attribute may contain a URL with a bare ampersand (a literal character inside the URL, not an entity like &amp;).

But after parsing I get: <a href="&amp;">class=&quot;CCC&quot;&gt;TTT</a>. Notice that the class attribute is no longer an attribute; it has become part of the element's inner text.

What happens: parsexml.nim sets the state to stateAttr to parse attributes, then reports the error (1, 12) Error: ';' expected because it cannot parse the entity. It then sets the state to stateError and afterwards to stateNormal. The next attribute is not parsed as an attribute, because the state is no longer stateAttr!
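To watch the state change from the outside, one option is to drive parsexml directly and list the events it emits. This is only a diagnostic sketch: it assumes the parser's default options (htmlparser passes its own option set), the name `events` is made up for illustration, and the exact event sequence may differ between Nim versions.

```nim
import std/[parsexml, streams]

# Diagnostic sketch: collect every event parsexml reports for the input
# from this issue. `events` is an illustrative name, not a library API.
var events: seq[string]

var x: XmlParser
open(x, newStringStream("""<a href="&" class="CCC">TTT</a>"""), "input")
while true:
  x.next()
  case x.kind
  of xmlEof:
    break
  of xmlError:
    events.add "error: " & x.errorMsg()
  of xmlElementOpen:
    events.add "open: " & x.elementName
  of xmlAttribute:
    events.add "attr: " & x.attrKey & "=" & x.attrValue
  of xmlCharData:
    events.add "text: " & x.charData
  else:
    # Any other event kind is recorded by its enum name only.
    events.add $x.kind
x.close()

for e in events:
  echo e
```

If the description above is right, the list should show an error event after the href attribute and then character data where the class attribute was expected.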

Minimal example:

import pkg/htmlparser, xmltree, streams
var errors: seq[string]
let node = newStringStream("""<a href="&" class="CCC">TTT</a>""").parseHtml("",errors)
echo node
echo errors

Nim Version

Nim Compiler Version 2.1.1 [Windows: amd64] Compiled at 2023-11-19 Copyright (c) 2006-2023 by Andreas Rumpf

git hash: cecaf9c56b1240a44a4de837e03694a0c55ec379 active boot switches: -d:release

Current Output

<a href="&amp;">class=&quot;CCC&quot;&gt;TTT</a>
@["(1, 12) Error: \';\' expected"]

Expected Output

<a href="&" class="CCC">TTT</a>
@[]

or maybe

<a href="&amp;" class="CCC">TTT</a>
@[]

Possible Solution

No response

Additional Information

No response

metagn commented 10 months ago

XML (and HTML) does not seem to allow raw & in attribute strings (consider &quot;), but maybe we could adjust the failing condition to leave the & verbatim.

Edit: This might need significant lookahead.
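As a rough illustration of the lookahead idea, the sketch below checks whether a '&' is followed by something shaped like an entity reference before treating it as one. It is not taken from parsexml: `looksLikeEntity` and `maxEntityLen` are hypothetical names, the length cap is an arbitrary assumption, and the character classes are simplified (hex digits are not validated separately).

```nim
# Hypothetical helper, not part of parsexml: bounded lookahead that decides
# whether the '&' at s[pos] starts something shaped like &name; or &#123;.
const maxEntityLen = 32  # assumed cap on lookahead, not taken from any spec

proc looksLikeEntity(s: string, pos: int): bool =
  if pos >= s.len or s[pos] != '&':
    return false
  var i = pos + 1
  # Optional numeric prefix: "#" or "#x" / "#X".
  if i < s.len and s[i] == '#':
    inc i
    if i < s.len and s[i] in {'x', 'X'}:
      inc i
  let start = i
  # Simplification: accept any ASCII letter or digit in the entity body.
  while i < s.len and i - pos <= maxEntityLen and
      s[i] in {'a'..'z', 'A'..'Z', '0'..'9'}:
    inc i
  # An entity needs a non-empty body followed directly by ';'.
  result = i > start and i < s.len and s[i] == ';'

let attr = """<a href="&" class="CCC">TTT</a>"""
echo looksLikeEntity(attr, attr.find('&'))
echo looksLikeEntity("&amp;", 0)
```

A parser using a check like this could copy the '&' through verbatim when it returns false instead of switching to the error state, at the cost of scanning ahead up to the cap for each ampersand.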