Closed michal-h21 closed 7 years ago
what version do you use? AFAIRC, I've already fixed that
I manually installed the master
branch from Github. This is an mwe
showing the issue:
local htmlparser = require("htmlparser")
local htmlstring = [[
<p>
<a id="unclosed>Element"> with unclosed attribute</a>
</p>
]]
local root = htmlparser.parse(htmlstring)
It produces the following output on my machine:
[HTMLParser] [ERR] tag parsing loop reached loop limit (1000). Please, consider increasing it or check the code for errors
Well, I fixed /similar/ case (with templaters), which used <%
/%>
.
Actually, identifiers, containing '>' is not valid in HTML, AFAIK.
Anyway, I'll try to do something with that case too :-/
can you check your workflow with the latest commit in master?
Sorry for the late reply and thanks for a quick fix.
I think that quoted attribute can contain >
value. The HTML5 syntax lists it as prohibited only for unquoted attribute values. I understand that this makes it much harder to correctly parse attributes.
The latest commit works for the provided sample, but I found that it still may fail in little bit different cases:
local htmlparser = require("htmlparser")
local htmlstring = [[
<p>
<a id="unclosed>Element"> with unclosed attribute</a>
<a id="unclosed> Element"> with unclosed attribute</a>
<a id="unclosed>Element and one>more"> with unclosed attribute</a>
</p>
]]
local root = htmlparser.parse(htmlstring)
local a = root:select("a")
for _, x in ipairs(a) do
print("title", x.attributes["id"])
end
[HTMLParser] [ERR] tag parsing loop reached loop limit (10). Please, consider increasing it or check the code for errors[HTMLParser] [ERR] tag parsing loop reached loop limit (10). Please, consider increasing it or check the code for errors title unclosed>Element title title
As you can see, it fails when >
is followed by a space, or when the attribute contains more than one >
. It also seems that example with space after >
works with 2e2f306, but not with 6f4c5a1.
I am working on script which tries to extract various metadata from web pages and it happens quite often that it reports the following error:
I've found that it happens when attributes contain
>
character, like in this example from MDN:This issue can happen also when an attribute contains some JS code:
It seems the issue is that HTML tag parsing stops on the first
>
, which then result in html attribute without end quote. The attribute parsing code then stuck in an infinite loop until loop limit is reached.