Greater than characters in attribute values causes tag parsing loop limit error

michal-h21 commented 7 years ago

I am working on script which tries to extract various metadata from web pages and it happens quite often that it reports the following error:

HTMLParser] [ERR] tag parsing loop reached loop limit (1000). Please, consider increasing it or check the code for errors

I've found that it happens when attributes contain > character, like in this example from MDN:

<a title='The &lt;pattern> element defines a graphics object which can be redrawn at repeated x and y-coordinate intervals ("tiled") to cover an area.' href="/en-US/docs/Web/SVG/Element/pattern"><code>&lt;pattern&gt;</code></a>

This issue can happen also when an attribute contains some JS code:

<a href="#" onmousedown="if(!document.querySelector) return; var link = document.querySelector('link[rel=canonical]'); if(link) this.href = link.href + (link.href.indexOf('?') > -1 ? '&' : '?') + 'setver=touch';" class="f-ico" rel="nofollow">XXX</a>

It seems the issue is that HTML tag parsing stops on the first >, which then result in html attribute without end quote. The attribute parsing code then stuck in an infinite loop until loop limit is reached.

msva commented 7 years ago

what version do you use? AFAIRC, I've already fixed that

michal-h21 commented 7 years ago

I manually installed the master branch from Github. This is an mwe showing the issue:

local htmlparser = require("htmlparser")

local htmlstring = [[
<p>
<a id="unclosed>Element"> with unclosed attribute</a>
</p>
]]
local root = htmlparser.parse(htmlstring)

It produces the following output on my machine:

[HTMLParser] [ERR] tag parsing loop reached loop limit (1000). Please, consider increasing it or check the code for errors

msva commented 7 years ago

Well, I fixed /similar/ case (with templaters), which used <%/%>. Actually, identifiers, containing '>' is not valid in HTML, AFAIK.

Anyway, I'll try to do something with that case too :-/

msva commented 7 years ago

can you check your workflow with the latest commit in master?

michal-h21 commented 7 years ago

Sorry for the late reply and thanks for a quick fix.

I think that quoted attribute can contain > value. The HTML5 syntax lists it as prohibited only for unquoted attribute values. I understand that this makes it much harder to correctly parse attributes.

The latest commit works for the provided sample, but I found that it still may fail in little bit different cases:

local htmlparser = require("htmlparser")
local htmlstring = [[
<p>
<a id="unclosed>Element"> with unclosed attribute</a>
<a id="unclosed> Element"> with unclosed attribute</a>
<a id="unclosed>Element and one>more"> with unclosed attribute</a>
</p>
]]
local root = htmlparser.parse(htmlstring)
local a = root:select("a")
for _, x in ipairs(a) do
  print("title",  x.attributes["id"])
end

[HTMLParser] [ERR] tag parsing loop reached loop limit (10). Please, consider increasing it or check the code for errors[HTMLParser] [ERR] tag parsing loop reached loop limit (10). Please, consider increasing it or check the code for errors title unclosed>Element title title

As you can see, it fails when > is followed by a space, or when the attribute contains more than one >. It also seems that example with space after > works with 2e2f306, but not with 6f4c5a1.

msva / lua-htmlparser

Greater than characters in attribute values causes tag parsing loop limit error #50