jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Jsoup fails to parse HTML with trailing new lines. #258

Closed lukepfarrar closed 11 years ago

lukepfarrar commented 12 years ago

Hi,

Parsing "<html><body><div><p><a id=\"theId\" /></p></div></body></html>\n" will return a Document with two <a id="theId"> elements.

Hope this helps, Luke

lukepfarrar commented 12 years ago

(This was about self closing anchor tags, not newlines btw).

jhy commented 11 years ago

The current behavior is correct according to the HTML5 spec and other current browsers (Chrome, Safari, etc).

This happens because <a> tags cannot be self-closing (nor can <div>s). The newline after the </html> tag is treated as a text node and promoted into the body element. Because the <a> within the <p> is still open when the <p> is closed, the <a> is treated as an active formatting element and applied to subsequent text.

You can check this in other browsers. Jsoup aims to output the same parsed DOM as current HTML5 browsers, to minimise surprises, so I am going to err on the side of the spec here.
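
To illustrate the mechanism described above, here is a rough toy model in Java. It is not jsoup code and is far simpler than the HTML5 tree construction algorithm: the class name ActiveFormattingSketch, the string-based tokens, and the methods build, render and demo are all made up for this sketch, attributes are omitted, and only <a> is tracked as a formatting element. The "/" in <a ... /> is assumed to be ignored, so the tag arrives as a plain start tag.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Toy model only: not jsoup's tree builder and not the full HTML5 algorithm.
class ActiveFormattingSketch {
    static final class Node {
        final String tag;   // null for a text node
        final String text;  // null for an element
        final List<Node> children = new ArrayList<>();
        Node(String tag, String text) { this.tag = tag; this.text = text; }
    }

    // Tokens are plain strings: "<a>" opens, "</p>" closes, anything else is text.
    static Node build(List<String> tokens) {
        Node root = new Node("#root", null);
        Deque<Node> open = new ArrayDeque<>();              // stack of open elements
        List<String> activeFormatting = new ArrayList<>();  // <a> tags not yet explicitly closed
        open.push(root);
        for (String t : tokens) {
            if (t.startsWith("</")) {
                String name = t.substring(2, t.length() - 1);
                boolean isOpen = false;
                for (Node n : open) if (n.tag.equals(name)) isOpen = true;
                if (!isOpen) continue;                       // stray end tag: ignore it
                while (!open.pop().tag.equals(name)) { /* pop up to and including the match */ }
                activeFormatting.remove(name);               // only an explicit </a> clears the entry
            } else if (t.startsWith("<")) {
                String name = t.substring(1, t.length() - 1);
                Node el = new Node(name, null);
                open.peek().children.add(el);
                open.push(el);
                if (name.equals("a")) activeFormatting.add(name);
            } else {
                // Before inserting text, re-open any formatting element that was
                // popped off the stack as a side effect of closing its parent.
                for (String name : activeFormatting) {
                    boolean stillOpen = false;
                    for (Node n : open) if (n.tag.equals(name)) stillOpen = true;
                    if (!stillOpen) {
                        Node again = new Node(name, null);
                        open.peek().children.add(again);
                        open.push(again);
                    }
                }
                open.peek().children.add(new Node(null, t));
            }
        }
        return root;
    }

    static String render(Node n) {
        if (n.tag == null) return n.text;
        boolean isRoot = n.tag.equals("#root");
        StringBuilder sb = new StringBuilder();
        if (!isRoot) sb.append('<').append(n.tag).append('>');
        for (Node c : n.children) sb.append(render(c));
        if (!isRoot) sb.append("</").append(n.tag).append('>');
        return sb.toString();
    }

    // Token stream for <div><p><a /></p></div> followed by a newline.
    static String demo() {
        return render(build(Arrays.asList("<div>", "<p>", "<a>", "</p>", "</div>", "\n")));
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

In this model the trailing newline ends up inside a second <a> element, which mirrors the duplicate reported in this issue.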

jhy commented 11 years ago

I've thought about this a bit more and can't identify a problem with dealing with self-closing <div> and <a> tags etc. It's out of the spec, but I can't see a case where it will cause an undesired parse tree; conversely the current situation where browsers create a parse tree that is clearly not what the HTML author wanted is a bad outcome.

So I've modified the tree builder to allow all tags to self close. Defined tags that "shouldn't" self close (like div and a) will output with an end tag, so that the output is safe according to the spec.

I've taken a different implementation approach to yours, and implemented in the tree builder and not the tokenizer, so that it works for all tags.
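
A minimal sketch of the idea, for illustration only. This is not the code from the commit: the class name SelfClosingSketch, the string tokens, the abbreviated VOID_TAGS list, and the methods build, render and demo are assumptions made for the example, and attributes are left out. The point is that the self-closing flag is handled when the element is inserted into the tree, so it works for any tag name, and the serializer still writes an end tag for non-void elements.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: not jsoup's tokenizer or tree builder.
class SelfClosingSketch {
    // Void elements never get an end tag (abbreviated list for the example).
    static final Set<String> VOID_TAGS = new HashSet<>(Arrays.asList("br", "hr", "img", "input"));

    static final class Node {
        final String tag;   // null for a text node
        final String text;  // null for an element
        final List<Node> children = new ArrayList<>();
        Node(String tag, String text) { this.tag = tag; this.text = text; }
    }

    // Tokens are plain strings: "<a>" opens, "<a/>" self-closes, "</a>" closes, anything else is text.
    static Node build(List<String> tokens) {
        Node root = new Node("#root", null);
        Deque<Node> open = new ArrayDeque<>();  // stack of open elements
        open.push(root);
        for (String t : tokens) {
            if (t.startsWith("</")) {
                String name = t.substring(2, t.length() - 1);
                boolean isOpen = false;
                for (Node n : open) if (n.tag.equals(name)) isOpen = true;
                if (isOpen) while (!open.pop().tag.equals(name)) { /* pop up to and including the match */ }
            } else if (t.startsWith("<")) {
                boolean selfClosing = t.endsWith("/>");
                String name = t.substring(1, t.length() - (selfClosing ? 2 : 1));
                Node el = new Node(name, null);
                open.peek().children.add(el);
                // Key step: a self-closing (or void) element is never left on the
                // stack, so later content cannot end up inside it.
                if (!selfClosing && !VOID_TAGS.contains(name)) open.push(el);
            } else {
                open.peek().children.add(new Node(null, t));
            }
        }
        return root;
    }

    static String render(Node n) {
        if (n.tag == null) return n.text;
        boolean isRoot = n.tag.equals("#root");
        StringBuilder sb = new StringBuilder();
        if (!isRoot) sb.append('<').append(n.tag).append('>');
        for (Node c : n.children) sb.append(render(c));
        // Non-void elements are always written with an explicit end tag.
        if (!isRoot && !VOID_TAGS.contains(n.tag)) sb.append("</").append(n.tag).append('>');
        return sb.toString();
    }

    // Token stream for <div><p><a /></p></div> followed by a newline.
    static String demo() {
        return render(build(Arrays.asList("<div>", "<p>", "<a/>", "</p>", "</div>", "\n")));
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Here the <a/> produces a single empty <a></a> inside the <p>, and the trailing newline stays outside it.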

jhy commented 11 years ago

c3c952e55f10b07dd9d4a9121db1b3828b0a1bc7

bouncysteve commented 11 years ago

In the past I encountered a problem where self-closing ...
