Closed lukepfarrar closed 11 years ago
(This was about self closing anchor tags, not newlines btw).
The current behavior is correct according to the HTML5 spec and other current browsers (Chrome, Safari, etc).
This happens because <a> tags cannot be self-closing (nor can <div>s). The newline after the </html> tag is treated as a text node, and promoted into the body element. Because the <a> within the <p> is still open, but the <p> is closed, the <a> is treated as an active formatting element, and applied to subsequent text.
You can check this in other browsers. Jsoup aims to output the same parsed DOM as current HTML5 browsers, to minimise surprises, so I am going to err on the side of the spec here.
I've thought about this a bit more and can't identify a problem with dealing with self-closing <div> and <a> tags etc. It's out of the spec, but I can't see a case where it will cause an undesired parse tree; conversely the current situation where browsers create a parse tree that is clearly not what the HTML author wanted is a bad outcome.
So I've modified the tree builder to allow all tags to self close. Defined tags that "shouldn't" self close (like div and a) will output with an end tag, so that the output is safe according to the spec.
I've taken a different implementation approach to yours, and implemented in the tree builder and not the tokenizer, so that it works for all tags.
c3c952e55f10b07dd9d4a9121db1b3828b0a1bc7
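For readers who want to see the two behaviours side by side, here is a minimal toy model in plain Java. It is not jsoup's tree builder and not the full HTML5 algorithm: the class `SelfCloseSketch`, the `build` method, the `honorSelfClosing` flag, and the pre-split token strings are all hypothetical names made up for illustration. Attributes are dropped, and only `a` is treated as a formatting element. With the flag off it mimics the spec-style handling described earlier in the thread (the unclosed `a` is reopened around the trailing newline); with the flag on it closes the element in the tree builder as soon as the self-closing start tag is seen, and the renderer always writes an explicit end tag.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model only: not jsoup's tree builder and not the full HTML5 algorithm. */
final class SelfCloseSketch {

    /** Element node; children are either Node (element) or String (text). */
    static final class Node {
        final String tag;
        final List<Object> children = new ArrayList<>();
        Node(String tag) { this.tag = tag; }
    }

    /**
     * Builds a tree from simplified tokens: "<x>" is a start tag, "<x/>" a start
     * tag with the self-closing flag, "</x>" an end tag, anything else is text.
     */
    static Node build(List<String> tokens, boolean honorSelfClosing) {
        Node root = new Node("#document");
        Deque<Node> open = new ArrayDeque<>();             // stack of open elements
        List<String> activeFormatting = new ArrayList<>(); // formatting tags not yet closed
        open.push(root);
        for (String t : tokens) {
            if (t.startsWith("</")) {
                String name = t.substring(2, t.length() - 1);
                // As in browsers, </body> and </html> leave the stack untouched.
                if (name.equals("body") || name.equals("html")) continue;
                activeFormatting.remove(name); // an explicit end tag ends the formatting
                if (open.stream().noneMatch(n -> n.tag.equals(name))) continue; // stray end tag
                while (!open.pop().tag.equals(name)) { /* pop through the matching element */ }
            } else if (t.startsWith("<")) {
                boolean selfClosing = t.endsWith("/>");
                String name = t.substring(1, t.length() - (selfClosing ? 2 : 1));
                Node element = new Node(name);
                open.peek().children.add(element);
                if (selfClosing && honorSelfClosing) continue; // closed immediately
                open.push(element);
                if (name.equals("a")) activeFormatting.add(name);
            } else {
                // Reopen formatting elements that are active but no longer on the stack.
                for (String name : activeFormatting) {
                    if (open.stream().noneMatch(n -> n.tag.equals(name))) {
                        Node reopened = new Node(name);
                        open.peek().children.add(reopened);
                        open.push(reopened);
                    }
                }
                open.peek().children.add(t);
            }
        }
        return root;
    }

    /** Counts elements with the given tag name in the subtree. */
    static int count(Node node, String tag) {
        int total = node.tag.equals(tag) ? 1 : 0;
        for (Object child : node.children) {
            if (child instanceof Node) total += count((Node) child, tag);
        }
        return total;
    }

    /** Serializes the children of a node, always writing an explicit end tag. */
    static String render(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Object child : node.children) {
            if (child instanceof Node) {
                Node el = (Node) child;
                sb.append('<').append(el.tag).append('>')
                  .append(render(el))
                  .append("</").append(el.tag).append('>');
            } else {
                sb.append(child);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("<html>", "<body>", "<div>", "<p>", "<a/>",
                "</p>", "</div>", "</body>", "</html>", "\n");
        System.out.println(count(build(tokens, false), "a")); // prints 2
        System.out.println(count(build(tokens, true), "a"));  // prints 1
    }
}
```

In this sketch the only difference between the two modes is whether the self-closing element is pushed onto the open-element stack; doing that in the tree-building step rather than in tokenization is what lets the same rule apply to every tag name.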
In the past I encountered a problem where self-closing tags were being replaced with the opening tag only. Now that I've upgraded to 1.7.2, the closing tag is added in the correct place, as per the above fix, but the rest of the document is now HTML-encoded (so no further tags are detected).
Raised as #305
Hi,
Parsing
"<html><body><div><p><a id=\"theId\" /></p></div></body></html>\n"
will return a Document with two <a id="theId" /> elements.
Hope this helps, Luke