Closed GoogleCodeExporter closed 9 years ago
For extra fun, try scanning this string:
"<UL><SUP><SUB><STRONG><SPAN><SMALL><PRE><P><OL><LI><I><H6><H5><H4><H3><H2><H1><HR><EM><FONT><DD><DT><DL><CODE><BR><BLOCKQUOTE><BIG><B><ADDRESS><A>filler text</A></ADDRESS></B></BIG></BLOCKQUOTE></BR></CODE></DL></DT></DD></FONT></EM></HR></H1></H2></H3></H4></H5></H6></I></LI></OL></P></PRE></SMALL></SPAN></STRONG></SUB></SUP></UL>"
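For anyone who wants to reproduce this, the input above can be rebuilt rather than copy-pasted, which avoids line-wrap damage. This is only a rough sketch: `build_nested` and `TAGS` are names I made up, the tag order is copied from the string above, and I am assuming the filler is the two words separated by a single space.

```python
# Tag order copied from the reported input string.
# build_nested and TAGS are hypothetical names, not part of any library.
TAGS = [
    "UL", "SUP", "SUB", "STRONG", "SPAN", "SMALL", "PRE", "P", "OL", "LI",
    "I", "H6", "H5", "H4", "H3", "H2", "H1", "HR", "EM", "FONT", "DD", "DT",
    "DL", "CODE", "BR", "BLOCKQUOTE", "BIG", "B", "ADDRESS", "A",
]


def build_nested(tags, text="filler text"):
    """Open every tag in order, add the text, then close in reverse order."""
    opens = "".join(f"<{t}>" for t in tags)
    closes = "".join(f"</{t}>" for t in reversed(tags))
    return opens + text + closes


sample = build_nested(TAGS)
```

`sample` can then be passed to whatever scan call you are testing against.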
The cleansed output on both 1.4.1 and 1.4.2 is:
<ul><sup><sub><strong><span><pre><p><small><ol><li /></ol></small></p><h1><hr /><em><font><dt /></font></em></h1></pre></span></strong></sub></sup></ul>
The errors reported are:
[The small tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.,
The small tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.,
The i tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.,
The h6 tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.,
The h5 tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.,
The h4 tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.,
The h3 tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.,
The h2 tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.,
The dd tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.,
The p tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.]
Original comment by cam.morris@gmail.com
on 17 Dec 2010 at 6:04
A colleague pointed out that all the offending examples I came up with appear
to be uncommon usage. He thought they probably do not conform to standard
HTML. So I did some browsing through the XHTML schema:
http://www.w3.org/TR/2010/REC-xhtml-basic-20101123/#a_smodules. Sure enough,
there is a massive structure of elements and what each one can contain. Looking
through the offending tags, they contain tags they should not contain.
For example, an H6 should not contain an H5, or vice versa. A Q tag should not
contain an H2 tag, etc.
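
To make the nesting point concrete, here is a rough sketch that flags one such rule, a heading opened inside another heading, using only Python's standard `html.parser`. It is not how NekoHTML or AntiSamy decides what to drop, and it covers a single rule rather than the full XHTML content model. `HeadingNestingChecker` and `find_heading_nesting` are names invented for this illustration.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
VOID = {"br", "hr"}  # elements with no content; never kept on the stack


class HeadingNestingChecker(HTMLParser):
    """Records (ancestor, child) pairs where a heading opens inside a heading."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements, outermost first
        self.violations = []  # list of (ancestor_heading, nested_heading)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in HEADINGS:
            for ancestor in reversed(self.stack):
                if ancestor in HEADINGS:
                    self.violations.append((ancestor, tag))
                    break  # report only the nearest heading ancestor
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray closers such as </br>; otherwise pop back to the match.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def find_heading_nesting(html):
    checker = HeadingNestingChecker()
    checker.feed(html)
    checker.close()
    return checker.violations
```

For instance, `find_heading_nesting("<H6><H5>x</H5></H6>")` reports the H5 as nested inside the H6, while sibling headings produce no violations.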
Original comment by cam.morris@gmail.com
on 11 Jan 2011 at 10:10
Your colleague is correct. NekoHTML (our HTML processor) will remove "illegal"
HTML, such as nested anchor tags, etc. I don't necessarily care about what the
spec thinks is illegal, so if you have a compelling and common example we can
revisit.
I think there was also a typo in your first comment - you said it worked
correctly in 1.3 - did you mean 1.4.3 or truly 1.3? Can you try it against HEAD
if you meant 1.3?
Original comment by arshan.d...@gmail.com
on 3 Feb 2011 at 8:05
Closing.
Original comment by arshan.d...@gmail.com
on 23 Feb 2011 at 1:20
Original issue reported on code.google.com by
cam.morris@gmail.com
on 17 Dec 2010 at 6:03