libwww-perl / HTML-Parser

The HTML-Parser distribution is a collection of modules that parse and extract information from HTML documents.

Handle <unclosed </tags [rt.cpan.org #47748] #1

Open oalders opened 4 years ago

oalders commented 4 years ago

Migrated from rt.cpan.org#47748 (status was 'new')

Requestors:

From jmehnle@cpan.org on 2009-07-09 17:02:41 :

The other day, I received a spam e-mail with a text/html body part like
this:

==============================================================
blah blah<br><br
<a href=http://domain/path.html target=_blank>Go!</a><br><p>blah
==============================================================

My spam filter failed to parse the href URL from the message body due to
the unclosed "<br" tag.  Closing it causes HTML::Parser to correctly
parse the URL.
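
The report can be checked with a short script along these lines. This is only a sketch: `extract_hrefs` is a hypothetical helper name that is not part of HTML::Parser, and the script simply prints whatever the installed parser version reports for the snippet above.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: collect the href attribute of every <a> start tag.
sub extract_hrefs {
    my ($html) = @_;
    my @hrefs;
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tagname, $attr) = @_;
                push @hrefs, $attr->{href}
                    if $tagname eq 'a' && defined $attr->{href};
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return @hrefs;
}

# The body part from the spam message, including the unclosed "<br".
my $html = <<'HTML';
blah blah<br><br
<a href=http://domain/path.html target=_blank>Go!</a><br><p>blah
HTML

my @found = extract_hrefs($html);
print "Found ", scalar(@found), " link(s): @found\n";
```

Changing `<br` to `<br>` in the heredoc and running the script again gives the comparison described above.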

I noticed that http://search.cpan.org/dist/HTML-Parser/Parser.pm#BUGS says:

«Unclosed start or end tags, e.g. "<tt<b>...</b</tt>" are not recognized.»

I don't understand what the implication of this is, however.  Is it a
conscious decision not to support unclosed tags, or has there just been
no use case for a fix?

I tested how various browsers handle the HTML from the spam message
above:

At least the following do render the link despite the preceding broken
"<br" tag:  Firefox 3, Konqueror from KDE 3.5.9, Safari 3 & 4, Mail.app

At least the following do NOT render the link:  IE 6, Opera 9.63

I'd appreciate it if an option could be added to HTML::Parser to
recognize unclosed tags.
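
Until such an option exists, one possible stopgap is to repair the input before handing it to the parser. The sketch below is a heuristic and not anything HTML::Parser provides; `close_unclosed_tags` is a hypothetical name. It assumes that a "<" appearing inside a tag before any ">" means the earlier tag was left unclosed, so it will mangle a quoted attribute value that legitimately contains "<".

```perl
use strict;
use warnings;

# Hypothetical pre-processing step: if a tag is opened ("<" followed by a
# letter or "/") and another "<" appears before any ">", insert the missing
# ">" while keeping the whitespace that followed the broken tag.
sub close_unclosed_tags {
    my ($html) = @_;
    $html =~ s{<([A-Za-z/][^<>]*?)(\s*)(?=<)}{<$1>$2}g;
    return $html;
}

# Example from the HTML::Parser BUGS section.
print close_unclosed_tags('<tt<b>...</b</tt>'), "\n";   # <tt><b>...</b></tt>

# The broken line break from the spam message.
print close_unclosed_tags("<br><br\n<a href=http://domain/path.html>Go!</a>"), "\n";
```

A bare "<" in text, as in "a < b", is left alone because the pattern requires a letter or "/" directly after the "<".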