Wrong parse [rt.cpan.org #55629]

Migrated from rt.cpan.org#55629 (status was 'open')

Requestors:

NIKOLAS@cpan.org

From nikolas@cpan.org on 2010-03-16 15:09:51 :

HTML:
<iframe/**/src="http://mail.ru"  name="poc iframe jacking"  width="100%"
height="100%" scrolling="auto" frameborder="no"></iframe>

$parser = HTML::Parser->new(
 api_version => 3,
 start_h => [ sub{
   my ($Self, $Text, $Tag, $Attr) = @_;
   print "Tag is: ".$Tag;
 }, "self, text, tagname, attr" ]
);
$parser->ignore_elements( qw( iframe ));
$parser->ignore_tags( qw( iframe ));

output:
Tag is: iframe/**/src="http://mail.ru"

From nikolas@cpan.org on 2010-03-18 13:51:31 :

ÐÑÑ ÐÐ°Ñ 16 11:09:51 2010, NIKOLAS Ð¿Ð¸ÑÐ°Ð»:
> HTML:
> <iframe/**/src="http://mail.ru"  name="poc iframe jacking"  width="100%"
> height="100%" scrolling="auto" frameborder="no"></iframe>
> 
> $parser = HTML::Parser->new(
>  api_version => 3,
>  start_h => [ sub{
>    my ($Self, $Text, $Tag, $Attr) = @_;
>    print "Tag is: ".$Tag;
>  }, "self, text, tagname, attr" ]
> );
> $parser->ignore_elements( qw( iframe ));
> $parser->ignore_tags( qw( iframe ));
> 
> output:
> Tag is: iframe/**/src="http://mail.ru"

HTML: <script/src="ya.ru"> wrong parse same

From gaas@cpan.org on 2010-04-04 20:38:08 :

I don't understand what rules you propose that HTML::Parser should follow to parse this kind of 
bogus HTML.  You think it should treat "/**/" and "/" as whitespace?

From nikolas@cpan.org on 2010-06-01 07:13:54 :

Here 3 regular expressions applied to the entrance text correct this
problems:
s{(/\*)}{ $1}g;
s{(\*/)}{$1 }g;
s{(<[^/\s<>]+)/}{$1 /}g;

Probably you will find more correct architectural decision.

libwww-perl / HTML-Parser

Wrong parse [rt.cpan.org #55629] #12