Closed brunobuss closed 11 years ago
Some more test cases:
is( $hr->process( '<<' ), '<<', 'ok' );     # This is working now.
is( $hr->process( '<<a' ), '<<a', 'ok' );   # This doesn't work; it returns only '<'.
is( $hr->process( 'a<<' ), 'a<<', 'ok' );   # This is working now.
is( $hr->process( 'a<<a' ), 'a<<a', 'ok' ); # This doesn't work; it returns only 'a<'.
is( $hr->process( '<a<' ), '<a<', 'ok' );   # This doesn't work; it returns an empty string.
It seems as though HTML::Parser is interpreting those strings as comments. You can see this by enabling debug mode. If you set allow_comments, all of your test cases pass:
my $hr = HTML::Restrict->new( debug => 1, allow_comments => 1 );
The same will be true for the other sanitizing modules that are based on HTML::Parser:
use feature 'say';
use HTML::Scrubber;
my $s = HTML::Scrubber->new;
say $s->scrub('test<string');
# output: test
I recommend you try to encode unwanted HTML special characters, such as '<', as entities before processing the HTML. If that is not possible, then you could pre-process the data using a markup abstraction, such as Markdown, instead of trying to process arbitrary strings as HTML.
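As a rough illustration of the encoding idea, here is a minimal sketch that encodes only those '<' characters that do not begin something shaped like a tag. The helper name escape_stray_lt is hypothetical and not part of HTML::Restrict or HTML::Parser, and the regex is a heuristic, not a real HTML tokenizer. It assumes a tag starts with a letter, '/' or '!' and is closed by '>' before any other angle bracket.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Restrict or HTML::Parser):
# encode any '<' that does not begin something shaped like a tag.
sub escape_stray_lt {
    my ($text) = @_;

    # A '<' is left alone only if it is followed by a letter, '/' or '!',
    # and then by a '>' before any other '<' or '>' appears.
    $text =~ s{<(?![A-Za-z/!][^<>]*>)}{&lt;}g;

    return $text;
}

print escape_stray_lt('test<string'), "\n";       # test&lt;string
print escape_stray_lt('a<s<d>b'), "\n";           # a&lt;s<d>b
print escape_stray_lt('<b>bold</b> 1<2'), "\n";   # <b>bold</b> 1&lt;2
```

The result carries the stray character as the entity &lt; and not as a literal '<', so whatever consumes the sanitized text would need to accept entities. Anything that merely looks like a tag, such as '<d>' in the second example, is left for the sanitizer to deal with.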
If I encode/escape all HTML entities before passing the string to HTML::Restrict, it will do nothing, as there won't be any tags left. I don't know what I'll do to solve my problem... but I'm closing this issue, as it's clearly not an HTML::Restrict problem. Thank you for your time :+1: .
@brunobuss: Broken HTML is going to be a problem, but I guess opening an issue with HTML::Parser might be helpful. Thanks for adding those test cases to make it clear. And thanks to @perlpong for finding the problem. :)
When 'test<string' is passed to process(), it returns only 'test'. It should return the complete string.
Some test cases (sorry, I didn't know where to put them under t/, which is why I'm pasting them here):
is( $hr->process( '<' ), '<', 'ok' );
is( $hr->process( 'a<' ), 'a<', 'ok' );
is( $hr->process( '<a' ), '<a', 'ok' );
Also, if the numbers of '<' and '>' don't match, it seems like the HTML::Restrict parser is using some greedy strategy to match tags. For example, for 'a<s<d>b' I expected it to return "a<sb", but it returned only "ab".