oalders / html-restrict

HTML::Restrict - Strip away unwanted HTML tags
Other
10 stars 9 forks source link

</br> tag allowed #24

Closed benkasminbullock closed 5 years ago

benkasminbullock commented 7 years ago

The HTML <br> tag has no end tag. The following example

use warnings;
use strict;
use utf8;
my $snippet = q{
    <P STYLE="font-size: 300%"><BLINK>"You may get to touch her<BR>
    If your gloves are sterilized<BR></BR>
    Rinse your mouth with Listerine</BR>
    Blow disinfectant in her eyes"</BLINK><BR>
    -- X-Ray Spex, <I>Germ-Free Adolescents<I>
    <SCRIPT>alert('!!');</SCRIPT>
};
use HTML::Restrict '2.2.3';
my $stuff = HTML::Restrict->new (
    trim => 0,
    rules => {
        b   => [],
        img => [qw( src alt / )],
    br => [],
    },
);
print $stuff->process ($snippet);

results in

"You may get to touch her<br>
If your gloves are sterilized<br></BR>
Rinse your mouth with Listerine</BR>
Blow disinfectant in her eyes"<br>
-- X-Ray Spex, Germ-Free Adolescents

Here the closing </BR> tags have not been removed by HTML::Restrict, but there is no such thing as a closing br tag in HTML.

Edit: sorry, I should have said "not been removed" here.

oalders commented 7 years ago

I wonder if we can blame HTML::Parser here? HTML::Restrict never claimed to fix bad HTML, only to remove valid tags which are unwanted.

Having said that, I can see an argument that says this module should remove anything that looks like a tag if it hasn't specifically been whitelisted.

benkasminbullock commented 7 years ago

Strangely enough,
produces the same effect as a
on browsers, so if you don't strip
tags out in HTML::Restrict, you're allowing a way for tricky people to add tags which show up in browsers but aren't removed by your module. Assuming there is some way for them to add style or javascript tags on to the
tag, that might even be a security flaw. However, I don't think there is very much valid JavaScript that can be added to
tags anyway, so it might not be worth worrying about.

By the way, I forgot to point out that the example code I sent comes from the HTML::Laundry module's synopsis.

On 28 February 2017 at 12:03, Olaf Alders notifications@github.com wrote:

I wonder if we can blame HTML::Parser here? HTML::Restrict never claimed to fix bad HTML, only to remove valid tags which are unwanted.

Having said that, I can see an argument that says this module should remove anything that looks like a tag if it hasn't specifically been whitelisted.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/oalders/html-restrict/issues/24#issuecomment-282928065, or mute the thread https://github.com/notifications/unsubscribe-auth/AAGOdZXVjTMQIwlexxbJcdxI4b4Hp2Qdks5rg46GgaJpZM4MMMpT .

oalders commented 7 years ago

I don't know what the security implications are here, but if there's a relatively simple way to do this, I'd be inclined to accept a patch for it.

benkasminbullock commented 7 years ago

Looking at it a bit more clearly, I don't think there are any security implications since there cannot be style or javascript attributes on the closing tag. Also, browsers allow this </br> tag but don't allow any attributes on it. Further, if <br> is not allowed in HTML::Restrict, then </br> is also deleted, so there is not really any bug here.

If one wanted to do something about that, there is a rule in HTML::Tagset called %HTML::Tagset::emptyElement which can be used to warn about elements like this, if necessary, but since the browser doesn't disallow them, and actively displays </br> as if it was <br>, I don't think it's really necessary to do anything here, except perhaps add a test of the current behaviour. If you did want to add the warning, HTML::Tagset is already depended on by HTML::Parser, so adding it as a dependence wouldn't make much difference. Anyway I don't think this is a real problem.

oalders commented 7 years ago

Thanks for doing the research. I will leave this ticket open in case anyone wants to pick this up. Having said that, it looks like HTML::Tagset has mostly been abandoned.

benkasminbullock commented 7 years ago

HTML::Tagset seems abandoned, which is why I made the following module as a drop-in replacement for it, updated to HTML 5:

https://metacpan.org/pod/distribution/HTML-Valid/lib/HTML/Valid/Tagset.pod

Unfortunately it depends on tidy-html5:

https://github.com/htacg/tidy-html5

which is a rather "interesting" project.

oalders commented 7 years ago

BTW, I saw your review on HTML::Strip. I'd be open to adding a pretty mode on HTML::Restrict that turns <br> tags into a newline and perhaps turns a </p> into two newlines.

I wrote this module primarily to strip basically all HTML from form data, but I can see how it would be annoying to see this behaviour with <br>.

I've opened https://github.com/oalders/html-restrict/issues/25 to track this.

oalders commented 5 years ago

We don't think there's a bug here and #25 has been opened to track prettifying output, so I think it's ok to close this for now. Thanks for raising this @benkasminbullock.