oalders / html-restrict

HTML::Restrict - Strip away unwanted HTML tags

Another br tag issue #26

Closed blybb closed 5 years ago

blybb commented 7 years ago

I'm on HTML::Restrict v2.2.4. My problem is that <br/> tags are not preserved with the basic rule br => []; they are kept only if there is a space between 'br' and '/'.

#!/usr/bin/env perl

use warnings;
use strict;

use utf8;

my $snippet = 'two element open & close break<br></br>
    one element open & close break <br />
    one element open & close break no space<br/>
';
use HTML::Restrict;
my $stuff = HTML::Restrict->new(
    trim  => 0,
    rules => {
        br  => [],
        'br/' => [],
    },
);
print $stuff->process($snippet);

Only with the rule 'br/' is the last line break preserved.

Without the 'br/' rule I get this output:

    two element open & close break<br></br>
    one element open & close break <br />
    one element open & close break no space

Is adding a 'br/' rule the intended solution?

oalders commented 7 years ago

So, I think the issue may be with HTML::Parser. https://rt.cpan.org/Public/Bug/Display.html?id=83570

Just turning on the "empty_element_tags" option might make the parser behave the way you expect. It might be that we should just switch the default for this option.

#!/usr/bin/env perl

use warnings;
use strict;

use utf8;

my $snippet = 'two element open & close break<br></br>
    one element open & close break <br />
    one element open & close break no space<br/>
';
use HTML::Restrict;
my $stuff = HTML::Restrict->new(
    trim  => 0,
    rules => {
        br    => [],
    },
);
$stuff->parser->empty_element_tags(1);
print $stuff->process($snippet);

$ perl html-restrict.pl
two element open & close break<br></br>
    one element open & close break <br>
    one element open & close break no space<br>
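
To see why the 'br/' rule matched in the original report, it can help to ask HTML::Parser directly which tag names it reports. The following is a minimal sketch, not part of HTML::Restrict: start_tag_names is a hypothetical helper introduced only for illustration, and it assumes HTML::Parser is installed. With the option off, the parser should treat the trailing slash in <br/> as part of the tag name; with it on, both forms should be reported as br.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use HTML::Parser;

# Hypothetical helper (not part of any module): return the tag names that
# HTML::Parser reports for start tags in $html, with the
# empty_element_tags option switched on (1) or off (0).
sub start_tag_names {
    my ( $html, $empty_element_tags ) = @_;

    my @names;
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { push @names, shift }, 'tagname' ],
    );
    $parser->empty_element_tags($empty_element_tags);
    $parser->parse($html);
    $parser->eof;

    return @names;
}

# Compare the reported names for both spellings of the tag.
for my $flag ( 0, 1 ) {
    my @names = start_tag_names( 'a<br />b<br/>c', $flag );
    print "empty_element_tags=$flag: @names\n";
}
```

Since HTML::Restrict rules are keyed on the tag name the parser reports, a name such as br/ would not match a plain br rule, which is consistent with the behaviour described in the report.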

I'm not sure advocating that people mess with the parser is the best solution here, but maybe this should be an option that could be passed to HTML::Restrict. I'm not sure if the option should be to preserve the old (confusing) behaviour or not, though.
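
Until something like that exists, the parser tweak could be kept in one place in calling code. This is only a sketch under stated assumptions: new_restrict_with_empty_tags is a hypothetical wrapper, not an HTML::Restrict API, and it relies on the parser accessor used in the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use HTML::Restrict;

# Hypothetical wrapper (not part of HTML::Restrict): build an object with
# the given constructor arguments, then enable empty_element_tags on the
# underlying HTML::Parser so that <br/> is reported as "br".
sub new_restrict_with_empty_tags {
    my (%args) = @_;

    my $restrict = HTML::Restrict->new(%args);
    $restrict->parser->empty_element_tags(1);

    return $restrict;
}

my $restrict = new_restrict_with_empty_tags(
    trim  => 0,
    rules => { br => [] },
);
print $restrict->process("open & close break no space<br/>\n");
```

Callers then only need the plain br rule, and the workaround can be dropped in one spot if the module later exposes its own option or changes its default.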

Note that the <br/> and <br /> tags are rewritten as <br>, with the slash removed, but that may be fine in a lot of cases: https://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br

oalders commented 7 years ago

@haarg any opinions on this?