oalders / html-restrict

HTML::Restrict - Strip away unwanted HTML tags
Other
10 stars 9 forks source link

Better fix for handling malformed tags #37

Closed haarg closed 5 years ago

haarg commented 5 years ago

Doing filtering multiple times is inefficient and incorrect. Malformed tags would normally be rendered by the browser as text, but filtering multiple times means this module is treating them as tags. Also, coderef based filters or replace_img may react badly if applied multiple times to a single input.

This reverts commits: bb3a2129d0fb4109c04621cbce14bbdcc76f2098 cf1979bbf165bf8a9681d89fafa82b5e849aaf5c a28342962d867e3c02aca8d6f1a4526bf3e7d67a f0b5c25b6af81955d55d885caaa7a5475512989d

Malformed tags will be parsed as text. If these are passed through unmodified, the stripping can allow them to create real tags that should be filtered out.

Text should always be properly encoded. Browsers will still decode entities that don't have a terminating semicolon. Lone ampersands or angle brackets that don't form entities or tags will be treated as literals. Fix the encoding of all of these so that stripping can't make them into different entities or tags.

haarg commented 5 years ago

I'm not entirely satisfied with how the encoding fix is done here, but I couldn't see a better option.

haarg commented 5 years ago

The other option for fixing the encoding would be to do encode_entities(decode_entities($text)), although that may change the form of some correctly encoded entities.

oalders commented 5 years ago

What concerns me here is this behaviour:

$ perl -Ilib -MHTML::Restrict -E 'say HTML::Restrict->new->process("You & I")'
You & I

That's going to be problematic in the case where I'm just stripping content from user-supplied text fields which should contain zero HTML and I'm not expecting to get encoded entities back.

haarg commented 5 years ago

If they are meant to contain text, why would you not just encode_entities on them and not bother with any stripping?

oalders commented 5 years ago

Good point. Is this ready for review?

haarg commented 5 years ago

It should be.