matthiask / html-sanitizer

Allowlist-based HTML cleaner
BSD 3-Clause "New" or "Revised" License
129 stars 23 forks source link

`pre` tag keep format #10

Open ricleal opened 5 years ago

ricleal commented 5 years ago

I just wonder if there's an option to keep the format (e.g. white spaces, tabs, etc) in the pre tag. Thanks!

matthiask commented 5 years ago

Currently there isn't. Whitespace normalization is the first thing that happens, before anything else, before even parsing the HTML fragment:

https://github.com/matthiask/html-sanitizer/blob/c91448183519c0bd8c37dec9d137ea5090430f77/html_sanitizer/sanitizer.py#L194

That being said, now that we have a real testsuite (feincms-cleanse didn't have good coverage) I wouldn't be against selectively normalizing whitespace as long as nothing else changes, resp. only whitespace changes without effect. This would probably mean normalizing elem.text and elem.tail of all elements except of a few where normalization would be skipped.

ricleal commented 5 years ago

Thanks a lot for the reply. It looks like it's not implemented yet but there's a way forward. 👍

mirukana commented 5 years ago

https://github.com/matthiask/html-sanitizer/blob/c91448183519c0bd8c37dec9d137ea5090430f77/html_sanitizer/sanitizer.py#L194 Commenting this line doesn't seem to break any tests at first glance, but I see this function removes more kind of whitespace than normalize_whitespace_in_text_or_tail. Would including these additional whitespace in that function be enough to not need normalize_overall_whitespace, as a first step? Thanks for the great package by the way! This is the last issue I've had with it.

matthiask commented 5 years ago

I think that the line does some things which are worthwhile such as normalizing various forms of whitespace. It could do this without collapsing whitespace though -- this decision could be left to normalize_whitespace_in_text_or_tail. This in turn could only run if we aren't inside <pre> at the moment (respectively inside a set of whitespace-preserving tags). Note that HTML elements are processed from the end to the beginning and from the inside out, so you'd have to peek ahead in the backlog deque to find out whether we are inside such an element or not.

I'm having a hard time constructing a non-artificial test case which fails without normalize_overall_whitespace when copy-pasting content from various sources. This may be because modern rich text widgets for web and/or browsers' contenteditable implementations are smarter than they were 10 years ago and they generally don't produce that ugly HTML anymore. I'm still reluctant to just remove all upfront whitespace normalization though. This is a piece of badly tested code, but code with a long legacy... in fact, a part of this is in use for almost 10 years now (https://github.com/feincms/feincms/commit/0186b470194c3fae3c4d714bcd306eda3cc2b8c4)

matthiask commented 5 years ago

Oh, reading your comment again: I think it might work to just move the functionality inside normalize_whitespace_in_text_or_tail but there may be an interaction with only_whitespace_re where some elements might not be dropped anymore -- but that's just conjecture, I haven't taken a close look at the code.