Open ricleal opened 5 years ago
Currently there isn't. Whitespace normalization is the first thing that happens, before anything else, before even parsing the HTML fragment:
That being said, now that we have a real testsuite (feincms-cleanse didn't have good coverage) I wouldn't be against selectively normalizing whitespace as long as nothing else changes, resp. only whitespace changes without effect. This would probably mean normalizing elem.text
and elem.tail
of all elements except of a few where normalization would be skipped.
Thanks a lot for the reply. It looks like it's not implemented yet but there's a way forward. 👍
https://github.com/matthiask/html-sanitizer/blob/c91448183519c0bd8c37dec9d137ea5090430f77/html_sanitizer/sanitizer.py#L194
Commenting this line doesn't seem to break any tests at first glance, but I see this function removes more kind of whitespace than normalize_whitespace_in_text_or_tail
.
Would including these additional whitespace in that function be enough to not need normalize_overall_whitespace
, as a first step?
Thanks for the great package by the way! This is the last issue I've had with it.
I think that the line does some things which are worthwhile such as normalizing various forms of whitespace. It could do this without collapsing whitespace though -- this decision could be left to normalize_whitespace_in_text_or_tail
. This in turn could only run if we aren't inside <pre>
at the moment (respectively inside a set of whitespace-preserving tags). Note that HTML elements are processed from the end to the beginning and from the inside out, so you'd have to peek ahead in the backlog
deque to find out whether we are inside such an element or not.
I'm having a hard time constructing a non-artificial test case which fails without normalize_overall_whitespace
when copy-pasting content from various sources. This may be because modern rich text widgets for web and/or browsers' contenteditable implementations are smarter than they were 10 years ago and they generally don't produce that ugly HTML anymore. I'm still reluctant to just remove all upfront whitespace normalization though. This is a piece of badly tested code, but code with a long legacy... in fact, a part of this is in use for almost 10 years now (https://github.com/feincms/feincms/commit/0186b470194c3fae3c4d714bcd306eda3cc2b8c4)
Oh, reading your comment again: I think it might work to just move the functionality inside normalize_whitespace_in_text_or_tail
but there may be an interaction with only_whitespace_re
where some elements might not be dropped anymore -- but that's just conjecture, I haven't taken a close look at the code.
I just wonder if there's an option to keep the format (e.g. white spaces, tabs, etc) in the
pre
tag. Thanks!