matthiask / html-sanitizer

Allowlist-based HTML cleaner
BSD 3-Clause "New" or "Revised" License
125 stars 23 forks source link

don't include text from sanitised elements? #33

Open AntiSol opened 1 year ago

AntiSol commented 1 year ago

Hi, thanks for html_sanitizer, I've been using it for a while and it's been doing a great job.

I was wondering if there's a way to have html_sanitizer not include the text from elements that it strips out. For example, given:

<badtag>bad text</badtag>
<p>good text</p>

I would like to have html-sanitizer return <p>good text</p>, but at the moment I get back bad text<p>good text</p>

I looked into doing this with an element_preprocessor, I was thinking I could simply set the text to an empty string if the tag wasn't in the list of allowed tags, but it seems that elements which aren't in the allow list don't make it into the preprocessors, e.g the preprocessor never fires for the element because the sanitizer is stripping that tag before it gets to that stage.

Any suggestions would be appreciated! :)

matthiask commented 1 year ago

Hi

Thanks!

I think that's not possible currently; the only place where a tag is dropped including its contents is here: https://github.com/matthiask/html-sanitizer/blob/e0937eb9af3f4dbecf3ca35d37af0e7e6cd55c07/html_sanitizer/sanitizer.py#L309-L317

Maybe something like this would work:

Or if that doesn't work you could try adding support for dropping tags incl. their contents to the sanitizer? I don't immediately see the value of this change to my use cases, but the addition shouldn't bring too much code with it and as long as the output is well tested this doesn't seem to burdensome.

AntiSol commented 1 year ago

Thanks for the response :)

I'm trying to be very strict about what I let through as I sanitise user-provided XML (not HTML), it would be useful to me to have this feature.

I'm not sure which way I'll go with this just yet, but seeing as you're amenable to having the functionality added, I just might go in that direction. You'll be the first to know if I do.

I'm not sure how you like to manage your issue tracker. If you like I'm happy for you to close this (and I can reopen a new issue, or just a pull request, if I do implement something), or perhaps you'd prefer to mark it a feature request?

Thanks for your time (and code!)

AntiSol commented 1 year ago

aah yes and your suggestion sounds sensible, but my problem now is that I don't actually just want to remove text from <badtag>, but from any tag that isn't in the allow list. Sorry I wasn't clear about that in my initial question.

matthiask commented 1 year ago

Don't worry, you can leave it open if you want to -- I'll maybe close it in a few months or years if it hasn't been addressed by then, or not.

aah yes and your suggestion sounds sensible, but my problem now is that I don't actually just want to remove text from , but from any tag that isn't in the allow list. Sorry I wasn't clear about that in my initial question.

Ah yes, that's different. Maybe the element preprocessor could receive the sanitizer instance so that it can inspect the list of allowed tags. The pre- and postprocessors could be made to return the tag if it should be kept and return nothing if the element or tree has been dropped already? Making this backwards compatible would require some thinking.

Or maybe there's a better way to do it.