matthiask / django-prose-editor

ProseMirror-based HTML editor for Django
https://django-prose-editor.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
87 stars 5 forks source link

Passing tags and attributes to nh3.clean() #9

Open carltongibson opened 1 month ago

carltongibson commented 1 month ago

Hi @matthiask — thanks for the great package, as always 🎁

I have a question about the nh3 sanitizer:

https://github.com/matthiask/django-prose-editor/blob/1a18e77c1ceb52f0b678e18d1c8dec8c21d267f7/django_prose_editor/sanitized.py#L4-L7

Here we’re using clean but I wonder if we could pass the tags and attributes arguments as well?

Deriving those from the types used to instantiate the prose editor would allow precisely limiting the allowed HTML, I’m hoping.

What do you think?

matthiask commented 1 month ago

Thank you :)

So, I'm not 100% sure if this would be better achieved through settings, field arguments or some other way or if maybe it shouldn't be done at all.

The argument for not making the nh3 invocation configurable is that nh3's job here isn't to clean up the HTML, it's only to protect against XSS when someone circumvents the ProseMirror widget. As long as editors do not do anything strange ProseMirror will always serialize a cleaned up version of the HTML fragment already, even if you use the raw HTML popup. (That's also the reason why I use nh3 instead of something more opinionated such as html-sanitizer.)

carltongibson commented 1 month ago

So @adamchainz wrote a post on nh3 in which he says:

The nh3 defaults (from Ammonia) are generally safe, allowing only general content tags and attributes, but they are still pretty wide, allowing 75 different tags. Allowing some tags on this default list, such as <article>, may lead to surprising results. In the worst case, an attacker may be able to craft official-looking content with evil instructions to users.

Typically, incoming HTML fragments come from a rich text editor on your site. In this case, use the arguments of nh3.clean() to limit allowed tags and attributes to only those your editor will create. This way, there won’t be any chance of surprise content. (src)

I find that kind of compelling, and we do know the restricted set of tags and attributes that should be allowed (I think 🤔)

SanitizedProseEditorField knows the config that it will pass to the widget at __init__() so could determine the tags and attributes allowed and pass those parameters in a closure to the sanitize function (or similar). (Better than a setting no?)

I think it's safe as is, but could maybe be more so (hardened) by restricting the allowed tags.

No sure either. But just working it through with you.

carltongibson commented 1 month ago

I guess a validator here might show if someone was trying to be naughty 😅

matthiask commented 1 month ago

I find that kind of compelling, and we do know the restricted set of tags and attributes that should be allowed (I think 🤔)

Yes, by looking at e.g. prosemirror-schema-basic, for example the link mark: https://github.com/ProseMirror/prosemirror-schema-basic/blob/4e358f5443f15abe09406c9327e35aa9e3c1bf8c/src/schema-basic.ts#L113-L123

I think the nodes and marks definitions in prosemirror-schema-basic and prosemirror-schema-list have been unchanged (practically) since the beginning. I have had discussions in private about allowing id and name attributes, and maybe target could be useful, but that's a different issue.

SanitizedProseEditorField knows the config that it will pass to the widget at init() so could determine the tags and attributes allowed and pass those parameters in a closure to the sanitize function (or similar). (Better than a setting no?)

Deriving the list of allowed attributes and tags from the types list sounds like the way to go!

I think it's safe as is, but could maybe be more so (hardened) by restricting the allowed tags. [...] I guess a validator here might show if someone was trying to be naughty 😅

I totally agree with that, and I think there's value in hardening! A little more opinions on the server side would certainly be useful. Just to spell it out for everyone else, the cleaning is already more opinionated than nh3 is since e.g. the empty pararaph <p></p> is replaced by an empty string so that "no content" is actually falsy instead of truthy. So, adding this additional hardening wouldn't be introducing completely new concepts into the code.

carltongibson commented 1 month ago

Great. Makes sense. I will try and prototype something when I get a small window.