mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/
Other
2.65k stars 253 forks source link

bug: Relative url is removed when the allowed protocol is https #662

Closed LocNguyen-KMS closed 2 years ago

LocNguyen-KMS commented 2 years ago

Describe the bug

bleach.clean does not keep the relative url when the allowed protocol is https

Python and bleach versions

Steps to reproduce the behavior

from bleach.sanitizer import Cleaner
BleachCleaner = Cleaner(
    tags=['a', 'br'],
    attributes={'a': 'href'},
    styles=[],
    protocols=['https'],
    strip=True,
    strip_comments=True
)

description = 'create new study <a href="/path/to/study">Mental study</a>'
BleachCleaner.clean(description)

Expected behavior In bleach version 2.0.0, the relative url is not removed from the result: 'create new study <a href="/path/to/study">Mental study</a>'

Actual behavior In bleach version 3.3.1, the relative url is removed from the result: 'create new study <a>Mental study</a>'

Additional context Actual result in bleach 3.3.1

Screen Shot 2022-05-09 at 11 14 27

Expected result in bleach 2.0.0

Screen Shot 2022-05-09 at 11 21 25
willkg commented 2 years ago

I verified this is still true of Bleach 5.0.0:

> python
Python 3.8.10 (default, Jun 24 2021, 10:31:00) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bleach
>>> bleach.clean("<a href=\"/path/to/thing\">thing</a>", tags=["a"], attributes={"a": "href"})
'<a href="/path/to/thing">thing</a>'
>>> bleach.clean("<a href=\"/path/to/thing\">thing</a>", tags=["a"], attributes={"a": "href"}, protocols=["https"])
'<a>thing</a>'
willkg commented 2 years ago

I think we should fix this. I can think of a couple of options:

  1. Always allow links with no protocol or maybe always allow links with no protocol if http or https is allowed. I can't think of cases where this would be bad, but Bleach has been around for ages and this is the first time I've seen this issue come up, so maybe I'm not imaginative.
  2. Add support for a protocol value that covers protocol-less links. Maybe call it "self" like in CSP? Specifying "self" as an allowed protocol would match protocol-less links. This would allow developers to specify whether or not to allow links that don't specify a protocol. We would add "self" to the list of ALLOWED_PROTOCOLS. I'm not thrilled about having a non-protocol in the protocol list, but I think it'd be ok for now.

I think I like option 2 more.

@g-k What do you think? Can you think of a scenario where option 2 does the wrong thing?

willkg commented 2 years ago

I'm skimming old issues and this is related to PR #565 and issue #611.

g-k commented 2 years ago

It's reasonable for users to expect bleach to match browser and 2.x behavior, but it'd be good to provide users an option to opt out too.

So Option 2 seems preferable. We can update the default schemes later if we want to allow scheme-less URIs by default too.

Can you think of a scenario where option 2 does the wrong thing?

Maybe if there's another protocol registered to self? Nothing else is registered to self:// though, so that might be an option.

We should clarify self vs schema-less links like //thirdparty.com/foo.js too (empty str seems better there but still kind of wrong to use a falsy value).

There is this caveat in the URL spec that I think we should follow:

A URL cannot have a username/password/port if its host is null or the empty string, or its scheme is "file".

Checking bleach and python urlparse behavior against the wpt url tests would probably be good too.

Also, there's an older bug around rewriting the base scheme to javascript:. Bleach doesn't have access to the base URL, but maybe it should take it as an optional arg?

CSP's base-uri mitigates the risk of base overwriting too, so we should probably mention that stuff in the docs e.g. probably don't allow the base element with schema-less or relative links.

willkg commented 2 years ago

I should have read more of the code. We have this currently:

https://github.com/mozilla/bleach/blob/ed06d4e56b70e08fae2dd8f13b6a1955cf106029/bleach/sanitizer.py#L491-L494

I think we should just change that to:

if "http" in allowed_protocols or "https" in allowed_protocols:
    return value

Your comment has a bunch of other tasks:

The caveat in the url spec is interesting, but I don't think we need to do anything here unless we find out browsers are doing the wrong thing.

I'll write up a new issue to check the Bleach vendored urlparse against wpt url tests.

I'm not sure I understand the thought about base url as an optional arg. That sounds like it should be a new issue.

I'm not sure I understand the thought about CSP's base-uri, either, but that also sounds like it should be a new issue.