mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/

bug: Linkify incorrectly parses query params starting with "&para" #670

Closed filak closed 1 year ago

filak commented 2 years ago

To Reproduce

from bleach import Linker
linker = Linker()
text = 'http://test.com?a=1&par=1&parameterA=2'
print(linker.linkify(text))
## prints:   <a href="http://test.com?a=1&amp;par=1¶meterA=2" rel="nofollow">http://test.com?a=1&amp;par=1¶meterA=2</a>

Expected behavior

## prints:   <a href="http://test.com?a=1&amp;par=1&amp;parameterA=2" rel="nofollow">http://test.com?a=1&amp;par=1&amp;parameterA=2</a>

Additional context

I believe this might happen somewhere in the html5lib_shim.py / BleachHTMLSerializer class: https://github.com/mozilla/bleach/blob/ed06d4e56b70e08fae2dd8f13b6a1955cf106029/bleach/html5lib_shim.py#L661

willkg commented 2 years ago

&para is being consumed as an entity. We fixed this in clean and I think we need to fix linkify in a similar way.
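
To illustrate the entity consumption described above, here is a minimal sketch using only Python's standard-library html.unescape. It is not bleach's code path; it is used here on the assumption that it follows the same HTML5 rule of resolving legacy named references that lack a trailing semicolon.

```python
from html import unescape

# Query strings from the report; the ampersands are not escaped as &amp;.
urls = [
    "http://test.com?a=1&par=1&parameterA=2",
    "http://test.com?a=1&notify=1&register=2",
]

for url in urls:
    # unescape() resolves legacy names such as "para", "not" and "reg"
    # even without a trailing semicolon, so part of the parameter name is lost.
    print(unescape(url))
```

In the first URL, "&par=1" survives because "par" is only a valid reference when written as "&par;", while "&parameterA" loses its "&para" prefix to the pilcrow character.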

filak commented 2 years ago

There are more entities with the same effect, e.g. &not and &reg:

from bleach import Linker
linker = Linker()
text = 'http://test.com?a=1&notify=1&register=2'
print(linker.linkify(text))
## prints:   <a href="http://test.com?a=1¬ify=1®ister=2" rel="nofollow">http://test.com?a=1¬ify=1®ister=2</a>

jvanasco commented 1 year ago

Adding for context:

This is related to #294. The W3C calls this "fragile syntax".

IIRC, prior to the HTML5 spec the trailing semicolon for named references was NOT required, but it has been required since then. (see "Errors involving fragile syntax constructs" in the original https://dev.w3.org/html5/spec-LC/Overview.html and the current https://html.spec.whatwg.org/#syntax-errors )
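
As a rough way to see which bare ampersands in a string fall into this "fragile syntax" category, the sketch below checks each "&word" against the legacy names in Python's standard-library html.entities.html5 table (the entries stored without a trailing semicolon). The helper name fragile_ampersands is hypothetical and is not part of bleach or html5lib.

```python
import re
from html.entities import html5

# Names that HTML5 still recognises without a trailing semicolon
# (the html5 table lists them both as "para" and "para;").
LEGACY_NAMES = {name for name in html5 if not name.endswith(";")}

AMP_WORD = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def fragile_ampersands(text):
    """Return (word, legacy_prefix) pairs for risky bare ampersands."""
    hits = []
    for match in AMP_WORD.finditer(text):
        word, semicolon = match.groups()
        if semicolon and word + ";" in html5:
            continue  # a complete, properly terminated reference
        # Try the longest prefix first, as an HTML5 parser is assumed to do.
        for end in range(len(word), 1, -1):
            if word[:end] in LEGACY_NAMES:
                hits.append((word, word[:end]))
                break
    return hits


print(fragile_ampersands("http://test.com?a=1&par=1&parameterA=2"))
print(fragile_ampersands("http://test.com?a=1&notify=1&register=2"))
```

A check like this could be used to build regression test cases for linkify; it does not fix the serializer itself.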