Closed M1ha-Shvn closed 1 year ago
The url matching code is here:
The comment there suggests that it'll match any characters except those denoted as unsafe in RFC 1738. Unsafe characters like [ and ] need to be encoded. So that's what's going on here.
RFC 3986 updates RFC 1738 and has a set of reserved characters which includes [ and ]. I think we need to update the url regex to RFC 3986.
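To illustrate the difference, here is a minimal sketch. The two patterns below are hypothetical, simplified stand-ins written for this comment and are not bleach's actual URL regex; the example URL is also made up. The only point is that excluding [ and ] from the character class ends the match at the first bracket, while allowing them matches the whole URL.

```python
import re

# Hypothetical, simplified patterns for illustration only (NOT bleach's real regex).
# RFC 1738 style: [ and ] are treated as unsafe, so they terminate the match.
URL_RFC1738_STYLE = re.compile(r'https?://[^\s<>"{}|\\^\[\]`]+')
# RFC 3986 style: [ and ] are reserved characters and may appear in the URL.
URL_RFC3986_STYLE = re.compile(r'https?://[^\s<>"{}|\\^`]+')

# Made-up URL with array-style and object-style query parameters.
url = "http://example.com/?a[]=1&b[k]=2"

strict_match = URL_RFC1738_STYLE.match(url).group(0)
relaxed_match = URL_RFC3986_STYLE.match(url).group(0)

print(strict_match)   # stops right before the first "["
print(relaxed_match)  # covers the whole URL
```

With the stricter pattern, everything from the first [ onward falls outside the matched link, which is consistent with the splitting behavior reported in this issue.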
I'm pretty busy for a while. I'll accept a PR if someone wants to do one.
@willkg Removing the \[\] escape in the regex gave the expected result, but I'm wondering why &params is converted to ¶ms. Is it an issue or expected behavior for ¶?
The ¶ issue
from bleach import Linker
linker = Linker()
text = 'http://test.com?&params=2'
print(linker.linkify(text))
## prints: <a href="http://test.com?¶ms=2" rel="nofollow">http://test.com?¶ms=2</a>
I believe this happens somewhere in BleachHTMLSerializer class: https://github.com/mozilla/bleach/blob/ed06d4e56b70e08fae2dd8f13b6a1955cf106029/bleach/html5lib_shim.py#L661
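For context on why this particular prefix is affected: HTML5 keeps a set of legacy named character references that are recognized even without a trailing semicolon, and &para (¶) is one of them. The sketch below uses Python's standard-library html module, not bleach, so it only demonstrates the entity rule itself; whether bleach's serializer hits the same rule at the line linked above is an assumption on my part.

```python
import html

# Standard-library demonstration only; this does not call into bleach.
# "&para" is a legacy HTML5 named character reference that is decoded even
# without a trailing semicolon, so the "&para" prefix of "&params" becomes "¶".
raw = "http://test.com?&params=2"
decoded = html.unescape(raw)

print(decoded)
```

Escaping the ampersand as &amp; before any entity decoding step avoids this kind of accidental match.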
The ¶ thing should be a separate issue. This issue is covering array arguments.
@willkg I am sorry, please see https://github.com/mozilla/bleach/issues/670
Thank you! I appreciate it!
Thank you for writing this up!
Hi. Library versions up to 3.1.0 incorrectly parse array and object URL parameters:
As you can see, the URL is split at [], losing part of the link.