Open hexylena opened 3 months ago
Ruby's uri
standard library (which I assume our URL validator uses under-the-hood) does not like the #
in the URL fragment*
I tried parsing that URL with various other libraries in different languages: Python 3's urllib, Firefox's JavaScript implementation, and the addressable
Ruby gem all handled it fine.
I spent a bit of time reading the spec, and unfortunately it seems to be a case of everyone else being wrong.
According to https://datatracker.ietf.org/doc/html/rfc3986#section-3.5 a #
in the URL fragment needs to be percent-encoded (as %23
).
* Actually even it seems to not agree with itself, it will not escape the #
if you manually set it in the fragment (it does for other special characters), and will produce a URI that itself cannot parse:
3.1.4 :037 > u = URI.parse("https://matrix.to/")
=> #<URI::HTTPS https://matrix.to/>
3.1.4 :038 > u.fragment = "/#Galaxy-Training-Network_Lobby:gitter.im"
=> "/#Galaxy-Training-Network_Lobby:gitter.im"
3.1.4 :039 > u
=> #<URI::HTTPS https://matrix.to/#/#Galaxy-Training-Network_Lobby:gitter.im>
3.1.4 :040 > u.to_s
=> "https://matrix.to/#/#Galaxy-Training-Network_Lobby:gitter.im"
3.1.4 :041 > URI.parse(u.to_s)
/home/finn/.rvm/rubies/ruby-3.1.4/lib/ruby/3.1.0/uri/rfc3986_parser.rb:66:in `split': bad URI(is not URI?): "https://matrix.to/#/#Galaxy-Training-Network_Lobby:gitter.im" (URI::InvalidURIError)
That would explain why it gets improperly highlighted in so many places (even in their own client), if the second #
needs to be encoded. Ok, I will encode it, no need for seek changes. Thanks so much for the help debugging!
Is that a common URL pattern for matrix channels? It may be something we need to support regardless...
To my knowledge that pattern is common to all matrix servers, but I could be wrong.
There are two ways to specify matrix rooms actually https://app.element.io/#/room/!
Interesting!
I looked up the "WHATWG URL Standard" and this is what it says about fragment encoding:
The fragment percent-encode set is the C0 control percent-encode set and U+0020 SPACE, U+0022 ("), U+003C (<), U+003E (>), and U+0060 (`).
The C0 control percent-encode set are the C0 controls and all code points greater than U+007E (~).
A C0 control is a code point in the range U+0000 NULL to U+001F INFORMATION SEPARATOR ONE, inclusive.
#
is U+0023 so does not need to be escaped according to that
I'll see if we can use a more modern URL validator
Describe the bug
Matrix URLs (which are admittedly a bit strange with two
#
) are considered invalid.To Reproduce Steps to reproduce the behavior:
https://matrix.to/#/#Galaxy-Training-Network_Lobby:gitter.im
and textMatrix
Expected behavior
It should save
Screenshots
Desktop (please complete the following information):
Smartphone (please complete the following information):
Additional context Add any other context about the problem here.