seek4science / seek

For finding, sharing and exchanging Data, Models, Simulations and Processes in Science.
http://www.seek4science.org
BSD 3-Clause "New" or "Revised" License
76 stars 52 forks source link

Cannot add matrix links as 'discussion channels' #1928

Open hexylena opened 3 months ago

hexylena commented 3 months ago

Describe the bug

Matrix URLs (which are admittedly a bit strange with two #) are considered invalid.

To Reproduce Steps to reproduce the behavior:

  1. Edit your workflow
  2. Add a discussion channel like https://matrix.to/#/#Galaxy-Training-Network_Lobby:gitter.im and text Matrix
  3. Save
  4. See error

Expected behavior

It should save

Screenshots

image

Desktop (please complete the following information):

Smartphone (please complete the following information):

Additional context Add any other context about the problem here.

fbacall commented 3 months ago

Ruby's uri standard library (which I assume our URL validator uses under-the-hood) does not like the # in the URL fragment*

I tried parsing that URL with various other libraries in different languages: Python 3's urllib, Firefox's JavaScript implementation, and the addressable Ruby gem all handled it fine.

I spent a bit of time reading the spec, and unfortunately it seems to be a case of everyone else being wrong.

According to https://datatracker.ietf.org/doc/html/rfc3986#section-3.5 a # in the URL fragment needs to be percent-encoded (as %23).

* Actually even it seems to not agree with itself, it will not escape the # if you manually set it in the fragment (it does for other special characters), and will produce a URI that itself cannot parse:

3.1.4 :037 > u = URI.parse("https://matrix.to/")
 => #<URI::HTTPS https://matrix.to/> 
3.1.4 :038 > u.fragment = "/#Galaxy-Training-Network_Lobby:gitter.im"
 => "/#Galaxy-Training-Network_Lobby:gitter.im" 
3.1.4 :039 > u
 => #<URI::HTTPS https://matrix.to/#/#Galaxy-Training-Network_Lobby:gitter.im> 
3.1.4 :040 > u.to_s
 => "https://matrix.to/#/#Galaxy-Training-Network_Lobby:gitter.im" 
3.1.4 :041 > URI.parse(u.to_s)
/home/finn/.rvm/rubies/ruby-3.1.4/lib/ruby/3.1.0/uri/rfc3986_parser.rb:66:in `split': bad URI(is not URI?): "https://matrix.to/#/#Galaxy-Training-Network_Lobby:gitter.im" (URI::InvalidURIError)
hexylena commented 3 months ago

That would explain why it gets improperly highlighted in so many places (even in their own client), if the second # needs to be encoded. Ok, I will encode it, no need for seek changes. Thanks so much for the help debugging!

fbacall commented 3 months ago

Is that a common URL pattern for matrix channels? It may be something we need to support regardless...

hexylena commented 3 months ago

To my knowledge that pattern is common to all matrix servers, but I could be wrong.

There are two ways to specify matrix rooms actually https://app.element.io/#/room/!:matrix.org is also used occasionally.

relatedly, https://timothygu.me/urltester/#input=https%3A%2F%2Fmatrix.to%2F%23%2F%23Galaxy-Training-Network_Lobby%3Agitter.im

fbacall commented 3 months ago

Interesting!

I looked up the "WHATWG URL Standard" and this is what it says about fragment encoding:

The fragment percent-encode set is the C0 control percent-encode set and U+0020 SPACE, U+0022 ("), U+003C (<), U+003E (>), and U+0060 (`).

The C0 control percent-encode set are the C0 controls and all code points greater than U+007E (~).

A C0 control is a code point in the range U+0000 NULL to U+001F INFORMATION SEPARATOR ONE, inclusive.

# is U+0023 so does not need to be escaped according to that

I'll see if we can use a more modern URL validator