mrkkrp / modern-uri

Modern library for working with URIs
Other
68 stars 18 forks source link

Allow `[` and `]` in the query component #39

Open kirelagin opened 3 years ago

kirelagin commented 3 years ago

This is a follow up to mrkkrp/req#102. I am going to try to convince you that modern-uri implements RFC 3986 too naively.

I will base my case on the two main arguments:

  1. The specification is ambiguous (or, at least, it is unclear for me personally how to interpret it).
  2. There is practice and it differs from what is currently implemented.

The specification

The Reserved Characters section lists [ and ] as gen-delims, which is part of reserved. Then it says:

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.

This specification does not explicitly state that it is using RFC 2119 keywords, so it is not entirely clear to me how to interpret this “should”, but, I believe, the best option we have is to interpret it as per RFC 2119, and thus web-apps not escaping square brackets are likely in compliance. This alone should be enough to allow them.

It then goes on:

If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character's encoding in US-ASCII.

Now this “no delimiting role is known” part is pretty mysterious. The paragraph right above the one I have been quoting seems to be an attempt at explaining what this means, but I am having a hard time understanding it.

If we go now to the Query section, it reads:

The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.

which suggest that anything that is not a # is not a delimiter or, at least, the delimiting role of [ and ] is unknown (to me 🙂). This part of text may be non-normative and it contradicts the ABNF right below it, but the specification never explains what is normative and what is informative. The intent seems to be to disallow the square brackets, but it is not that obvious, in my opinion.

Given how ambiguous all this is, I am not convinced that implementing exactly the ABNFs from the specification is enough.

The real world

The discussion in the req issue was based around the handling of square brackets in real-world browsers. A couple of years ago there was a discussion about this in the Firefox issue tracker and there is a fantastic survey of behaviours of various browsers.

They surveyed specifically what the browsers were sending to the server, while the discussion on req was about “my browser converts” – I am not sure what exactly you were testing. Note that the actual URL location in that issue was already encoded (likely, by GitHub’s markdown processor): the HTML code of the link was <a href="https://gitlab.com/morley-framework/morley/-/issues/new?issue%5Btitle%5D=Indigo%20website:%20" rel="nofollow">https://gitlab.com/morley-framework/morley/-/issues/new?issue[title]=Indigo%20website:%20</a>. And even when I follow this encoded link in Firefox, it decodes the brackets. This is a result of that survey I linked – as you can see, Firefox was the only browser forcefully encoding square brackets, so this was considered a bug and was changed.

I can’t make a hyperlink with unencoded brackets here on GitHub, but if you paste https://gitlab.com/morley-framework/morley/-/issues/new?issue[title]=Indigo%20website:%20 into the address bar of your Google Chrome it will (I hope – at least that is what I see testing in Chromium) not encode the brackets and send them as is.

Thus, we can see that, regardless of what the specification has to say, in practice the square brackets are thought to be allowed in the query component and not allowing them was fixed as a bug in a major web-browser.

kirelagin commented 3 years ago

I have to admit that the first part of my text ignores the elephant in the room: RFC 3986 was obsolete by WHATWG URL and the latter very explicitly prohibits [ and ] in the query component.

But the second part still stands and, I imagine, there are plenty of people who are not aware of the fact that their square brackets are now definitely illegal :/.

kirelagin commented 3 years ago

One alternative would be to redefine “modern URI” to mean “modern URL according to WHATWG”. Then disallowing the square brackets is the right behaviour indeed, but I am not sure if this library is actually in compliance. E.g. the parsing algorithms is almost certainly not. Here is the basic URL parser specification for reference.

mrkkrp commented 3 years ago

What do other packages do? Let's say network-uri and uri-bytestring?

kirelagin commented 3 years ago

I believe, network-uri implements the RFC naively (i.e. fails on []).

uri-bytestring is kinda weird in that it only parses &-separated key=value pairs, BUT it uses a customisable predicate for testing the validity of characters and ships two predefined tests, with laxURIParserOptions allowing unencoded brackets in queries.

kirelagin commented 3 years ago

(Here is the commit that implements this: https://github.com/Soostone/uri-bytestring/commit/b89303219680170d51a02bd89c11ddae98562222.)

kirelagin commented 3 years ago

I realise this issue is a bit cumbersome, since I was doing the research as I was writing it and changed my opinion on the matter a couple of times in the process.

Currently, I think there are two independent actionable items: