lambdaisland / uri

A pure Clojure/ClojureScript URI library
Mozilla Public License 2.0

Allow query-string keys without values to map->query-string #36

Closed ilmoraunio closed 1 year ago

ilmoraunio commented 1 year ago

This solves a problem I run into when we want to add query-string keys that aren't accompanied by values (which should be RFC-compliant).

alysbrooks commented 1 year ago

Thanks for the PR!

Keys without values don't seem to be explicitly mentioned in standards, but standards are quite open about the format of the query string (particularly the URI standard), so I think including this makes sense. It appears the existing behavior of ignoring nil values was to allow for dissocing keys. But since nillable? is opt-in, that behavior is still available.

FWIW, when I try to set a null value for a query string's key using the JavaScript URLSearchParams API, it sets the key's value to the string "null". I think that may just be how JavaScript does it; I couldn't find an official standard mandating this behavior.

ilmoraunio commented 1 year ago

Hmm, apparently URLSearchParams API is ignorant of valueless query-strings by default!

const s = "foo&bar";
const searchparams = new URLSearchParams(s);
searchparams.toString();
// => 'foo=&bar='

This seems to be purposeful as it follows a WhatWG spec (https://url.spec.whatwg.org/#concept-urlencoded-serializer), from chapter "5.2. application/x-www-form-urlencoded serializing":

  1. Append name, followed by U+003D (=), followed by value, to output.

Unsure if we follow WhatWG here, I leave it up to you to decide.

As I understand it too, RFC 3986 leaves this particular detail open. That said, looking at the RFC's syntax, our approach should be supported:

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

   query         = *( pchar / "/" / "?" )

   fragment      = *( pchar / "/" / "?" )

   pct-encoded   = "%" HEXDIG HEXDIG

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="
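
As a rough check of the grammar above, the `query` rule can be transcribed into a regular expression and tested against both forms. This is only a sketch: `QUERY_RE` and `isRfc3986Query` are hypothetical names for illustration and not part of this library or any other.

```javascript
// Transcription of the RFC 3986 rule: query = *( pchar / "/" / "?" )
// where pchar = unreserved / pct-encoded / sub-delims / ":" / "@".
// QUERY_RE and isRfc3986Query are hypothetical names used only for this sketch.
const QUERY_RE = /^(?:[A-Za-z0-9\-._~!$&'()*+,;=:@\/?]|%[0-9A-Fa-f]{2})*$/;

function isRfc3986Query(s) {
  return QUERY_RE.test(s);
}

console.log(isRfc3986Query("foo&bar"));   // true: keys without values fit the grammar
console.log(isRfc3986Query("foo=&bar=")); // true: so does the WhatWG-serialized form
console.log(isRfc3986Query("foo bar"));   // false: a raw space is not allowed
console.log(isRfc3986Query("foo#bar"));   // false: "#" starts the fragment
```

Both `foo&bar` and `foo=&bar=` pass, which is consistent with the RFC leaving the key/value convention open.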

alysbrooks commented 1 year ago

I don't have a strong opinion, since it seems like JavaScript can still parse the format this PR would produce. I'll see what Arne thinks.

That being said, a query string with an empty value created with uri and one created with JavaScript will not be equal to each other:

const s = "foo&bar";
const searchparams = new URLSearchParams(s);
searchparams.toString(); // => 'foo=&bar='
const s2 = "foo=null&bar";
const searchparams2 = new URLSearchParams(s2);
searchparams2.toString(); // => 'foo=null&bar='
searchparams.get("foo") === searchparams2.get("foo")
//-> false

plexus commented 1 year ago

The query part of a URL/URI is just an ASCII string, from the WHATWG spec:

A URL’s query is either null or an ASCII string. It is initially null.

And we treat it as such

(:query (uri/uri "http://example.com?hello"))
;;=> "hello"

It is customary to encode key-values in there as x-www-form-urlencoded, and so we provide some helpers for further parsing and generating this string, if that is what you are doing. If you want different behavior you can write your own helpers. Some people put base64 encoded json in there. That's all up to the caller.

Using single identifiers rather than a pair separated by = is not something I've seen before, are there particular frameworks or libraries that promote this usage? Is there an ecosystem where this is common practice?

Reading the spec, it seems that the parsing section recognizes foo&bar, but when serializing the result is foo=&bar=. You could argue this is what we should've done when encountering nil values, but that ship has sailed; it's not a breaking change I'm willing to make at this point. I'm also not a fan of a special-case boolean flag. The caller is already able to distinguish between nil and the empty string if it really wants to encode a key with an empty value.

(uri/map->query-string {:foo "", :bar nil})
;; => "foo="

Note also that the spec's encoding algorithm starts with

Assert: tuple’s name and tuple’s value are scalar value strings.

So it doesn't define how to handle nil/null cases.

ilmoraunio commented 1 year ago

Using single identifiers rather than a pair separated by = is not something I've seen before, are there particular frameworks or libraries that promote this usage? Is there an ecosystem where this is common practice?

So yeah, apparently it's possible to have a URL as the query-string key. The use case is an (ad) tracker URL carried within the query-string, which we parse, add some query-params, serialize back to query-string, and then pass the whole URL back to the client. Example URL:

https://ad.doubleclick.net/foo/bar/B123.456;ltd=?https://www.my-ecommerce-shop.com/find-shop?utm_source=prospecting&utm_medium=banner&utm_campaign=my-campaign-foobar&utm_content=lol

Now, the above URL is not so much of a problem since the parsing downstream will still probably be able to cope with the extra = at the end. This however could be more of a problem:

https://ad.doubleclick.net/foo/bar/B123.456;ltd=?https://www.my-ecommerce-shop.com/find-shop

Appending a = here would cause a 404 on the client side. Hence the wish to avoid any extra unwanted characters.
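
To illustrate where the extra = comes from, here is a small sketch that runs the second example URL through the WhatWG URL and URLSearchParams APIs. It shows the JavaScript behavior only and makes no claim about what uri itself produces.

```javascript
// The tracker URL from the example above; its query is a bare URL with no "=".
const tracker = new URL(
  "https://ad.doubleclick.net/foo/bar/B123.456;ltd=?https://www.my-ecommerce-shop.com/find-shop"
);

// Parsing treats the whole query as a single key with an empty string value.
const entries = [...tracker.searchParams.entries()];
console.log(entries);

// Serializing writes name, "=", value for every tuple,
// so the formerly bare key comes back with a trailing "=".
const serialized = tracker.searchParams.toString();
console.log(serialized.endsWith("=")); // true
```

The key also comes back percent-encoded, so a parse/serialize round trip through this API does not preserve the original query byte for byte.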

Note also that the spec's encoding algorithm starts with

Assert: tuple’s name and tuple’s value are scalar value strings.

So it doesn't define how to handle nil/null cases.

Fair. :-)


Looking at this from afar, I don't think there's a way out of this using the WhatWG spec. I'll yield back on this PR for now, recommend closing ... unless someone can find a loophole from somewhere in the spec. ;-)

plexus commented 1 year ago

There's a bit of contention around which spec is "the" spec for URI/URLs. This library is mainly based on the RFCs. The WhatWG spec is meant to supersede the RFCs, and to revert to using only the name "URL", but not everyone seems to agree with that. The WhatWG spec takes a very browser-centric point of view, and is more an implementation guide than a spec (it outlines exact algorithms, but not e.g. formal grammars).

In the RFC world, the query string is essentially opaque. You can put anything in there (assuming you %-encode special characters and separators). We provide helpers for the common case of treating the query string as key-value pairs, but that's just a convention, and if you're sticking anything "special" in your query string that isn't k=v&l=w then you should deal with that yourself.

In a WhatWG world this de facto convention has in fact become de jure. And in their formulation nulls are not valid.

So... if you need to generate URLs with complete URLs embedded in the query string, then I suggest you make your own helpers for that. Correctly percent-encode, and then (assoc uri :query (my-query-helper)).
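
A custom helper along those lines might look roughly like the sketch below. It is written in JavaScript to match the snippets earlier in the thread rather than in Clojure, and `buildQuery` is a hypothetical name, not something this library provides. The assumption here is that a null or undefined value means "emit the bare key", while an empty string still produces `key=`.

```javascript
// Hypothetical helper (not part of any library): builds a query string from
// [key, value] pairs. A null/undefined value yields a bare key with no "=";
// an empty string yields "key=". Keys and values are percent-encoded.
function buildQuery(pairs) {
  return pairs
    .map(([key, value]) =>
      value === null || value === undefined
        ? encodeURIComponent(key)
        : `${encodeURIComponent(key)}=${encodeURIComponent(value)}`
    )
    .join("&");
}

const query = buildQuery([
  ["https://www.my-ecommerce-shop.com/find-shop", null],
  ["utm_source", "prospecting"],
]);
console.log(query);
// "https%3A%2F%2Fwww.my-ecommerce-shop.com%2Ffind-shop&utm_source=prospecting"

// Attach the result as an opaque string instead of going through URLSearchParams.
// The host and path below are placeholders.
const url = new URL("https://example.com/path");
url.search = query;
console.log(url.href);
```

Because the result is assigned as a plain string, nothing re-serializes it as key-value pairs, so no = is appended to the bare key.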