mrkkrp / modern-uri

Modern library for working with URIs
Other
68 stars 18 forks source link

Fixed percent-encoding on query strings with '+' #5

Closed duairc closed 6 years ago

duairc commented 6 years ago

Special handling of + is needed in query strings — in rendering, '+' must be percent-encoded, and any '+' encountered in parsing must be parsed as ' '.

Ideally ' ' would be rendered as '+' as well, but this would need changes in several places, and '%20' isn't wrong, so I haven't attempted to implement that.

mrkkrp commented 6 years ago

Where is this defined? I can't find any mention of special treating of '+' in RFC 3986:

https://tools.ietf.org/html/rfc3986#section-3.4

I'm aware of this feature, but I just want to understand if it's a commonly used convention or something imposed by a document. If the first, the transformation should probably be handled by the user of the library, not the library itself.

duairc commented 6 years ago

You're right, it's not defined in RFC 3986 as far as I know, so technically it's not a URI thing. It is defined in RFC 1866 (HTML 2.0), in paragraph 8.2.1. Note that it only affects query strings: + in the path should be interpreted literally.

There was also a draft RFC describing the application/www-form-urlencoded format but it expired and doesn't seem to have been replaced by anything.

For what it's worth though, this isn't just an aesthetic thing. At the moment I'm writing an implementation of SAML in Haskell (that has to talk to another service written in PHP). SAML basically works by passing base64-encoded, deflated XML documents around in the query string. Base64 produces + characters. I was using modern-uri to build these URLs with the SAML requests in the query string. The other end (written in PHP) kept choking on the input, because it was interpreting the + signs in the Base64 data as `, and it took me a long time to figure out what was happening (whitespace doesn't automatically make base64 encoded data invalid, so it was theinflate` step that was failing on the PHP side).

You're right that I could just do the transformation myself, but there isn't an obvious way of doing this without parsing the whole URL string again myself. Replacing + with %2B is fine, but this should only be done after the ? sign. It seems silly to render the URI as a string, and then parse that string again looking for a ?, just to apply a transformation thereafter.

Basically I'm saying, I understand if you don't want to make it the default behaviour, but it really makes sense for this functionality to exist in modern-uri somewhere, I don't think it should be something users should have to do.

mrkkrp commented 6 years ago

OK, I see. I think the only reasonable solution is to create a record of settings (there may be more of them in the future, who knows) where we'll put a setting controlling whether we should encode/decode space as the plus sign + as per RFC 1866. We then must pass this record, let's call it SerializationSettings maybe, to both renders and parsers.

New tests and a changelog update are also necessary... This will change the API, so it must be a new major version, 0.2 (hopefully the library is still new and it won't break that much code).

If you like, you can try to do this, otherwise I'll do it myself pretty soon.

Thanks for opening the issue!

duairc commented 6 years ago

That sounds reasonable. It also sounds like at least a few hours of work though, and I can't see myself getting the time to do that this week or next.

However, I just had a little look at RFC 3986 again — I don't actually see any mention of special handling of & and = in query strings either. It doesn't seem to define any concept of "query parameters", it just seems to say that everything after the ? is the "query string". As far as I can tell, the concept of query parameters comes from the RFC 1866 that I mentioned. GIven that modern-uri seems to have the concept of query parameters, as far as I can tell, the correct way to handle them is always to use application/x-www-form-urlencoded (i.e., + is ) encoding.

Perhaps then the uriQuery field should actually contain a sum type like:

data Query = QueryString Text | QueryParams [QueryParam]

Parsing would potentially be ambiguous though, so I think you'd still need to pass an option to the parser, but rendering would be unambiguous.

mrkkrp commented 6 years ago

Given that modern-uri seems to have the concept of query parameters, as far as I can tell, the correct way to handle them is always to use application/x-www-form-urlencoded (i.e., + is ) encoding.

Good point. Now I'm inclined to just merge this PR. I need to think about this more first. I'll return to the discussion tomorrow perhaps.

mrkkrp commented 6 years ago

I'm merging this, I'll add some more tests, a changelog note, and other stuff myself. modern-uri-0.1.2.0 with the fix will be on Hackage today.

Thank you for pointing out the problem!