Support for "data:" URI scheme?

janvogt commented 4 years ago

I realised that round tripping a data uri breaks it. E.g.

import Text.URI

let uri = "data:image/gif;base64,R0lGODlhAQABAPAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw=="
in render <$> mkURI uri /= (pure uri) :: Maybe URI

I assume the data-scheme is not RFCs compliant - so it's probably not a bug.

Nevertheless I wanted to let you know, since it took me half a day to figure out why some of these uri's are broken. Maybe it is possible to have them round-trip correctly (e.g. by allowing empty path elements) or at least fail completely? Silently dropping slashes seems to me like unintended behaviour.

mrkkrp commented 4 years ago

Dropping consecutive slashes in paths is a normalization procedure.

janvogt commented 4 years ago

I see. But it makes this lib unusable in contexts where data: URIs might exist. If that's intended, this can be closed.

mrkkrp commented 4 years ago

The problem here is that the slashes end up being interpreted as delimiters in a path. The uri value doesn't look like a valid URI to me.

janvogt commented 4 years ago

As far as I understand RFCs, it is valid though. See the example at RFC 2397 which also contains consecutive slashes.

janvogt commented 3 years ago

I'd really like to use the library, but since data: URIs are pretty ubiquitous in modern web, this library is unusable for manipulating websites without support for it. Maybe @mrkkrp can confirm that supporting RFC 2397 is out of scope for this project?

mrkkrp commented 3 years ago

Would you like to open a PR to add support for data: scheme?

janvogt commented 3 years ago

Actually, I did a little more digging. Let me tell you, that I become to appreciate this library striving to follow the general URI handling rules laid out in RFC 3986. I think it would be wrong to handle scheme specific cases here, be it data: or something else.

Problem

However, the normalisation of removing // (or more technically: empty path segments), while o.K. for http(s): et. al., is not warranted for generic URIs:

Section 1.2.3 states about hierarchical paths:

For some URI schemes, the visible hierarchy is limited to the scheme itself: everything after the scheme component delimiter (":") is considered opaque to URI processing. Other URI schemes make the hierarchy explicit and visible to generic parsing algorithms.

Section 3.3 defines path segments as zero or more characters:

segment = *pchar

Section 6.1 states about normalisation (emphasis mine):

Because URIs exist to identify resources, presumably they should be considered equivalent when they identify the same resource. However, this definition of equivalence is not of much practical use, as there is no way for an implementation to compare two resources unless it has full knowledge or control of them. For this reason, determination of equivalence or difference of URIs is based on string comparison, perhaps augmented by reference to additional rules provided by URI scheme definitions.

and the central guiding principle for normalisation:

Therefore, comparison methods are designed to minimize false negatives while strictly avoiding false positives.

Removing // in data: URIs though, leads to false positives and is thusly non compliant.

Solutions

I see two main solutions, going forward

1) Allow empty path segments in the general case.

This conforms to the standard and should make the roundtrip of the OP possible. I'd see this as the correct(tm) way to solve this issue. The problem though is, that it might break dependent code, that assumes the normalization as currently performed. We could provide a normalizePath :: URI -> URI function to make the fix as easy as possible.

2) Allow empty path in the general case and perform normalisation for http(s): and ftp:

This would be a more compatible solution. The problem is, that it's unprincipled: so for these protocols we provide normalization on top of the RFC and for other (hierarchical) protocols we don't?

3) Allow empty path in the general case and perform normalisation for all known protocols that allow it.

This would be the most compatible solution. The problem though is that it requires a lot of research. Also shouldn't we then other protocol specific normalisation as well, to improve the Eq instance as much as possible? This would need even more research.

4) Do nothing

This is the worst option IMO because it breaks the compliance of this, otherwise pretty nice, URI library.

Offer

I'd be happy to take a stab at either solution 1 or 2 and provide a pull request. Solution 3 would require to much research to commit on doing it alone. Option 4 does not involve any work.

janvogt commented 3 years ago

One more thing: IMHO the roundtrip property

import Text.URI

prop_roundtrip :: Text -> Bool
prop_roundtrip uri = render <$> mkURI uri == (pure uri)

should really hold for all possible values of uri. That is either parsing fails or the URL can be reproduced exactly as it was. This is incompatible with implicit normalisation, e.g. all solutions but number 1.

mrkkrp commented 3 years ago

I think the best way forward is starting with 2 and gradually as users request normalization for more schemes we will move forward to 3. If I understand correctly the slashes in your original example are just part of the binary data, they are not delimiters of a path?

janvogt commented 3 years ago

Yes indeed it's just base64 encoded binary. Which is totally fine with the generic parsing algorithm, as long as there is no normalisation happening by incorrectly assuming a hierarchy. If I understand RFC 3986 correctly, the whole part after the data: scheme is considered as path in this case, though not an hierarchical one.

Just to be clear: Solution (2) and (3) are out of the RFC 3986 spec, insofar as it recommends only basic string comparison for equivalence checks in the generic case.

Additionally having some normalisation for some protocols, seems to be a little confusing. But if that's the way to go forward I'll try to implement a PR for solution (2)

mrkkrp commented 3 years ago

Yes, please go ahead with (2) :+1:

janvogt commented 3 years ago

Ok, I'm on it.

mrkkrp / modern-uri