Open janvogt opened 4 years ago
Dropping consecutive slashes in paths is a normalization procedure.
I see. But it makes this lib unusable in contexts where data: URIs might exist. If that's intended, this can be closed.
The problem here is that the slashes end up being interpreted as delimiters in a path. The uri value doesn't look like a valid URI to me.
As far as I understand the RFCs, it is valid though. See the example in RFC 2397, which also contains consecutive slashes.
I'd really like to use the library, but since data: URIs are pretty ubiquitous on the modern web, the library is unusable for manipulating websites without support for them. Maybe @mrkkrp can confirm whether supporting RFC 2397 is out of scope for this project?
Would you like to open a PR to add support for the data: scheme?
Actually, I did a little more digging. Let me tell you that I have come to appreciate how this library strives to follow the general URI handling rules laid out in RFC 3986. I think it would be wrong to handle scheme-specific cases here, be it data: or something else.
However, the normalisation of removing // (or more technically: empty path segments), while OK for http(s): et al., is not warranted for generic URIs:
Section 1.2.3 states about hierarchical paths:
For some URI schemes, the visible hierarchy is limited to the scheme itself: everything after the scheme component delimiter (":") is considered opaque to URI processing. Other URI schemes make the hierarchy explicit and visible to generic parsing algorithms.
Section 3.3 defines path segments as zero or more characters:
segment = *pchar
Section 6.1 states about normalisation (emphasis mine):
Because URIs exist to identify resources, presumably they should be considered equivalent when they identify the same resource. However, this definition of equivalence is not of much practical use, as there is no way for an implementation to compare two resources unless it has full knowledge or control of them. For this reason, determination of equivalence or difference of URIs is based on string comparison, perhaps augmented by reference to additional rules provided by URI scheme definitions.
and the central guiding principle for normalisation:
Therefore, comparison methods are designed to minimize false negatives while strictly avoiding false positives.
Removing // in data: URIs, though, leads to false positives and is thus non-compliant.
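To make the false positive concrete, here is a minimal sketch that uses only base and is not the library's actual code. The names splitOnSlash and dropEmptySegments and the two sample payloads are made up for illustration; the point is only that two different strings become indistinguishable once empty segments are dropped.

```haskell
import Data.List (intercalate)

-- Split a path on '/' while keeping empty segments,
-- so "a//b" becomes ["a", "", "b"].
splitOnSlash :: String -> [String]
splitOnSlash s = case break (== '/') s of
  (seg, [])       -> [seg]
  (seg, _ : rest) -> seg : splitOnSlash rest

-- Hypothetical normalisation step: drop every empty segment
-- and join the remainder with single slashes.
dropEmptySegments :: String -> String
dropEmptySegments = intercalate "/" . filter (not . null) . splitOnSlash

-- Two made-up payloads that differ only in the number of slashes.
payloadA, payloadB :: String
payloadA = "image/png;base64,AA//BB"
payloadB = "image/png;base64,AA/BB"

main :: IO ()
main = do
  -- The payloads are different strings ...
  print (payloadA == payloadB)
  -- ... but they compare equal once empty segments are dropped.
  print (dropEmptySegments payloadA == dropEmptySegments payloadB)
```

In RFC 3986 terms, the second comparison reports equivalence for two URIs that identify different data, which is exactly the kind of false positive the comparison rules are meant to avoid.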
I see four options going forward:
1) Allow empty path segments in the general case.
This conforms to the standard and should make the roundtrip of the OP possible. I'd see this as the correct(tm) way to solve this issue. The problem, though, is that it might break dependent code that assumes the normalization as currently performed. We could provide a normalizePath :: URI -> URI function to make the fix as easy as possible.
2) Allow empty path segments in the general case and perform normalisation for http(s): and ftp:
This would be a more compatible solution. The problem is that it's unprincipled: for these protocols we provide normalization on top of the RFC, and for other (hierarchical) protocols we don't?
3) Allow empty path segments in the general case and perform normalisation for all known protocols that allow it.
This would be the most compatible solution. The problem, though, is that it requires a lot of research. Also, shouldn't we then perform other protocol-specific normalisation as well, to improve the Eq instance as much as possible? This would need even more research.
4) Do nothing
This is the worst option IMO because it breaks the compliance of this otherwise pretty nice URI library.
I'd be happy to take a stab at either solution 1 or 2 and provide a pull request. Solution 3 would require too much research for me to commit to doing it alone. Option 4 does not involve any work.
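For illustration, option 2 could look roughly like the sketch below. It works on a toy record rather than the library's URI type, and ToyURI, normalizedSchemes and normalizeToyPath are hypothetical names, so treat it as an outline of the idea under those assumptions and not as the proposed patch.

```haskell
import Data.Char (toLower)

-- Toy stand-in for a parsed URI: a scheme plus raw path segments.
-- This is NOT the library's URI type.
data ToyURI = ToyURI
  { toyScheme   :: String
  , toySegments :: [String]
  } deriving (Eq, Show)

-- Schemes for which empty path segments are dropped (option 2).
normalizedSchemes :: [String]
normalizedSchemes = ["http", "https", "ftp"]

-- Drop empty segments only for the listed schemes and leave every
-- other URI (for example data:) untouched.
normalizeToyPath :: ToyURI -> ToyURI
normalizeToyPath u
  | map toLower (toyScheme u) `elem` normalizedSchemes =
      u { toySegments = filter (not . null) (toySegments u) }
  | otherwise = u

main :: IO ()
main = do
  -- The empty segment is removed for http ...
  print (normalizeToyPath (ToyURI "http" ["a", "", "b"]))
  -- ... but preserved for data:.
  print (normalizeToyPath (ToyURI "data" ["image", "png;base64,AA", "", "BB"]))
```

Option 3 would amount to growing the scheme list over time. The toy version also ignores details such as trailing slashes, which a real implementation would have to treat separately.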
One more thing: IMHO the roundtrip property
import Data.Text (Text)
import Text.URI (mkURI, render)
prop_roundtrip :: Text -> Bool
-- Holds trivially when parsing fails; otherwise rendering must reproduce the input.
prop_roundtrip uri = maybe True (\u -> render u == uri) (mkURI uri)
should really hold for all possible values of uri. That is, either parsing fails or the URL can be reproduced exactly as it was. This is incompatible with implicit normalisation, i.e. with all solutions but number 1.
I think the best way forward is to start with 2 and gradually move towards 3 as users request normalization for more schemes. If I understand correctly, the slashes in your original example are just part of the binary data and not delimiters of a path?
Yes, indeed, it's just base64-encoded binary, which is totally fine with the generic parsing algorithm as long as no normalisation happens by incorrectly assuming a hierarchy. If I understand RFC 3986 correctly, the whole part after the data: scheme is considered the path in this case, though not a hierarchical one.
Just to be clear: solutions (2) and (3) go beyond the RFC 3986 spec, insofar as it recommends only basic string comparison for equivalence checks in the generic case.
Additionally, having normalisation for only some protocols seems a little confusing. But if that's the way forward, I'll try to implement a PR for solution (2).
Yes, please go ahead with (2) :+1:
Ok, I'm on it.
I realised that round tripping a data uri breaks it. E.g.
I assume the data: scheme is not RFC compliant, so it's probably not a bug.
Nevertheless, I wanted to let you know, since it took me half a day to figure out why some of these URIs are broken. Maybe it is possible to have them round-trip correctly (e.g. by allowing empty path elements) or at least fail completely? Silently dropping slashes seems to me like unintended behaviour.