lo48576 / iri-string

String types for URIs/IRIs.
Apache License 2.0
15 stars 3 forks source link

Should `foo:.///bar` get normalized to `foo:/.//bar`? #28

Closed yescallop closed 2 years ago

yescallop commented 2 years ago

As is currently implemented in iri-string, the URI foo:.///bar gets normalized to foo:/.//bar, in which the rootless path .///bar gets normalized to an absolute path /.//bar. This behaviour does change the semantics of the path and therefore might not be desired. I suppose this can be addressed by changing the routine of path prepending to "If the path was originally absolute, prepend it with /.; otherwise, i.e., if the path was originally rootless, prepend it with ./".

I would suggest some change be made in the WHATWG URL Standard if this is a real issue, but before filing it there I'd love to hear your thoughts on this :)

By the way, while the non-idempotence in URL serialization has been fixed in the WHATWG URL Standard, the issue still remains untouched in IETF RFC 3986. I figure we can have a discussion on this with the IETF folks and possibly file an errata report for the RFC. Any thoughts?

lo48576 commented 2 years ago

Semantics change

Yes, this algorithm change can change some apps' behavior, and the new algorithm is not made default until v0.6.0 because of that.

I have an assumption that this kind of "abnormal" IRIs are rarely used and such IRIs without authority component usually won't use FS-like path strings, and apps using path component would use its own canonicalization before using it (e.g. std::path::Path::canonicalize()) if necessary. So I made WHATWG URL Standard algorithm default and ensure_rfc3986_normalizable optional. Only users who should handle abnormal slash-delimited paths without authority components should use this check. Typical HTTP users don't need to care (or even won't be aware of) this change.

Handling of a path component in this library

The primary target of this library is IRIs, so I prioritize correct (according to the specs) handling of IRIs more than common ways of handling of paths. For example, foo//bar is usually treated as foo/bar in local FS context, but this kind of conversion is unacceptable as an IRI library. (I've seen IRI libraries that does this conversion automatically under some condition!) So, I think the recent change in this library is not problematic in the sense of IRI library.

foo:.///bar is decomposed as <scheme=foo>:<path=.///bar> so the "path" is actually .///bar, but I don't think normalizing "path" outside the IRI context makes sense (especially when it is relative path). For example, /foo/../bar in IRI context is normalized to /bar, but this is not always true in the context of local FS (if foo is symlink). And similarly, ../bar is expected to be normalized to /bar, as if the "current directory" is the root directory.

According to RFC 3986, I think the path component of an IRI should be handled as if:

So, foo:.///bar and foo:/.//bar can be safely convertible and "semantically equal" in RFC 3986 context.

If someone needs to handle paths outside of RFC 3986 context, they should not use it directly as a path. (Some common way is escaping the path (like foo:snapshot/%2F.%2F%2Fbar) or making it a query parameter (like foo:snapshot?path=.///bar)

Change in the WHATWG URL Standard?

As described above, I think preserving relative paths without "current directly is the root" assumption would be out of scope of IRI, so I think the current handling of the abnormal paths in WHATWG URL Standard is reasonable.

However, I don't have strong opinion to stop you to file the issue to the spec, since I've not experienced the situation that I need to preserve the relativity of paths in IRIs after normalization. Some people facing with the real world issues would have different opinion than me.

Fix (errata) in the RFC 3986 spec?

The fundamental issue is not non-idempotence, but incompleteness of the algorithm. Result of the string manipulation would be foo://bar, but the spec does neither explicitly reject nor accept it. (It seems the spec is simply not aware of such cases, since they are not listed in "5.4.2. Abnormal Examples".)

I imagine WHATWG people would have decided to prepend the prefix since they don't want to make the normalization and resolution fallible, but I think it is also reasonable to make the normalization and resolution fallible. Anyway, I agree that RFC 3986 should be fixed or at least should receive the report for this problem.

yescallop commented 2 years ago

After a bit of investigation, I notice that the WHATWG URL Standard treats the rootless path in foo:.///bar as opaque and does not require any dot segments to be removed from it: 2.8, 2.9 of scheme state; Live URL Viewer. So, I believe that the WHATWG spec is irrelevant when it comes to whether to preserve the relativity of paths or not, which brings us back to RFC 3986.

The fundamental issue is not non-idempotence, but incompleteness of the algorithm. Result of the string manipulation would be foo://bar, but the spec does neither explicitly reject nor accept it. (It seems the spec is simply not aware of such cases, since they are not listed in "5.4.2. Abnormal Examples".)

I agree with you. I'll try to let the authors of RFC 3986 know about this.

lo48576 commented 2 years ago

After a bit of investigation, I notice that the WHATWG URL Standard treats the rootless path in foo:.///bar as opaque and does not require any dot segments to be removed from it: 2.8, 2.9 of scheme state; Live URL Viewer. So, I believe that the WHATWG spec is irrelevant when it comes to whether to preserve the relativity of paths or not, which brings us back to RFC 3986.

Thank you for investigation. I'll handle this special case in #29 (and #30), though there doesn't seem to be obvious single solution...

lo48576 commented 2 years ago

29 and #30 is fixed. (I'll release them soon as v0.7.0.)

@yescallop Now I think I've done fixes and improvements possible in this crate. You can apply relative-path-preserving normalization (when authority is absent) by newly added RiStr::normalize_but_preserve_authorityless_relative_path()! (WTF, but I have no idea better than this name...)

So, can I close this? Or do you have any more thing to comment?