mirage / ocaml-uri

RFC3986 URI parsing library for OCaml

Provide a normalize function #70

Closed Chris00 closed 9 years ago

Chris00 commented 9 years ago

http://validator.w3.org/feed/ says that URIs used as identifiers should be in canonical form, as described in section 6 of RFC 3986. It would be great if Uri provided a function

normalize : t -> t

that would return the canonical form. A function

to_canonical : t -> string

would be as good for my needs.

dsheets commented 9 years ago

I believe Uri.(resolve "" uri (of_string "")) should do the normalization you require. It will handle scheme, host, path, and encoding normalization. This (like many behaviors) should be better documented and perhaps should be exposed directly. I'm leaving this open to track that.

Please let me know if you need normalization different from that provided by resolve, or need clarification about what exactly it does.

Chris00 commented 9 years ago

Sorry, it does not work. One needs to provide the canonical form of the Uri.t so that comparison amounts to String.compare. Uri already lowercases the hostname (a good thing). Section 6.2.3 says that an empty path should be normalized to a path of "/". Maybe Uri.compare needs some improvement too, in view of:

# Uri.compare (Uri.of_string "http://x.y") (Uri.of_string "http://x.y/");;
- : int = -1

The section also says that an explicit ":port", for which the port is empty or the default for the scheme, is equivalent to one where the port and its ":" delimiter are elided, and thus should be removed by scheme-based normalization. Thus

     http://example.com
     http://example.com/
     http://example.com:/
     http://example.com:80/

are equivalent (and likewise for https, ...). I guess other normalizations should also be performed: for example, http://x.y/ is equivalent to http://x.y/?, ... The devil is in not forgetting any...
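To make the two rules above concrete, here is a minimal sketch of scheme-based normalization. It deliberately does not use Uri.t: the `parts` record, `default_port`, `normalize` and `to_canonical` are hypothetical names invented for illustration, and the sketch assumes an http-like URI that always has an authority.

```ocaml
(* Hypothetical, simplified view of a URI with an authority.
   This is NOT the library's Uri.t. *)
type parts = {
  scheme : string;    (* e.g. "http" *)
  host : string;      (* e.g. "example.com" *)
  port : int option;  (* None when the port is absent or empty *)
  path : string;      (* "" when no path was given *)
}

(* Default ports for the schemes this sketch knows about. *)
let default_port = function
  | "http" -> Some 80
  | "https" -> Some 443
  | _ -> None

(* Lowercase scheme and host, drop a port equal to the scheme's
   default, and replace an empty path with "/". *)
let normalize p =
  let scheme = String.lowercase_ascii p.scheme in
  let host = String.lowercase_ascii p.host in
  let port = if p.port = default_port scheme then None else p.port in
  let path = if p.path = "" then "/" else p.path in
  { scheme; host; port; path }

(* Serialize the normalized form so that equivalent inputs can be
   compared with String.compare. *)
let to_canonical p =
  let p = normalize p in
  let port =
    match p.port with
    | None -> ""
    | Some n -> ":" ^ string_of_int n
  in
  p.scheme ^ "://" ^ p.host ^ port ^ p.path
```

Under these assumptions, a record with no port (which is also how an empty port after ":" would be represented here) and one with `Some 80` serialize to the same string for http, as do an empty path and "/".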

dsheets commented 9 years ago

The failure to elide the port is a bug. The failure to normalize a missing path to the root path may also be a bug, but it is somewhat less obvious to me (e.g. the comparison example). Your query string example is not a valid normalization, as it changes the bytes on the wire, and servers are free to interpret the ? and everything after it as they see fit.
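One way to keep an absent query distinct from an empty one is to model the query as an option. This is only an illustrative sketch; `with_query` is a hypothetical helper, not part of the Uri API.

```ocaml
(* Hypothetical helper: append a query component to an already
   serialized prefix. None means no "?" was present; Some "" means
   a bare "?" was present and must be kept. *)
let with_query prefix = function
  | None -> prefix
  | Some q -> prefix ^ "?" ^ q
```

With this representation, `None` and `Some ""` serialize to different strings, so a normalizer built on it never rewrites one into the other.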

Improving normalization is definitely an important issue that we should work on.