mirage / ocaml-uri

RFC3986 URI parsing library for OCaml
Other
98 stars 57 forks source link

Slashes in path components #35

Closed mjambon closed 10 years ago

mjambon commented 10 years ago
# Uri.(to_string (of_string "/foo%2Fbar"));;
- : string = "/foo/bar"

Result should be:

# Uri.(to_string (of_string "/foo%2Fbar"));;
- : string = "/foo%2Fbar"

It seems that some functions treat the path as a whole and encode/decode the whole path instead of each component. As a result, simply marking / as unsafe doesn't solve the problem.

I will let you know if I start working on a fix.

mjambon commented 10 years ago

It's late and everything but there also seems to be a recent problem with spaces:

# Uri.path (Uri.of_string "/a%20b");;
- : string = "/a b"

I'll follow up tomorrow.

dsheets commented 10 years ago

The spaces problem is not a problem. Uri.path returns a human string representation of the component. The slashes problem is a problem and we will have to maintain path components to solve it.

avsm commented 10 years ago

I agree with @dsheet's triage -- do you really want the space representation preserved as well?

-anil

On 6 Feb 2014, at 10:43, David Sheets notifications@github.com wrote:

The spaces problem is not a problem. Uri.path returns a human string representation of the component. The slashes problem is a problem and we will have to maintain path components to solve it.

— Reply to this email directly or view it on GitHub.

mjambon commented 10 years ago

What we need for our application is extract the path as a string list from a Uri.t. For instance we want to obtain [""; "foo/bar"; "baz"] from "/foo%2Fbar/baz".

mjambon commented 10 years ago

Cohttp client uses Uri.path_and_query which like Uri.path produces an incorrect path in the HTTP request. File request.ml:

  let write_header req oc =
   let fst_line = Printf.sprintf "%s %s %s\r\n" (Code.string_of_method req.meth)
      (Uri.path_and_query req.uri) (Code.string_of_version req.version) in

Interestingly, nginx copes with spaces pretty well except when a space is followed by the letter 'H' (as in HTTP), which causes the request to fail. Our problematic URL was http://localhost/something/Palo%20Alto%20City%20Hall :-)

mjambon commented 10 years ago

In summary, I would prefer to have pairs of functions that read and write standard-compliant URL components by default, plus some extras for pretty-printing if that's something people want. Assuming backward compatibility is not an issue, I'm suggesting an interface along those lines:

type url_compatible = string
  (* string that may occur as a substring of a valid URL *)

type plain = string
  (* raw sequence of bytes *)

val of_string : url_compatible -> t

val to_string : t -> url_compatible
  (* returns a URL, where special characters including slash (`/`)
     within path segments are properly escaped (`%2F`).

     to_string (of_string "/a%2Fb") = "/a%2Fb"
   *)

val segments_of_path : url_compatible -> plain list
val path_of_segments : plain list -> url_compatible
  (*
     segments_of_path "/a%2Fb/c" = [""; "a/b"; "c"]
     path_of_segments [""; "a/b"; "c"] = "/a%2Fb/c"
   *)

val path : t -> url_compatible
  (* path (of_string "http://example.com/a%2Fb") = "/a%2Fb" *)

val path_segments : t -> plain list
  (*
     path_segments (of_string "a/b") = ["a"; "b"]
     path_segments (of_string "/a/b") = [""; "a"; "b"]
     path_segments (of_string "/a%2Fb/c") = [""; "a/b"; "c"]
   *)

val path_pretty : t -> string
  (*
     Either:

       path_pretty (of_string "/a%2Fb/The%20Hulk") = "/a%2Fb/The Hulk"

     or perhaps the irreversible:

       path_pretty (of_string "/a%2Fb/The%20Hulk") = "/a/b/The Hulk"

       (this is the current Uri.path implementation)
   *)
dsheets commented 10 years ago

These are actually 2 related bugs:

I'll try to work on a patch for this tomorrow. It shouldn't be too involved. See the failing tests.

And yes, the interface needs massive clean-up, compat breaking changes, and support for a number of other standards including RFC 3987, RFC 6570, and recursive fragment identifiers described by Errata 3330.

dsheets commented 10 years ago

Added another bug that was marked with the same known-error as the path delimiter bug this time with colons in userinfo components.

I hope to get a 2.0 release out sometime this spring. Thanks for your patience and your great report!