mirage / ocaml-uri

RFC3986 URI parsing library for OCaml
Other
97 stars 57 forks source link

More strongly-typed HTTP and HTTPS URIs #162

Open torinnd opened 1 year ago

torinnd commented 1 year ago

I find myself reaching for a more strongly typed URI when the scheme is one of "http" or "https". As per https://httpwg.org/specs/rfc9110.html#rfc.section.4.2 the following semantics need to be enforced for absolute URIs with one of these two schemes:

One way this could be addressed is by leveraging a GADT like:

type 'a t

val host : [< `Base | `Http ] t -> string option
val host' : [ `Http ] t -> string

val scheme : [< `Base | `Http ] t -> string option
val scheme' : [ `Http ] t -> [ `Http | `Https ]

This would be a non-trivial API change for URI, of course. A smaller change could be to introduce a submodule into the MLI for an [Absolute_http.h] with appropriate conversion functions:

val of_t : t -> (h, [ `Msg of string]) result
val to_t : h -> t

... and a subset of the [Uri] MLI adapted to the constraints enforced by the scheme (e.g. [make] would require [host], omit [userinfo], and accept a more constrained set of [scheme]'s).

I have a fairly narrow understanding of other use cases for URI (for example, I see there is an old Issue #158 for Websocket support, which feels related), but imagine there are additional considerations besides what I've raised here. The big win for me personally would be a calling pattern where I could be confident that a [host] is present.

avsm commented 1 year ago

Introducing a submodule sounds fine, and we can move it out to a toplevel Absolute_uri in the future. I'm not sure that Absolute is the right prefix though; perhaps Uri_with_host or Uri.With_host is more descriptive but more unwieldy. It would be good to check through the rest of RFC9110 and make sure there's nothing else aside from the host that needs attention, while we're doing this.

torinnd commented 1 year ago

The deprecation of userinfo also seems important for RFC-compliance (https://httpwg.org/specs/rfc9110.html#rfc.section.4.2.4). A rough version of the submodule approach could be:

# mli
(** Specializations for HTTP and HTTPS schemes as per RFC9110 *)
module Absolute_http : sig
  type h

  val of_t : t -> (h, [ `Msg of string ]) result
  val to_t : h -> t

  val of_string : string -> (h, [ `Msg of string ]) result
  val to_string : ?pct_encoder:pct_encoder -> h -> string

  val make : scheme:[ `Http | `Https ]-> host:string ->
    ?port:int -> ?path:string -> ?query:(string * string list) list ->
    ?fragment:string -> unit -> h
end

# ml
module Absolute_http = struct
  type h =
    { scheme : [ `Http | `Https ]
    ; host : Pct.decoded
    ; port : int option
    ; path : Path.t
    ; query : Query.t
    ; fragment : Pct.decoded option
    }

  let ( let* ) = Result.bind

  let to_t { scheme; host; port; path; query; fragment } =
    let scheme =
      match scheme with
      | `Http -> Pct.cast_decoded "http"
      | `Https -> Pct.cast_decoded "https"
    in
    { scheme = Some scheme
    ; userinfo = None
    ; host = Some host
    ; port
    ; path
    ; query
    ; fragment
    }
  ;;

  let of_t { scheme; userinfo; host; port; path; query; fragment } =
    let* scheme =
      match scheme with
      | None -> Error (`Msg "No scheme present in URI")
      | Some scheme ->
        (match Pct.uncast_decoded scheme with
         | "http" -> Ok `Http
         | "https" -> Ok `Https
         | unsupported_scheme ->
           Error
             (`Msg
                (Printf.sprintf
                   "Only http and https URIs are supported. %s is invalid."
                   unsupported_scheme)))
    in
    let* host = Option.to_result ~none:(`Msg "host is required for HTTP(S) uris") host in
    let* () =
      match userinfo with
      | None -> Ok ()
      | Some (_ : string * string option) ->
        Error (`Msg "userinfo is a deprecated field in HTTP(S) URIs as per RFC9110")
    in
    Ok { scheme; host; port; path; query; fragment }
  ;;

  let of_string s = of_string s |> of_t
  let to_string ?pct_encoder h = to_t h |> to_string ?pct_encoder

  let normalize h =
    {h with
        host=Pct.cast_decoded (String.lowercase_ascii (Pct.uncast_decoded h.host))
       }

  let make ~scheme ~host ?port ?path ?query ?fragment () =
    let decode = function
      |Some x -> Some (Pct.cast_decoded x) |None -> None in
    let path = match path with
      |None -> [] | Some p ->
        let path = path_of_encoded p in
        match path with
        |  "/"::_ |  [] -> path
        | _  -> "/"::path
    in
    let query = match query with
      | None -> Query.KV []
      | Some p -> Query.KV p
    in
    normalize
      { scheme;
        host=Pct.cast_decoded host; port; path; query; fragment=decode fragment }
end
avsm commented 1 year ago

I see that Absolute_uri is a term from RFC9110, so we could just call this Uri.Absolute to indicate the mandatory presence of a non-empty path.

avsm commented 1 year ago

I think it might be worth separating the errors out for a deprecated user_info (which clients could validly choose to allow to pass through, after stripping it out) from the missing host error.

torinnd commented 1 year ago

I see that Absolute_uri is a term from RFC9110, so we could just call this Uri.Absolute to indicate the mandatory presence of a non-empty path.

Because the specificity seems important, I want to call out that an Absolute URI implies the presence of a hier-part, not necessarily an authority section (which is where a host identifier can exist):

   hier-part     = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty

I don't think this would satisfy the requirements in RFC9110, so it's probably not specific enough to describe this as a [Uri.Absolute.t]:

A sender MUST NOT generate an "http" URI with an empty host identifier. A recipient that processes such a URI reference MUST reject it as invalid.

I'll also offer mild pushback on the userinfo section, based on the language of the RFC:

A sender MUST NOT generate the userinfo subcomponent (and its "@" delimiter) when an "http" or "https" URI reference is generated within a message as a target URI or field value.

Before making use of an "http" or "https" URI reference received from an untrusted source, a recipient SHOULD parse for userinfo and treat its presence as an error

I suppose there's an argument to be made that the client-side SHOULD isn't strong enough to forbid this, but it seems to me that erroring when attempting to parse a Uri.t as a RFC9110-complaint HTTP(S) URI is aligned with the spirit of the RFC, if not the exact language.

avsm commented 1 year ago

An error doesn't mean that the connection should be aborted; it means that the URI shouldn't be propagated. From that RFC text, it seems valid for me for an HTTP proxy to spot a user-info field, remove it (perhaps authenticating out of band), and to make a downstream request without it. We should make sure our URI interfaces support that. It might be that it happens within Uri and not in Absolute_uri though.

torinnd commented 1 year ago

I've cleaned up the proposed changes above in #164