haskell / network-uri

URI manipulation facilities

Incorrectly (I think) parsed URI from the spec #76

Closed NorfairKing closed 2 years ago

NorfairKing commented 2 years ago

The spec has these examples of URIs:

      ftp://ftp.is.co.za/rfc/rfc1808.txt

      http://www.ietf.org/rfc/rfc2396.txt

      ldap://[2001:db8::7]/c=GB?objectClass?one

      mailto:John.Doe@example.com

      news:comp.infosystems.www.servers.unix

      tel:+1-816-555-1212

      telnet://192.0.2.16:80/

      urn:oasis:names:specification:docbook:dtd:xml:4.1.2

Some of them are parsed incorrectly (I think):

ghci> uriPath <$> parseURIReference "mailto:John.Doe@example.com"
Just "John.Doe@example.com"
ghci> uriPath <$> parseURIReference "news:comp.infosystems.www.servers.unix"
Just "comp.infosystems.www.servers.unix"

  1. Is this indeed wrong?
  2. Could these be added to the test suite?

chreekat commented 2 years ago

I'm not qualified to say, but those parsings seem to match what's said here:

[3](https://datatracker.ietf.org/doc/html/rfc3986#section-3).  Syntax Components

   The generic URI syntax consists of a hierarchical sequence of
   components referred to as the scheme, authority, path, query, and
   fragment.

      URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

      hier-part   = "//" authority path-abempty
                  / path-absolute
                  / path-rootless
                  / path-empty

   The following are two example URIs and their component parts:

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
          |   _____________________|__
         / \ /                        \
         urn:example:animal:ferret:nose

[3.3](https://datatracker.ietf.org/doc/html/rfc3986#section-3.3).  Path

   The path component contains data, usually organized in hierarchical
   form, that, along with data in the non-hierarchical query component
   ([Section 3.4](https://datatracker.ietf.org/doc/html/rfc3986#section-3.4)), serves to identify a resource within the scope of the
   URI's scheme and naming authority (if any).  The path is terminated
   by the first question mark ("?") or number sign ("#") character, or
   by the end of the URI.
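
To make the grammar above concrete, here is a rough sketch of the `hier-part` decision using only `base`. The `Parts` type and `splitHier` function are hypothetical names made up for this illustration; they are not part of network-uri, they do no validation or percent-decoding, and they assume an absolute URI with a scheme. The only point is that an authority exists solely when the text after the scheme's colon starts with `//`.

```haskell
import Data.List (stripPrefix)

-- Hypothetical type for this sketch (not from network-uri).
data Parts = Parts
  { pScheme    :: String
  , pAuthority :: Maybe String  -- Nothing unless hier-part starts with "//"
  , pPath      :: String
  } deriving (Eq, Show)

-- Hypothetical helper: split at the first ':', drop query and fragment,
-- then pick the hier-part alternative based on a leading "//".
splitHier :: String -> Maybe Parts
splitHier s =
  case break (== ':') s of
    (scheme, ':' : rest)
      | not (null scheme) ->
          Just (classify scheme (takeWhile (`notElem` "?#") rest))
    _ -> Nothing
  where
    classify scheme hier =
      case stripPrefix "//" hier of
        -- "//" authority path-abempty: authority runs up to the next '/'
        Just afterSlashes ->
          let (auth, path) = break (== '/') afterSlashes
          in Parts scheme (Just auth) path
        -- path-absolute / path-rootless / path-empty: everything is path
        Nothing -> Parts scheme Nothing hier

main :: IO ()
main = mapM_ (print . splitHier)
  [ "mailto:John.Doe@example.com"
  , "news:comp.infosystems.www.servers.unix"
  , "telnet://192.0.2.16:80/"
  ]
```

Under these assumptions the `mailto:` and `news:` examples come out with no authority and the whole remainder as the path, while `telnet://192.0.2.16:80/` gets an authority of `192.0.2.16:80` and a path of `/`.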
NorfairKing commented 2 years ago

Huh, I expected this:

mailto:John.Doe@example.com
\____/ \______/ \_________/ \___/
scheme  userinfo  regname   no path
        \_______________/
          authority

news:comp.infosystems.www.servers.unix
\__/ \_______________________________/ \___/
scheme regname                          no path

Instead of what is currently parsed:

mailto:John.Doe@example.com
\____/ \__________________/ 
scheme     path

news:comp.infosystems.www.servers.unix
\__/ \_______________________________/ 
scheme     path

ezrakilty commented 2 years ago

@chreekat is right; these parsings are counterintuitive, but the spec is clear about them. There is a grammar and a bit of text to this effect:

      hier-part   = "//" authority path-abempty
                  / path-absolute
                  / path-rootless
                  / path-empty

When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//").

I did some digging to try to justify this choice in the spec, although honestly I'm at a loss. I do think the email address in a mailto: URI is somewhat different in nature from an authority component, even though the two look similar.
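
For anyone who wants to check this against the library, a small sketch along these lines should do it. It assumes the network-uri package is installed; `describe` is a hypothetical helper written for this comment, and the inputs are examples from the spec quoted earlier in the thread.

```haskell
import Network.URI (URI (..), URIAuth (..), parseURIReference)

-- Hypothetical helper: show the scheme, authority reg-name (if any),
-- and path that network-uri reports for a URI string.
describe :: String -> String
describe s =
  case parseURIReference s of
    Nothing -> s ++ " => no parse"
    Just u  -> concat
      [ s
      , " => scheme=", show (uriScheme u)
      , ", regname=", show (uriRegName <$> uriAuthority u)
      , ", path=", show (uriPath u)
      ]

main :: IO ()
main = mapM_ (putStrLn . describe)
  [ "mailto:John.Doe@example.com"              -- no "//", so no authority
  , "news:comp.infosystems.www.servers.unix"   -- no "//", so no authority
  , "telnet://192.0.2.16:80/"                  -- "//" introduces an authority
  , "ldap://[2001:db8::7]/c=GB?objectClass?one"
  ]
```

The first two should report no authority and the full remainder as the path, matching the ghci output at the top of this issue; the last two start with `//` after the scheme, so they should report an authority.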

All that said, I would like to add some test cases that concretely demonstrate the expected parsings and cite the spec alongside. I'll do that shortly.

ezrakilty commented 2 years ago

OK, a few more test cases have been added demonstrating this behavior. Thanks for prompting it!