golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.95k stars 17.66k forks source link

net/url: make URL parsing return an error on IPv6 literal without brackets #31024

Open noneymous opened 5 years ago

noneymous commented 5 years ago

What version of Go are you using (go version)?

go version go1.11.2 windows/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GOARCH=amd64
set GOHOSTARCH=amd64
set GOHOSTOS=windows

What did you do?

url.URL provides the methods .Hostname() and .Port() which schould split the URL's host (.Host) into its address and port part. In certain cases, these functions are not able to interpret IPv6 addresses correctly, leading to invalid output.

Here is a test function feeding different sample URLs, demonstrating the issue (all test URLs are valid and should succeed!):

func TestUrl(t *testing.T) {
    tests := []struct {
        name        string
        input string
        wantHost     string
        wantPort string
    }{
        {"domain-unknonw-scheme", "asfd://localhost/?q=0&p=1#frag", "localhost", ""},
        {"domain-implicit", "http://localhost/?q=0&p=1#frag", "localhost", ""},
        {"domain-explicit", "http://localhost:80/?q=0&p=1#frag", "localhost", "80"},
        {"domain-explicit-other-port", "http://localhost:80/?q=0&p=1#frag", "localhost", "80"},

        {"ipv4-implicit", "http://127.0.0.1/?q=0&p=1#frag", "127.0.0.1", ""},
        {"ipv4-explicit", "http://127.0.0.1:80/?q=0&p=1#frag", "127.0.0.1", "80"},
        {"ipv4-explicit-other-port", "http://127.0.0.1:80/?q=0&p=1#frag", "127.0.0.1", "80"},

        {"ipv6-explicit", "http://[1::]:80/?q=0&p=1#frag", "1::", "80"},
        {"ipv6-explicit-other-port", "http://[1::]:80/?q=0&p=1#frag", "1::", "80"},

        {"ipv6-implicit-1", "http://[1::]/?q=0&p=1#frag", "1::", ""},

        {"ipv6-implicit-2", "http://1::/?q=0&p=1#frag", "1::", ""},
        {"ipv6-implicit-3", "http://1::2008/?q=0&p=1#frag", "1::2008", ""},
        {"ipv6-implicit-4", "http://1::2008:1/?q=0&p=1#frag", "1::2008:1", ""},
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // Prepare URL
            u, err := url.Parse(tt.input)
            if err != nil {
                t.Errorf("could not parse url: %v", err)
                return
            }

            // Extract hostname and port
            host := u.Hostname()
            port := u.Port()

            // Compare result
            if host != tt.wantHost {
                t.Errorf("TestUrl() got = %v, want %v", host, tt.wantHost)
            }
            if port != tt.wantPort {
                t.Errorf("TestUrl() got1 = %v, want %v", port, tt.wantPort)
            }
        })
    }
}

Output: grafik

What did you expect to see?

All sample URLs in the test cases above are valid ones, hence, all tests should succeed as defined

What did you see instead?

The top test samples work as expected, however, the bottom three return incorrect results. The bottom three samples are valid IPv6 URLs with inplicit port specification, but .Hostname() and .Port() interpres them as IPv4 addresses returning parts of the IPv6 address as if it was the explicit port of an IPv4 input. E.g., in one of the test outputs, ":2008" is returned as the port, but it is actually part of the IPv6 address.

Where is the bug?

.Hostname() and .Port() implement their own logic to split the port from the hostname. I've found that there is already a close function in the net package, called net.SplitHostPort(), which does it's job correctly. If .Hostname() and .Port() just called that function instead of re-implementing the logic, everything should work as expected. Below is the proof in form of a test function with exactly the same inputs as above, but using net.SplitHostPort() instead of .Hostname() / .Port().

func TestUrlNetSplit(t *testing.T) {
    tests := []struct {
        name        string
        input string
        wantHost     string
        wantPort string
    }{
        {"domain-unknonw-scheme", "asfd://localhost/?q=0&p=1#frag", "localhost", ""},
        {"domain-implicit", "http://localhost/?q=0&p=1#frag", "localhost", ""},
        {"domain-explicit", "http://localhost:80/?q=0&p=1#frag", "localhost", "80"},
        {"domain-explicit-other-port", "http://localhost:80/?q=0&p=1#frag", "localhost", "80"},

        {"ipv4-implicit", "http://127.0.0.1/?q=0&p=1#frag", "127.0.0.1", ""},
        {"ipv4-explicit", "http://127.0.0.1:80/?q=0&p=1#frag", "127.0.0.1", "80"},
        {"ipv4-explicit-other-port", "http://127.0.0.1:80/?q=0&p=1#frag", "127.0.0.1", "80"},

        {"ipv6-explicit", "http://[1::]:80/?q=0&p=1#frag", "1::", "80"},
        {"ipv6-explicit-other-port", "http://[1::]:80/?q=0&p=1#frag", "1::", "80"},

        {"ipv6-implicit-1", "http://[1::]/?q=0&p=1#frag", "1::", ""},

        {"ipv6-implicit-2", "http://1::/?q=0&p=1#frag", "1::", ""},
        {"ipv6-implicit-3", "http://1::2008/?q=0&p=1#frag", "1::2008", ""},
        {"ipv6-implicit-4", "http://1::2008:1/?q=0&p=1#frag", "1::2008:1", ""},
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // Prepare URL
            u, err := url.Parse(tt.input)
            if err != nil {
                t.Errorf("could not parse url: %v", err)
                return
            }

            host, port, err := net.SplitHostPort(u.Host)
            if err != nil {
                // Could not split port from host, as there is no port specified
                host = u.Host
                port = ""
            }

            // Compare result
            if host != tt.wantHost {
                t.Errorf("TestUrl() got = %v, want %v", host, tt.wantHost)
            }
            if port != tt.wantPort {
                t.Errorf("TestUrl() got1 = %v, want %v", port, tt.wantPort)
            }
        })
    }
}

Output: grafik

dmitshur commented 5 years ago

/cc @rsc @bradfitz (per net/url owners)

bradfitz commented 5 years ago

Your ipv6-implicit-2, ipv6-implicit-3, and ipv6-implicit-4 aren't valid URLs. IPv6 literals need to be in square brackets.

bradfitz commented 5 years ago

Closing for now. Let me know if I'm wrong, though.

noneymous commented 5 years ago

Thank you for your response @bradfitz !

The rfc2732 recommends the use of brackets in all cases. However, it is not mandatory and can be technically omitted if no explicit port needs to be defined.

The thing is: In many cases the URL might not be built by the developer, but be received from some other source. Using url.Parse() is already an small indicator that URLs are received and parsed rather than built by the developer.

One common example, would be parsing URLs returned by a website. The website is free to return any URL format (can be decided by the web application developer). Relative URLs but also absolute ones. If the website returns absolute URLs, they might not follow the recommendation of the RFC but omit the brackets. url.Parse() would not return an error but garbage difficult to catch.

Sanitizing every received URL before parsing would be a fix, but unnecessasry overhead. It would be related to re-implementing the url.Parse() function.

The following might be a less valid argument in the views of some, but there is already the net.SplitHostPort() function in a closeby package. It is a little more forgiving and having duplicate code might not be the best practice either.

It would be also okay if parsing "http://1::2008" and calling .Hostname() returned "[1::2008]" in compliance with the RFC's suggestion for

rfc2732 : https://www.ietf.org/rfc/rfc2732.txt definition of "should": https://tools.ietf.org/html/rfc2119

bradfitz commented 5 years ago

Well, that SHOULD is news to me. I thought it was a requirement. I think it might've been upgraded to a requirement in later (post 1999) RFCs.

In any case,

3. SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.

Who decided that the valid reasons were to not use square brackets? Did they carefully weigh that libraries (such as Go) might reject them?

Is there any other software today that generates them? That is, how did this bug come about? Is this real or hypothetical?

noneymous commented 5 years ago

It was hypothetical based on the RFC's definition and that IPv6 addresses go without brackets by default unless a specific endpoint (port) is to be specified.

However, I now tried how browsers (Chrome, FF, IE, Edge) react on IPv6 URLs without brackets. It seems they don't accept it and understand such input as search terms. Obviously, the general understanding is that brackets are mandatory for IPv6 URLs. Hence, no working website could apply it without. Web developers could build it but it wouldn't be useful.

If being forgiving is not desired in this case, I'm suggesting, that url.Parse() throws an error in order to prevent from continuing with garbage data! See garbage output in sample below.

func main() {
    u, err := url.Parse("http://1::12:2008")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(u.Hostname()) // => 1
    fmt.Println(u.Port()) // => :12:2008
}

(If the port is not empty or an integer, something went bad.)

bradfitz commented 5 years ago

Okay, I've repurposed this bug to be about rejecting them instead.

philandstuff commented 5 years ago

I'm pretty sure that RFC 3986 (which obsoletes RFC 2732) requires square brackets for IPv6 literals. The ABNF rules defined in RFC 3986 only allow an IPv6address inside square brackets, and there is nothing in the text to suggest that a reg-name that looks like an IPv6address should be interpreted that way.

This silence on IPv6 addresses without square brackets is notable, because the text does acknowledge the ambiguity between reg-name and IPv4address:

The syntax rule for host is ambiguous because it does not completely distinguish between an IPv4address and a reg-name. In order to disambiguate the syntax, we apply the "first-match-wins" algorithm: If host matches the rule for IPv4address, then it should be considered an IPv4 address literal and not a reg-name.

noneymous commented 5 years ago

Well, an IPv6 address cannot be interpreted as a hostname due to the colons, in contrast to an IPv4 address. Anyways, it seems clear now that brackets are required to be compliant.

Any suggestions how to make the function detect and reject mentioned invalid URLs? Validating the host part might be error prone because its content could be IPv4, IPv6 or a hostname. The error actually materializes in the port part. So I thought, a test whether the port is either empty or a valid integer would catch the issue. It would also catch potential other issues not directly related with IPv6 and brackets. But I can't help it feels too cheap.

philandstuff commented 5 years ago

I would suggest that RFC 3986 provides the solution - it is the literal source of truth for what can be a valid URL. That does mean parsing it as IPv4, IPv6, IPvFuture (yes, that's a thing), or a reg-name.

I note also that currently, net/url thinks that https://[::y] is a valid URL, which is another non-RFC-3986-compliant URL string.

gopherbot commented 5 years ago

Change https://golang.org/cl/184117 mentions this issue: net/url: Parse returns an error on IPv6 literal without brackets

gopherbot commented 5 years ago

Change https://golang.org/cl/189258 mentions this issue: net/url: make Hostname and Port predictable for invalid Host values

Neustradamus commented 3 years ago

Have you progressed on this bug?

dekimsey commented 3 years ago

Not to go off-tangent too much. But what is the intent in url.Parse on strings that have no scheme? I came across this in a library that was using url.Parse on ipv4 addresses and would fail when given implicit or explicit ipv6 addresses. The library should not have been using url.Parse on plain ip address strings, but it did make me wonder why net.Parse was not returning an error.

For example, url.Parse("127.0.0.1") parses, but returns url.URL{Path: "127.0.0.1"} which struck me as nonsensical. When I checked RFC3986 Section 3.1 and Section 1.1.1 I got the impression scheme has to exist to be a valid URL.

The behavior I'm looking at is demonstrated here: https://play.golang.org/p/MzW9lCD7Min

noneymous commented 3 years ago

Not to go off-tangent too much. But what is the intent in url.Parse on strings that have no scheme?

I'm using url.Parse to parse hrefs extracted from web sites, but they can be defined in various ways (and are seldomly with scheme):

Different web developers or content management systems may chose different variants and or a combination for reasons.

slrz commented 3 years ago

For example, url.Parse("127.0.0.1") parses, but returns url.URL{Path: "127.0.0.1"} which struck me as nonsensical. When I checked RFC3986 Section 3.1 and Section 1.1.1 I got the impression scheme has to exist to be a valid URL.

5.2. Relative Resolution allows those as URI references. The ResolveReference method on url.URL strongly hints at the intention to support this in net/url.

The parsing of 127.0.0.1 seems to be correct as per the above.