servo / rust-url

URL parser for Rust
https://docs.rs/url/
Apache License 2.0
1.31k stars 325 forks source link

Bug: parser should not add trailing slash unconditionally #808

Closed wcarmon closed 1 year ago

wcarmon commented 1 year ago

The parser is incorrectly adding trailing slash here: https://github.com/servo/rust-url/blob/1c1e406874b3d2aa6f36c5d2f3a5c2ea74af9efb/url/src/parser.rs#L1143

Urls with and without a trailing slash are different, (although some servers choose to treat them the same way)

If this wasn't a mistake, this behavior should be a ParseOption, so we can interop with other languages.

Most other languages don't do this.

Here are some little test programs in other popular languages which demonstrate expected behavior

Golang example

package main

import (
    "fmt"
    "net/url"
)

func main() {
    uStr := "https://www.abc.net"
    parsed, _ := url.Parse(uStr)

    got := fmt.Sprintf("%v", parsed)

    if uStr != got {
        fmt.Printf("oops, should be the same: got=%v", got)  // never executes
    }
    fmt.Printf("matches exactly as expected")
}

Java example

  public static void main(String[] args) throws Exception {
        var uStr = "http://www.abc.net";
        java.net.URL url = new URL(uStr);
        java.net.URI uri = URI.create(uStr);

        if (!uStr.equals(url.toString())) {
            throw new RuntimeException("url should match exactly"); // never executes
        }

        if (!uStr.equals(uri.toString())) {
            throw new RuntimeException("uri should match exactly"); // never executes
        }

        System.out.println("both match exactly as expected");
  }

Rust example showing the bug

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn should_round_trip() -> Result<(), anyhow::Error> {
        let want = "https://www.abc.net";

        let u = Url::parse(want)?;  // bug happens deep in here

        let got = u.to_string();
        assert_eq!(want, got); // boom!

        Ok(())
    }
}
valenting commented 1 year ago

According the the URL spec and the reference implementation this is the correct behaviour.

tmccombs commented 1 year ago

A request to "http://example.com" would always have a path of / in the http request. An http request with an empty string as the path isn't valid.

wcarmon commented 1 year ago

If you'd rather not change your code, that is 100% your call; It's your repo.

I don't see any standard mentioning what either of you described. The standards either don't mention it or are counter to your comments above.

tmccombs commented 1 year ago

At first I thought the same as you, that the whatwg spec didn't add the trailing slash . But on a closer reading, if you follow the algorithm, it does, but it is subtle.

Basically, when parsing in the "path" state, if it reaches the end of the url, it appends the current buffer (in this case an empty string) to the array of path segments. So the deserialiazed path is a single empty segment. Then during serialization, that single segment is added with a "/" prefix .

Now, that said, it's impossible to know just from the text of the spec if this behavior was intentional, or not. It might be a bug in the spec. Or maybe the algorithm was specifically designed not to allow an empty path. It isn't clear to me.

You could file a ticket for the whatwg repo to either explicitly call out how the current algorithm handles this case, or change it to not add an empty segment to the path if the path doesn't contain any "/”.

I'm not sure how the nginx and apache links are relevant. Rust-url is only adding this trailing slash if the path is empty. Having a server that expects an a path of "" instead of "/" would be very unusual. And I think such a request wouldn't comply with the HTTP specs (although I haven't investigated that).

I'm curious what your use case is where this causes problems.

0xNF commented 7 months ago

Here's one use case where this causes issues:

Google OAuth2 checks the RedirectURL byte-for-byte. If a user registers a redirect url without a trailing slash, then it becomes impossible to use the Rust URL library to successfully authenticate with Google.

tmccombs commented 7 months ago

That sounds like a bug in google oauth2 to me. It should probably treat example.com and example.com/ the same.

0xNF commented 7 months ago

I agree, it does sound like a bug in Google's Oauth implementation. Sadly, Box.com suffers from the same problem.

Much of the internet is broken, and the strictness of the url library makes interfacing with the broken parts of the internet impossible.

valenting commented 7 months ago

Google OAuth2 checks the RedirectURL byte-for-byte. If a user registers a redirect url without a trailing slash, then it becomes impossible to use the Rust URL library to successfully authenticate with Google.

As far as I know it's the developer who registers the redirect-url, so they should make sure the path is correct, no?

0xNF commented 7 months ago

Sure, it'd be nice if everyone did the right thing all the time. But then, Rust itself probably wouldn't exist either, let alone this URL library.

My application isn't in control of what the user does on other systems. I can only work with the information they give me. If the user gives me a URL with no trailing slash, and my program adds one and tries to interface with Google (and box, and others), and they say "no", then my program is broken from their perspective.