servo / rust-url

URL parser for Rust
https://docs.rs/url/
Apache License 2.0
1.27k stars 317 forks source link

join() broken handling of URL with colon in path #869

Closed Koston-0xDEADBEEF closed 8 months ago

Koston-0xDEADBEEF commented 8 months ago

Here is a typical base Url: https://example.org/lorem/ where attempting to join additional path component: WEB-v2:gizmo.

Expected: https://example.org/lorem/WEB-v2:gizmo. Instead, Url::join() produces a rather mangled new struct; see below for details.

Using set_path() with same resulting path component works correctly. Also, join() works correctly, if the colon in path is URL-encoded (%3A).

Documentation mentions URL-encoding being a part of join() and several other methods, but I have not been able to get any of them to encode a colon.

Code to reproduce + output:

/*
[dependencies]
url = "2.4.1"
*/

use url::Url;

fn main() {
    let bs: &str = "https://example.org/lorem/";
    let p1: &str = "WEB-v2:gizmo";
    let mut base = Url::parse(bs).unwrap();
    let joined1 = base.join(p1).unwrap();
    eprintln!("/*");
    eprintln!("bs: {}", bs);
    eprintln!("p1: {}", p1);
    eprintln!("base Url\n{:#?}\n", base);
    eprintln!("p1 Url::join()\n{:#?}\n", joined1);
    base.set_path((base.path().to_owned()+p1).as_str());
    eprintln!("p1 Url::set_path()\n{:#?}\n", base);
    eprintln!("*/");
}

/*
bs: https://example.org/lorem/
p1: WEB-v2:gizmo
base Url
Url {
    scheme: "https",
    cannot_be_a_base: false,
    username: "",
    password: None,
    host: Some(
        Domain(
            "example.org",
        ),
    ),
    port: None,
    path: "/lorem/",
    query: None,
    fragment: None,
}

p1 Url::join()
Url {
    scheme: "web-v2",
    cannot_be_a_base: true,
    username: "",
    password: None,
    host: None,
    port: None,
    path: "gizmo",
    query: None,
    fragment: None,
}

p1 Url::set_path()
Url {
    scheme: "https",
    cannot_be_a_base: false,
    username: "",
    password: None,
    host: Some(
        Domain(
            "example.org",
        ),
    ),
    port: None,
    path: "/lorem/WEB-v2:gizmo",
    query: None,
    fragment: None,
}

*/
valenting commented 8 months ago

The problem is that "WEB-v2:gizmo" parses as a URL.

I get the same results using the Reference URL parser

Or using JS in the browser:

let base = new URL("https://example.org/lorem/");
// undefined
let u = new URL("WEB-v2:gizmo", base)
// undefined
u.href
// "web-v2:gizmo" 

If you need to use join just prepend ./ to the string you want to use if you know it's a path. Or use set_path as you indicated you can do.

Koston-0xDEADBEEF commented 8 months ago

I understand the logic, but is the join() method really serving a purpose by being so dumb?

Anyone who looks at URL "https://example.org/lorem/" and is asked to join "WEB-v2:gizmo" to it can come up with the intended result. But this method can not, even though it's the explicit, stated purpose of a method from a library dedicated to parsing URLs.

valenting commented 8 months ago

I agree that the API design is not great. Ideally this wouldn't be a method on the URL object, but a static name like Url::parse_with_base that makes it clear what the method does. Absent that, you have the documentation: https://docs.rs/url/latest/url/struct.Url.html#method.join

Parse a string as an URL, with this URL as the base URL.

Koston-0xDEADBEEF commented 8 months ago

Makes sense. Thanks for explanation!