servo / rust-url

URL parser for Rust
https://docs.rs/url/
Apache License 2.0

Mutating path of URL without authority (idempotency, empty path segments) #984

Open x11x opened 1 month ago

x11x commented 1 month ago

I have found a couple more cases related to #601 (see https://github.com/whatwg/url/pull/505 and step 3 under section 4.5, URL Serializing, in the spec). They affect the ability to serialize idempotently when the path looks like an authority, i.e. the web+demo:/.//not-a-host -> web+demo://not-a-host problem demonstrated in the spec.

It occurs when mutating the path of a non-"cannot-be-a-base" URL without an authority.

(Note that "web+demo:/" is not "cannot-be-a-base" because its path is "/".)

use url::Url;

#[test]
fn test_can_be_a_base_with_set_path() {
    let mut url = Url::parse("web+demo:/").unwrap();
    assert!(!url.cannot_be_a_base());

    // Set path to "//not-a-host" using `set_path`
    url.set_path("//not-a-host");

    // PASSES (path is correctly set):
    assert_eq!(url.path(), "//not-a-host");

    // PASSES (has segments):
    let segments: Vec<_> = url.path_segments().expect("should have path segments").collect();
    // PASSES:
    assert_eq!(segments, vec!["", "not-a-host"]);

    // **FAILS**
    // EXPECTED: "web+demo:/.//not-a-host"
    // ACTUAL:   "web+demo://not-a-host"
    assert_eq!(url.as_str(), "web+demo:/.//not-a-host");
}

#[test]
fn test_can_be_a_base_with_path_segments_mut() {
    let mut url = Url::parse("web+demo:/").unwrap();
    assert!(!url.cannot_be_a_base());

    // Set path to "//not-a-host" using `path_segments_mut`
    url.path_segments_mut()
        .expect("should have path segments")
        .push("")  /* NOTE: any number of push("") here appears to make no difference (all ignored) */
        .push("not-a-host");

    // **FAILS**
    // EXPECTED: "web+demo:/.//not-a-host"
    // ACTUAL:   "web+demo:/not-a-host"
    assert_eq!(url.as_str(), "web+demo:/.//not-a-host");

    // **FAILS**
    // EXPECTED: "//not-a-host"
    // ACTUAL:   "/not-a-host"
    assert_eq!(url.path(), "//not-a-host");

    // PASSES (has segments):
    let segments: Vec<_> = url.path_segments().expect("should have path segments").collect();

    // **FAILS**
    // EXPECTED: ["", "not-a-host"]
    // ACTUAL:   ["not-a-host"]
    assert_eq!(segments, vec!["", "not-a-host"]);
}

(Sorry for the multiple failing assertions in the same test case; I just thought it was clearer to communicate that way.)

Have I misread the spec?

I am following the rules in the URL Serializing section, but is this different from URL Writing? Should URL Writing be followed when mutating the path? Possibly, if you follow URL Writing, it would disallow setting this type of path (I'm not super clear on this). URL Writing says:

A path-absolute-URL string must be U+002F (/) followed by a path-relative-URL string. A path-relative-URL string must be zero or more URL-path-segment strings, separated from each other by U+002F (/), and not start with U+002F (/).

But it specifies that URL-path-segment strings may be empty strings, so it's a little ambiguous whether "/" followed by an empty path segment followed by "/" is allowed. The fact that this kind of path is explicitly mentioned in URL Serializing makes it seem like it should be allowed.

I also looked at the specification of the URL API pathname setter and it says to parse it using the "basic URL parser" starting in the "path start state", which I think should respect empty path segments and makes it seem like "//not-a-host" should be a valid path to set, and should cause the resultant URL to have "//not-a-host" as its path.
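A minimal sketch of this reading of the spec, using only the standard library. SimpleUrl, split_path and serialize are hypothetical names for illustration and are not part of the url crate. The model is deliberately simplified: it assumes a non-opaque path, no port, credentials, query or fragment, and it skips percent-encoding and "." / ".." handling.

```rust
/// Hypothetical, simplified model of a URL with a non-opaque path.
/// Not the `url` crate's internal representation.
struct SimpleUrl {
    scheme: String,
    host: Option<String>,
    path: Vec<String>,
}

/// Split a pathname into segments, keeping empty segments.
/// Simplification: no percent-encoding, no "." / ".." handling.
fn split_path(pathname: &str) -> Vec<String> {
    // Drop the single leading "/" that starts an absolute path.
    let rest = pathname.strip_prefix('/').unwrap_or(pathname);
    rest.split('/').map(str::to_owned).collect()
}

/// Serialize following the shape of the URL Serializing steps,
/// including the "/." guard for paths that look like an authority.
fn serialize(url: &SimpleUrl) -> String {
    let mut out = format!("{}:", url.scheme);
    match &url.host {
        Some(host) => {
            out.push_str("//");
            out.push_str(host);
        }
        None => {
            // No host, more than one segment, and an empty first segment:
            // without "/." the output would reparse with a host.
            if url.path.len() > 1 && url.path[0].is_empty() {
                out.push_str("/.");
            }
        }
    }
    for segment in &url.path {
        out.push('/');
        out.push_str(segment);
    }
    out
}
```

Under this model, split_path("//not-a-host") yields the segments "" and "not-a-host", and serializing them with scheme "web+demo" and no host gives "web+demo:/.//not-a-host", which is the output the failing assertions expect.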

What do other implementations do?

I have not tested widely, just the JavaScript URL class in a few runtimes I have handy:

var url = new URL("web+demo:/");
url.pathname = "//not-a-host";
console.log(url.href)
// Node v20.17.0 gives:            'web+demo:/.//not-a-host'
// Chromium 129.0.6668.100 gives:  'web+demo:/'

Pushing empty path segment

As well as the serialization issue, there is possibly another issue: url::PathSegmentsMut::extend and url::PathSegmentsMut::push do not seem able to append an empty segment onto the path "/" (unless I have misunderstood how they are supposed to work, or misunderstood the spec). That should maybe be a separate issue.

Originally posted by @x11x in https://github.com/servo/rust-url/issues/601#issuecomment-2424825458

valenting commented 1 month ago

I think this will be fixed by #943

evilpie commented 1 month ago

This assert:

    // **FAILS**
    // EXPECTED: "web+demo:/.//not-a-host"
    // ACTUAL:   "web+demo:/not-a-host"
    assert_eq!(url.as_str(), "web+demo:/.//not-a-host");

Still fails:

thread 'test_can_be_a_base_with_path_segments_mut' panicked at url/tests/unit.rs:1402:5:
assertion `left == right` failed
  left: "web+demo:/not-a-host"
 right: "web+demo:/.//not-a-host"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

valenting commented 3 weeks ago

@theskim could you possibly also take a look at this bug? It deals with the path setters, not just URL parsing - which I think is the only bit left after we landed #943

theskim commented 3 weeks ago

Yep, I can take a look 👍

valenting commented 3 weeks ago

Thanks!

theskim commented 3 weeks ago

#996