servo / rust-url

URL parser for Rust
https://docs.rs/url/
Apache License 2.0
1.31k stars 325 forks source link

Url with quote after schema is getting parsed as valid. #932

Closed mxaddict closed 4 months ago

mxaddict commented 4 months ago

Have this code:

use url::Url;

fn main() {
    let url_str = "http://\"https//test.com/some/path";

    match Url::parse(url_str) {
        Ok(url) => {
            // Proceed with using the URL
            println!("Valid URL: {}", url);
        },
        Err(e) => {
            // Handle the error, log it, etc.
            eprintln!("Invalid URL: {}", e);
        }
    }
}

Expected this to parse and say that the url is invalid.

Instead it's parsing and saying it's valid.

See attached screenshot: image

mxaddict commented 4 months ago

Printing the host value it's returning "https Which is not a valid host if i recall correctly from the URL standards. image

valenting commented 4 months ago

According to the URL standard, it's a valid URL. https://jsdom.github.io/whatwg-url/#url=aHR0cDovLyJodHRwcy8vdGVzdC5jb20vc29tZS9wYXRo&base=YWJvdXQ6Ymxhbms=

" is not an invalid character.

mxaddict commented 4 months ago

Hmm, so the issue is with reqwest then, cause I'm passing URL to reqwest after parsing with url::Url::parse(), and reqwest is complaining about invalidUriChar

mxaddict commented 4 months ago

Moving the bug report to reqwest then, thanks for the clarification

mxaddict commented 4 months ago

Hmm, after doing some more digging, seems that quotes are not a valid hostname character: Check RFC 3986.

I also tested in chrome and it does not thing that hostname with quotes is a valid url, just paste the test url from my code and test it in chrome.

valenting commented 4 months ago

Well, it's a valid by the URL standard. But " is not a valid DNS name character, so obviously the request will fail. I'm assuming that's why it's failing, and not exactly while parsing the URL - which as you observed, does successfully parse.

valenting commented 4 months ago

Hmm, after doing some more digging, seems that quotes are not a valid hostname character: Check RFC 3986.

I also tested in chrome and it does not thing that hostname with quotes is a valid url, just paste the test url from my code and test it in chrome.

Firefox and Chrome don't accept the character either. But still the URL standard says it's valid and that's what this crate implements.

mxaddict commented 4 months ago

I see, so it's allowed as a URI but not allowed by the HTTP url standard?

Hmmm, not sure where the fix for this should be then.

mxaddict commented 4 months ago

I think this section of RFC 3986 might apply here:

host        = IP-literal / IPv4address / reg-name

IP-literal  = "[" ( IPv6address / IPvFuture  ) "]"

IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet

reg-name    = *( unreserved / pct-encoded / sub-delims )

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="
mxaddict commented 4 months ago

Wouldn't this imply that hostname should be only apha-numeric with the allowed chars?

I don't see how " would be valid in this case.

mxaddict commented 4 months ago

Anyway, thanks for the help, might have to add a second layer of validation on my code between rust-url parse and reqwest lib.

This seems like a niche situation 🤣

mxaddict commented 4 months ago

I just added this to my code after the URL parse step (which should work for my usecase)

                    if url_start.host_str().unwrap().contains("\"") {
                        update_url(&pool, url, &vec![], "", URL_ERR_HOST).await;
                        thread_error_count += 1;
                        thread_done_count += 1;
                        continue;
                    }
mxaddict commented 4 months ago

Thanks for the info, love this crate by the way, has helped me a ton! 👍🏼