servo / rust-url

URL parser for Rust
https://docs.rs/url/
Apache License 2.0

Neither punycode::encode_str nor Config::...::to_ascii return expected results for single Unicode char and "EXAMPLE" #900

Closed gnp closed 5 months ago

gnp commented 5 months ago

The following tests fail. I expected that either punycode::encode_str or Config::...::to_ascii would (a) return complete punycode output, including the "xn--" prefix and no unexpected suffixes, and (b) leave the case of ASCII input characters as-is; or that a third method would be available to achieve that result.

punycode::encode_str doesn't add the "xn--" prefix when needed, though it does leave ASCII case as-is, as expected. However, it does append a spurious "-" after "EXAMPLE".

Config::default().to_ascii(...) does add the "xn--" prefix when needed, but it also lowercases ASCII, which is not desired. It does not add a spurious "-" at the end.

    #[test]
    fn punycode_should_have_prefix() {
        use idna::punycode;

        let input = "\u{2655}"; // White Chess Queen
        let expected = "xn--z5h";
        let result = punycode::encode_str(input).unwrap(); // Does not work. Returns "z5h".
        // let result = idna::Config::default().to_ascii(input).unwrap(); // Does work. Returns "xn--z5h".
        assert_eq!(result, expected);
    }

    #[test]
    fn punycode_should_not_lowercase_ascii() {
        use idna::punycode;

        let input = "EXAMPLE";
        let expected = "EXAMPLE";
        let result = punycode::encode_str(input).unwrap(); // Does not work. Returns "EXAMPLE-".
        // let result = idna::Config::default().to_ascii(input).unwrap(); // Does not work. Returns "example".

        assert_eq!(result, expected);
    }

valenting commented 5 months ago

Similar to #884, this issue confuses punycode with IDNA. Punycode is merely the encoding: https://datatracker.ietf.org/doc/html/rfc3492#section-7 (see the example encodings). The "xn--" prefix is part of IDNA: https://datatracker.ietf.org/doc/html/rfc5890

The punycode module does only punycode. to_ascii does IDNA.

Note that you can confirm the punycode behaviour in Python too:

    "\u2655".encode("punycode").decode("ascii")  # returns 'z5h'

gnp commented 5 months ago

@valenting If you don't consider this a bug, would you consider it a feature request? In my application I want to (a) preserve, as much as possible, the capitalization of the domain name entered by the user; and (b) only allow into the system domains that are considered valid by the rules of IDNA, if one is entered that does use Unicode (for example, respecting the length limitations on each label after encoding and on the entire domain name, as well as the Unicode normalization, though I would pre-normalize for other reasons).

I did not see a way to achieve that result with this crate. Python's punycode will pass through uppercase ASCII characters; perhaps Rust's could as well, with some Config setting if you don't want that to be the default.

I want to take user input like "SomeCompany.com" (or something with Unicode in it) and, aside from trimming it just in case and putting it through Unicode normalization, retain it in my database, as long as running it through IDNA results in a valid domain.

I also want to store the normalized version (in which case the lowercasing is exactly what I want), so that I have both the user-preferred capitalization and the normalized version at hand, and so that the same domain typed in multiple ways always works.

At the moment I do not see how to achieve these goals with the idna crate, but it does seem to have all the bits that would be needed if only it could be convinced to leave ASCII uppercase alone when needed.

Case insensitive but case preserving is my intent for my app.
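
The storage scheme described above (keep the user's spelling, compare on a normalized key) could be sketched as follows. This is only an illustration under assumptions: DomainRecord, DomainRegistry, and ascii_lowercase_only are hypothetical names, and the stand-in normalizer only lowercases ASCII so that the snippet needs nothing beyond the standard library. In a real application, a closure wrapping something like Config::default().to_ascii(...) would presumably be passed in instead.

```rust
use std::collections::HashMap;

/// A domain stored in two forms: what the user typed and a lookup key.
#[derive(Debug, Clone, PartialEq)]
struct DomainRecord {
    display: String,    // user-preferred capitalization, trimmed
    normalized: String, // canonical form used for comparisons
}

/// Hypothetical registry keyed by the normalized form.
/// `normalize` returns None when the input is not an acceptable domain.
struct DomainRegistry<F: Fn(&str) -> Option<String>> {
    normalize: F,
    by_key: HashMap<String, DomainRecord>,
}

impl<F: Fn(&str) -> Option<String>> DomainRegistry<F> {
    fn new(normalize: F) -> Self {
        Self { normalize, by_key: HashMap::new() }
    }

    /// Stores the first-seen spelling; later spellings of the same domain keep it.
    fn insert(&mut self, input: &str) -> Option<&DomainRecord> {
        let display = input.trim();
        let normalized = (self.normalize)(display)?;
        let record = self
            .by_key
            .entry(normalized.clone())
            .or_insert_with(|| DomainRecord { display: display.to_string(), normalized });
        Some(&*record)
    }

    /// Case-insensitive lookup: any spelling that normalizes to the same key matches.
    fn lookup(&self, input: &str) -> Option<&DomainRecord> {
        let key = (self.normalize)(input.trim())?;
        self.by_key.get(&key)
    }
}

/// Stand-in normalizer (an assumption, not IDNA): accepts only non-empty ASCII
/// input and lowercases it.
fn ascii_lowercase_only(s: &str) -> Option<String> {
    if !s.is_empty() && s.is_ascii() {
        Some(s.to_ascii_lowercase())
    } else {
        None
    }
}
```

With DomainRegistry::new(ascii_lowercase_only), inserting "SomeCompany.com" and then looking up "SOMECOMPANY.COM" returns the record whose display field is still "SomeCompany.com".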

valenting commented 5 months ago

Your use case seems pretty niche, and I don't think it really belongs in the Url crate. But I don't see a reason why it couldn't be implemented using the primitives provided by idna::punycode. As you mentioned in the first comment, punycode::encode_str(input) maintains capitalization.
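
One possible reading of that suggestion is sketched below, under several assumptions. To keep the snippet self-contained, punycode_encode is a std-only transcription of the RFC 3492 encoding procedure standing in for idna::punycode::encode_str; ace_label_preserving_case and ace_domain_preserving_case are hypothetical names, not part of the idna crate. The sketch adds the "xn--" prefix only to labels that contain non-ASCII characters and leaves ASCII case alone. It performs none of IDNA's mapping, normalization, or validity checks beyond two DNS length limits, so validity would still have to be checked separately, for example with to_ascii.

```rust
const BASE: u32 = 36;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;
const SKEW: u32 = 38;
const DAMP: u32 = 700;

/// Bias adaptation from RFC 3492, section 6.1.
fn adapt(mut delta: u32, num_points: u32, first_time: bool) -> u32 {
    delta /= if first_time { DAMP } else { 2 };
    delta += delta / num_points;
    let mut k = 0;
    while delta > ((BASE - T_MIN) * T_MAX) / 2 {
        delta /= BASE - T_MIN;
        k += BASE;
    }
    k + ((BASE - T_MIN + 1) * delta) / (delta + SKEW)
}

/// Maps a digit value 0..=35 to 'a'..='z' followed by '0'..='9'.
fn digit(d: u32) -> char {
    if d < 26 { (b'a' + d as u8) as char } else { (b'0' + (d - 26) as u8) as char }
}

/// Bare Punycode (RFC 3492, section 6.3): no "xn--" prefix, ASCII copied as-is.
/// Returns None on arithmetic overflow.
fn punycode_encode(input: &str) -> Option<String> {
    let cps: Vec<u32> = input.chars().map(|c| c as u32).collect();
    let mut out: String = input.chars().filter(|c| c.is_ascii()).collect();
    let basic = out.len() as u32;
    if basic > 0 {
        out.push('-'); // delimiter after the ASCII part; the trailing '-' in "EXAMPLE-"
    }
    let (mut n, mut delta, mut bias, mut handled) = (128u32, 0u32, 72u32, basic);
    while (handled as usize) < cps.len() {
        // Smallest code point not yet handled.
        let m = *cps.iter().filter(|&&c| c >= n).min()?;
        delta = delta.checked_add((m - n).checked_mul(handled + 1)?)?;
        n = m;
        for &c in &cps {
            if c < n {
                delta = delta.checked_add(1)?;
            }
            if c == n {
                // Emit delta as a generalized variable-length integer.
                let (mut q, mut k) = (delta, BASE);
                loop {
                    let t = if k <= bias { T_MIN } else if k >= bias + T_MAX { T_MAX } else { k - bias };
                    if q < t {
                        break;
                    }
                    out.push(digit(t + (q - t) % (BASE - t)));
                    q = (q - t) / (BASE - t);
                    k += BASE;
                }
                out.push(digit(q));
                bias = adapt(delta, handled + 1, handled == basic);
                delta = 0;
                handled += 1;
            }
        }
        delta += 1;
        n += 1;
    }
    Some(out)
}

/// Hypothetical helper: ASCII labels pass through untouched; any other label
/// becomes "xn--" followed by its Punycode form.
fn ace_label_preserving_case(label: &str) -> Option<String> {
    if label.is_ascii() {
        Some(label.to_string())
    } else {
        Some(format!("xn--{}", punycode_encode(label)?))
    }
}

/// Hypothetical helper: applies the label rule across a dotted name, then checks
/// the DNS limits of 63 bytes per label and 253 bytes for the whole name.
fn ace_domain_preserving_case(domain: &str) -> Option<String> {
    let labels: Option<Vec<String>> = domain.split('.').map(ace_label_preserving_case).collect();
    let name = labels?.join(".");
    let ok = name.len() <= 253 && name.split('.').all(|l| !l.is_empty() && l.len() <= 63);
    if ok { Some(name) } else { None }
}
```

With these definitions, punycode_encode("\u{2655}") gives Some("z5h") and punycode_encode("EXAMPLE") gives Some("EXAMPLE-"), matching the results reported in the first comment, while ace_domain_preserving_case("\u{2655}.Example") gives Some("xn--z5h.Example").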