gnp closed this issue 5 months ago
Similar to #884, this issue confuses punycode with IDNA. Punycode is merely the encoding: https://datatracker.ietf.org/doc/html/rfc3492#section-7 (see the example encodings). The "xn--" prefix is part of IDNA: https://datatracker.ietf.org/doc/html/rfc5890
The punycode module does only punycode. to_ascii does IDNA.
Note that you can confirm the punycode behaviour in Python too:
"\u2655".encode("punycode").decode("ascii") # returns 'z5h'
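To make the split between the two layers concrete, here is a minimal sketch using only Python's standard-library punycode codec. The `ACE_PREFIX` constant and the variable names are illustrative; the prefix is added by hand because it belongs to IDNA, not to the punycode encoding itself.

```python
# The "xn--" ACE prefix is defined by IDNA (RFC 5890), not by punycode (RFC 3492).
ACE_PREFIX = "xn--"

# Raw punycode: just the encoding, with no prefix.
raw = "\u2655".encode("punycode").decode("ascii")

# An IDNA-style A-label is the prefix followed by the punycode output.
a_label = ACE_PREFIX + raw

# Decoding takes the raw punycode, so the prefix must be stripped first.
round_trip = a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

print(raw, a_label, round_trip == "\u2655")
```

This covers only the encoding step; real IDNA processing also involves mapping, normalization, and validity checks that this sketch does not attempt.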
@valenting If you don't consider this a bug, would you consider it a feature request? In my application I want to (a) preserve as much as possible the capitalization of the domain name entered by the user; and (b) only allow into the system domains that are considered valid by the rules of IDNA, if one is entered that does use Unicode (for example, respecting the length limitations on each label after encoding and on the entire domain name as well, plus the Unicode normalization, though I would pre-normalize for other reasons).
I did not see a way to achieve that result with this crate. Python's punycode will pass through uppercase ASCII characters. Perhaps Rust's could as well, with some Config setting if you don't want that to be the default.
I want to take user input like "SomeCompany.com" (or something with Unicode in it) and, aside from trimming it just in case and putting it through Unicode normalization, retain it in my database as entered, as long as running it through IDNA results in a valid domain.
I also want to store the normalized version (in which case the lowercasing is exactly what I want), so that I have both the user-preferred capitalization and the normalized version at hand, and if people type the same domain multiple ways they all work.
At the moment I do not see how to achieve these goals with the idna crate, but it does seem to have all the bits that would be needed if only it could be convinced to leave ASCII uppercase alone when needed.
Case insensitive but case preserving is my intent for my app.
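As a rough illustration of "case insensitive but case preserving", the following Python sketch encodes each label by hand and keeps ASCII letters as typed. It is not a conforming IDNA implementation: the function names are hypothetical, only the non-ASCII characters are lowercased (a simplification of the IDNA mapping step), and the only validity checks are the 63-octet label limit and the 253-octet name limit.

```python
import unicodedata

ACE_PREFIX = "xn--"
MAX_LABEL_LEN = 63   # octets per encoded label
MAX_NAME_LEN = 253   # octets for the whole name, without a trailing dot


def encode_label_preserving_case(label):
    """Encode one label, leaving ASCII letters in the case the user typed."""
    # Simplified mapping (an assumption): lowercase only non-ASCII characters.
    mapped = "".join(c if c.isascii() else c.lower() for c in label)
    if mapped.isascii():
        encoded = mapped
    else:
        # The punycode codec copies ASCII characters through unchanged.
        encoded = ACE_PREFIX + mapped.encode("punycode").decode("ascii")
    if not 0 < len(encoded) <= MAX_LABEL_LEN:
        raise ValueError(f"label empty or too long: {label!r}")
    return encoded


def to_display_and_canonical(user_input):
    """Return (case-preserving form, lowercased form) of a domain name."""
    name = unicodedata.normalize("NFC", user_input.strip())
    display = ".".join(encode_label_preserving_case(p) for p in name.split("."))
    if len(display) > MAX_NAME_LEN:
        raise ValueError("domain name too long")
    # Lowercasing the encoded form gives the key used for comparisons.
    return display, display.lower()


print(to_display_and_canonical(" SomeCompany.com "))
print(to_display_and_canonical("B\u00fccher.example"))
```

Storing both values lets lookups use the lowercased form while the user-facing form keeps the original capitalization. Whether a mixed-case "xn--" label is acceptable to downstream consumers is a separate question that this sketch does not settle.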
Your use case seems pretty niche, and I don't think it really belongs in the Url crate.
But I don't see a reason why it couldn't be implemented using the primitives provided by idna::punycode.
As you mentioned in the first comment, punycode::encode_str(input) maintains capitalization.
The following tests fail. The expected result is that one of punycode::encode_str or Config::...::to_ascii would (a) return a complete punycode output including the "xn--" prefix and no unexpected suffixes; and (b) leave the case of ASCII input characters as-is. Or that a third method would be available to achieve that result.
punycode::encode_str doesn't add the "xn--" prefix when needed, though it does leave ASCII case as-is, as expected. However, it does put a spurious "-" after "EXAMPLE".
Config::default().to_ascii(...) does add the "xn--" prefix when needed, but also lowercases ASCII, which is not desired. It does not add a spurious "-" at the end.
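For comparison, Python's standard-library punycode codec also emits a trailing "-" for an all-ASCII input. RFC 3492's encoding procedure outputs a delimiter whenever the input contains basic (ASCII) code points, so the dash looks like part of the raw encoding rather than something specific to one implementation. The short sketch below shows this; the variable names are illustrative.

```python
# All-ASCII input: the basic code points are copied, then the delimiter "-".
ascii_only = "EXAMPLE".encode("punycode").decode("ascii")

# Mixed input: ASCII characters (case preserved), delimiter, then the encoded part.
mixed = "B\u00fccher".encode("punycode").decode("ascii")

# Decoding drops the delimiter again.
decoded = ascii_only.encode("ascii").decode("punycode")

print(ascii_only, mixed, decoded)
```

An IDNA-level routine avoids the dangling dash by leaving all-ASCII labels unencoded, which is one reason the raw encoder and to_ascii differ on this input.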