lo48576 / iri-string

String types for URIs/IRIs.
Apache License 2.0

How should case normalization work for domains that are US-ASCII only before decoding, but not after decoding? #38

Closed: lo48576 closed this issue 1 week ago

lo48576 commented 5 months ago

Reported at #36.


  • Normalizing "a://%99B/" yields "a://%99B/". The result is supposed to be "a://%99b/" IIUC.

—— https://github.com/lo48576/iri-string/issues/36#issue-2241246314

When an IRI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and US-ASCII only host are case insensitive and therefore should be normalized to lowercase. (snip) Case equivalence for non-ASCII characters in IRI components that are IDNs are discussed in section 5.3.3.

—— RFC 3987, §5.3.2.1 Case Normalization

https://github.com/lo48576/iri-string/blob/021fce896eed51d388161170fdd30c82cd664ba8/src/normalize.rs#L513-L521
https://github.com/lo48576/iri-string/blob/021fce896eed51d388161170fdd30c82cd664ba8/src/parser/trusted.rs#L465-L470

So... Is %99B "US-ASCII only"? The current code considers it's not, because 0x99 is not a valid US-ASCII character. I don't remember how non-decoded percent-encoding should be handled, so I need to do more research.

—— https://github.com/lo48576/iri-string/issues/36#issuecomment-2053675842

I see why our implementations differ in the second case now. I haven't taken a proper look at RFC 3987 and wrote my code based solely on RFC 3986 which only says "the scheme and host are case-insensitive" instead of "the scheme and US-ASCII only host are case insensitive". I also have no idea what "US-ASCII only" means in that context.

—— https://github.com/lo48576/iri-string/issues/36#issuecomment-2053688909

lo48576 commented 5 months ago

I think this could be an ambiguity in the spec (or just my carelessness), but either way I need to investigate further.

I want to keep URIs (RFC 3986) and IRIs (RFC 3987) consistent, especially when they are written with US-ASCII characters only, so I think the normalization result of a://%99B/ should be identical for both. That is, the implementation for both URIs and IRIs could be changed, even though the "US-ASCII only" condition is present only in RFC 3987.

lo48576 commented 1 week ago

My current opinion is that a://%99B/ should be normalized to itself, not to a://%99b/.

Although host is case-insensitive, producers and normalizers should use lowercase for registered names and hexadecimal addresses for the sake of uniformity, while only using uppercase letters for percent-encodings.

RFC 3986 §3.2.2

From the phrase "should use lowercase for registered names", I think the case normalization to lowercase should apply only to the decoded domain names (registered names). Case normalization of percent-encoded triplets is a separate layer from the normalization of the "host" (registered name).
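To illustrate the "separate layer" point, here is a minimal sketch of the percent-encoding layer alone. The function name `uppercase_pct_hex` is hypothetical and is not part of iri-string; it only uppercases the hex digits inside well-formed `%XX` triplets and never touches literal letters, so it is independent of any lowercasing of the registered name.

```rust
/// Hypothetical helper (not an iri-string API): uppercases the two hex
/// digits of every well-formed `%XX` triplet and leaves all other bytes,
/// including literal letters such as the trailing `B` in `%99B`, untouched.
fn uppercase_pct_hex(input: &str) -> String {
    let mut bytes = input.as_bytes().to_vec();
    let mut i = 0;
    while i < bytes.len() {
        // A triplet is `%` followed by exactly two ASCII hex digits.
        let is_triplet = bytes[i] == b'%'
            && i + 2 < bytes.len()
            && bytes[i + 1].is_ascii_hexdigit()
            && bytes[i + 2].is_ascii_hexdigit();
        if is_triplet {
            bytes[i + 1].make_ascii_uppercase();
            bytes[i + 2].make_ascii_uppercase();
            i += 3;
        } else {
            i += 1;
        }
    }
    // Only ASCII bytes were modified, so the buffer is still valid UTF-8.
    String::from_utf8(bytes).expect("ASCII-only edits keep UTF-8 valid")
}
```

With this sketch, `uppercase_pct_hex("%9fB")` gives `"%9FB"`: the triplet is uppercased, while the literal `B` is left for the host layer to decide.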

When an IRI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and US-ASCII only host are case insensitive and therefore should be normalized to lowercase. (snip) Case equivalence for non-ASCII characters in IRI components that are IDNs are discussed in section 5.3.3.

RFC 3987 §5.3.2.1

So this "US-ASCII only (host)" should also be interpreted as "US-ASCII only registered name". Hosts that are written in percent-encoded ASCII form but are not US-ASCII only once decoded should fall into the same category as "non-ASCII characters in IRI components that are IDNs", even if they are not encoded using IDN (RFC 3490).
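The interpretation above can be sketched as follows. This is an illustration under stated assumptions, not the actual iri-string implementation: the names `hex_val`, `decoded_is_ascii_only`, and `lowercase_host_if_ascii_only` are hypothetical, and the sketch does not decode percent-encoded unreserved characters or handle IP literals.

```rust
/// Hypothetical helper: value of one ASCII hex digit, if it is one.
fn hex_val(b: u8) -> Option<u8> {
    (b as char).to_digit(16).map(|d| d as u8)
}

/// Hypothetical predicate: true if the host is US-ASCII only *after*
/// `%XX` triplets are decoded (i.e. no raw non-ASCII byte and no triplet
/// that decodes to a byte >= 0x80).
fn decoded_is_ascii_only(host: &str) -> bool {
    let b = host.as_bytes();
    let mut i = 0;
    while i < b.len() {
        if !b[i].is_ascii() {
            return false;
        }
        if b[i] == b'%' && i + 2 < b.len() {
            if let (Some(hi), Some(lo)) = (hex_val(b[i + 1]), hex_val(b[i + 2])) {
                if hi * 16 + lo >= 0x80 {
                    return false;
                }
                i += 3;
                continue;
            }
        }
        i += 1;
    }
    true
}

/// Lowercases literal letters only when the decoded host is US-ASCII only.
/// Percent-encoded triplets are skipped, so their hex digits keep their case.
fn lowercase_host_if_ascii_only(host: &str) -> String {
    if !decoded_is_ascii_only(host) {
        // E.g. `%99B`: 0x99 is not US-ASCII, so the host is left as-is.
        return host.to_owned();
    }
    let mut bytes = host.as_bytes().to_vec();
    let mut i = 0;
    while i < bytes.len() {
        let is_triplet = bytes[i] == b'%'
            && i + 2 < bytes.len()
            && bytes[i + 1].is_ascii_hexdigit()
            && bytes[i + 2].is_ascii_hexdigit();
        if is_triplet {
            i += 3;
        } else {
            bytes[i].make_ascii_lowercase();
            i += 1;
        }
    }
    String::from_utf8(bytes).expect("ASCII-only edits keep UTF-8 valid")
}
```

Under these assumptions, the host `%99B` is returned unchanged, while `EXAMPLE.com` becomes `example.com` and `Ex%2Dample.COM` becomes `ex%2Dample.com`.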

lo48576 commented 1 week ago

Closing as not a bug. If you think this is wrong or questionable, feel free to reopen or submit a new issue.