lo48576 / iri-string

String types for URIs/IRIs.
Apache License 2.0

Decode percent encodings for non-ASCII characters in `iunreserved` category on normalization #19

Closed lo48576 closed 2 years ago

lo48576 commented 2 years ago

IRIs are defined similarly to URIs in [RFC3986], but the class of unreserved characters is extended by adding the characters of the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject to the limitations given in the syntax rules below and in section 6.1.

--- RFC 3987 section 2.1. Summary of IRI Syntax

These IRIs should be normalized by decoding any percent-encoded octet sequence that corresponds to an unreserved character, as described in section 2.3 of [RFC3986].

--- RFC 3987 section 5.3.2.3. Percent-Encoding Normalization

RFC 3987 says that percent-encoded octet sequences that correspond to unreserved characters in an IRI (including non-ASCII characters in the `iunreserved` category) should be decoded. However, the current implementation of this crate decodes only ASCII unreserved characters and leaves percent-encoded `iunreserved` characters untouched.
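To make the requested behavior concrete, here is a minimal sketch of percent-encoding normalization that also decodes non-ASCII `iunreserved` characters. It is not this crate's API: the names `is_ucschar`, `is_iunreserved`, and `normalize_pct` are hypothetical, and the sketch assumes the input is a plain string with no per-component handling. The `ucschar` ranges are taken from RFC 3987 section 2.2. Triplets that do not decode to an `iunreserved` character (reserved characters, private-use characters, invalid UTF-8) stay percent-encoded, with hex digits uppercased.

```rust
/// Returns true if `c` matches the `ucschar` rule of RFC 3987 section 2.2.
fn is_ucschar(c: char) -> bool {
    let cp = c as u32;
    match cp {
        0xA0..=0xD7FF | 0xF900..=0xFDCF | 0xFDF0..=0xFFEF => true,
        // Planes 1 to 13: all but the last two code points of each plane.
        0x1_0000..=0xD_FFFF => (cp & 0xFFFF) <= 0xFFFD,
        0xE_1000..=0xE_FFFD => true,
        _ => false,
    }
}

/// iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
fn is_iunreserved(c: char) -> bool {
    c.is_ascii_alphanumeric() || matches!(c, '-' | '.' | '_' | '~') || is_ucschar(c)
}

fn hex_val(b: u8) -> Option<u8> {
    (b as char).to_digit(16).map(|d| d as u8)
}

fn push_pct(out: &mut String, b: u8) {
    out.push_str(&format!("%{:02X}", b));
}

/// Decodes one run of percent-decoded bytes, keeping anything that is not
/// a valid UTF-8 encoding of an `iunreserved` character percent-encoded.
fn decode_run(mut run: &[u8], out: &mut String) {
    while !run.is_empty() {
        let (valid, bad_len) = match std::str::from_utf8(run) {
            Ok(s) => (s, 0),
            Err(e) => {
                let n = e.valid_up_to();
                // `error_len()` is `None` when the input ends mid-sequence.
                let bad = e.error_len().unwrap_or(run.len() - n);
                (std::str::from_utf8(&run[..n]).unwrap(), bad)
            }
        };
        for c in valid.chars() {
            if is_iunreserved(c) {
                out.push(c);
            } else {
                let mut buf = [0u8; 4];
                for &b in c.encode_utf8(&mut buf).as_bytes() {
                    push_pct(out, b);
                }
            }
        }
        let consumed = valid.len();
        for &b in &run[consumed..consumed + bad_len] {
            push_pct(out, b);
        }
        run = &run[consumed + bad_len..];
    }
}

/// Hypothetical normalizer: decodes `%XX` sequences that correspond to
/// `iunreserved` characters and uppercases the hex digits of the rest.
fn normalize_pct(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < bytes.len() {
        // Collect a maximal run of consecutive well-formed `%XX` triplets.
        let mut run = Vec::new();
        let mut j = i;
        while j + 2 < bytes.len() && bytes[j] == b'%' {
            match (hex_val(bytes[j + 1]), hex_val(bytes[j + 2])) {
                (Some(h), Some(l)) => {
                    run.push((h << 4) | l);
                    j += 3;
                }
                _ => break,
            }
        }
        if run.is_empty() {
            // Not a triplet: copy one character through unchanged.
            let ch = input[i..].chars().next().unwrap();
            out.push(ch);
            i += ch.len_utf8();
        } else {
            decode_run(&run, &mut out);
            i = j;
        }
    }
    out
}
```

Under these assumptions, `normalize_pct("%E3%81%82")` yields `あ`, while `normalize_pct("a%2fb")` yields `a%2Fb` because `/` is a reserved character. A real implementation would also need to respect component-specific rules, which this sketch ignores.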

lo48576 commented 2 years ago

This blocks #18.