One does not simply "compare" Unicode strings

cajun-rat commented 5 years ago

In a couple of places in the spec we talk about comparing strings. Since these are likely to be Unicode, there is not a single method to perform a comparison. We should be explicit about which Unicode canonicalization should be used, and which comparison algorithm is implied when we say that a pair of 'Hardware Identifiers match' or a delegation's wildcard path matches a target.

iramcdonald commented 5 years ago

Hi,

To be compatible with all IETF standards-track protocols (including TLS), all strings MUST be created by the Sender, and normalized by the Receiver, as UTF-8 (RFC 3629) and normalized to Unicode Normalization Form C (NFC) to comply with Net Unicode (RFC 5198). NFC is defined in:

http://www.unicode.org/reports/tr15/

That "C" stands for "composed" (i.e., if a Unicode code point is defined that combines two glyphs, then use that code point). Note that Apple macOS (for historical reasons) instead uses "NFD" (decomposed), which makes strings longer and (without careful reordering) can introduce ambiguities because of the ordering of the decomposed code points.
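To make the composed/decomposed distinction concrete, here is a minimal sketch using Python's standard `unicodedata` module. The sample strings and the helper name `nfc_equal` are illustrative assumptions, not anything taken from the Uptane spec.

```python
import unicodedata

# The same visible text written two ways:
composed = "caf\u00e9"      # "é" as one precomposed code point (NFC form)
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent (NFD form)

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A naive code point comparison treats the two spellings as different strings.
print(composed == decomposed)            # False
# After NFC normalization they compare equal.
print(nfc_equal(composed, decomposed))   # True
# The decomposed form is also longer on the wire.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 5 6

# Ordering ambiguity: two combining marks typed in different orders.
# Normalization applies canonical reordering, so both spellings converge.
print(nfc_equal("q\u0307\u0323", "q\u0323\u0307"))  # True
```

The last line illustrates the reordering point: without normalization, the two `q` strings differ only in the order of their combining marks and would not match.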

Note: The use of UTF-16 (RFC 2781) MUST be PROHIBITED entirely in any Uptane implementation. A number of commercial applications still generate UTF-16, so a string conversion and normalization library on both the Sender and the Receiver is necessary.
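As a rough sketch of the conversion step described above, the following Python function decodes UTF-16 input and re-emits it as NFC-normalized UTF-8. The function name `utf16_to_nfc_utf8` is hypothetical, and a real ECU would presumably do this in C or similar with a library such as ICU; Python is used here only for illustration.

```python
import unicodedata

def utf16_to_nfc_utf8(raw: bytes) -> bytes:
    """Decode UTF-16 bytes and re-encode them as NFC-normalized UTF-8.

    Hypothetical helper. Python's "utf-16" codec honours a leading
    byte-order mark; without one it assumes the platform's native order.
    """
    text = raw.decode("utf-16")
    return unicodedata.normalize("NFC", text).encode("utf-8")

# Example: a decomposed string as a UTF-16 producer might emit it (with BOM).
utf16_input = "cafe\u0301".encode("utf-16")
wire_bytes = utf16_to_nfc_utf8(utf16_input)
print(wire_bytes)  # b'caf\xc3\xa9'
```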

Cheers,

Ira

Ira McDonald (Musician / Software Architect) Co-Chair - TCG Trusted Mobility Solutions WG Co-Chair - TCG Metadata Access Protocol SG Chair - Linux Foundation Open Printing WG Secretary - IEEE-ISTO Printer Working Group Co-Chair - IEEE-ISTO PWG Internet Printing Protocol WG IETF Designated Expert - IPP & Printer MIB Blue Roof Music / High North Inc http://sites.google.com/site/blueroofmusic http://sites.google.com/site/highnorthinc mailto: blueroofmusic@gmail.com PO Box 221 Grand Marais, MI 49839 906-494-2434

mnm678 commented 5 years ago

Hi,

In the TUF area, we have been discussing creating shared wireline formats to allow for interoperability between implementations (details here: https://github.com/theupdateframework/taps/blob/21a2ee49b395346789074cc8ad8b73b5f89e5b0f/tap11.md). I think a version of this could be useful in allowing Uptane users to specify the Unicode canonicalization (and comparison method) for their implementation. This might prevent the need for string conversion on ECUs.

-Marina

iramcdonald commented 5 years ago

Hi Marina,

Interesting - thanks.

The reason that I mentioned ECU-side string conversion is that the native RTOS APIs as well as many/most application libraries do NOT actually exchange NFC canonical UTF-8, so the ECU will have to do some string conversion before sending (and some on receiving before pouring into local APIs).

Cheers,

Ira

JustinCappos commented 5 years ago

Today we resolved to state the requirements ('unique encoding', etc.) explicitly in the specification. @iramcdonald, would you kindly help?

When Mike + others from Airbiquity and @awwad / @mnm678 write their profiles, they will have this level of specificity.

tkfu commented 5 years ago

I've opened up https://github.com/uptane/uptane-standard/pull/84 to address this, or at least start to.

tkfu commented 5 years ago

On the 03/13 standards call, we noted that there's a PR open and awaiting review. Once it's reviewed/accepted/merged, we can close this.

tkfu commented 5 years ago

In my view, mandating that all strings in the metadata conform to RFC 5198 takes care of the string comparison issue. #84 has been merged after an approving review, so I am going to close this. If anyone thinks it's not yet resolved, please leave a comment and I'll re-open.
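For illustration, here is a minimal Python sketch of why this settles comparison: once both sides are known to be NFC-normalized UTF-8, a "match" can be plain byte equality. All function names are hypothetical. The check covers only the encoding and normalization parts of RFC 5198 (not its line-ending or control-character rules). The use of shell-style wildcards via `fnmatchcase` is an assumption about how delegation paths are matched, and note that its `*` also matches `/`.

```python
import unicodedata
from fnmatch import fnmatchcase

def to_wire(s: str) -> bytes:
    """Encode a string as NFC-normalized UTF-8 (hypothetical helper)."""
    return unicodedata.normalize("NFC", s).encode("utf-8")

def is_nfc_utf8(raw: bytes) -> bool:
    """True if raw is valid UTF-8 whose text is already in NFC."""
    try:
        text = raw.decode("utf-8")  # strict decoding rejects malformed input
    except UnicodeDecodeError:
        return False
    return unicodedata.is_normalized("NFC", text)  # Python 3.8+

def hardware_ids_match(a: bytes, b: bytes) -> bool:
    """Byte-wise comparison, valid only for conforming input."""
    if not (is_nfc_utf8(a) and is_nfc_utf8(b)):
        raise ValueError("identifier is not NFC-normalized UTF-8")
    return a == b

def path_matches(pattern: bytes, target: bytes) -> bool:
    """Case-sensitive shell-style wildcard match on conforming input."""
    if not (is_nfc_utf8(pattern) and is_nfc_utf8(target)):
        raise ValueError("path is not NFC-normalized UTF-8")
    return fnmatchcase(target.decode("utf-8"), pattern.decode("utf-8"))

print(hardware_ids_match(to_wire("caf\u00e9"), to_wire("cafe\u0301")))  # True
print(path_matches(to_wire("firmware/*.bin"), to_wire("firmware/ecu-a.bin")))  # True
```

Rejecting non-conforming input, rather than silently normalizing it on receipt, is one possible design choice; a receiver that normalizes instead would call `to_wire` on the decoded text before comparing.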

uptane / uptane-standard

One does not simply "compare" Unicode strings #42