unicode-org / message-format-wg

Developing a standard for localizable message strings

Conformance with UAX #31 & UTS #55 #847

Open eemeli opened 1 month ago

eemeli commented 1 month ago

It would probably be a good idea for us to be conformant with both UAX #31 and UTS #55. We're already in #673 and the bidi-usability design doc working on some of the missing pieces, but we should do a more thorough review and provide explicit conformance statements somewhere.

When reviewing these, one part that I noticed us missing is UAX31-R4:

Equivalent Normalized Identifiers: To meet this requirement, an implementation shall specify the Normalization Form and shall provide a precise specification of the characters that are excluded from normalization, if any. [...] Except for identifiers containing excluded characters, any two identifiers that have the same Normalization Form shall be treated as equivalent by the implementation.

This is specifically recommended in UTS 55:

It is recommended that all languages that use default identifiers meet requirement UAX31-R4 Equivalent Normalized Identifiers, with the normalization described in this section. [...] Case-sensitive computer languages should meet requirement UAX31-R4 with normalization form C. They should not ignore default ignorable code points in identifier comparison.

The easiest way to be conformant with that would be to normalize identifiers with form C before comparing them, such that e.g. tämä and tämä are considered equal ("this" in Finnish, normalized with forms C & D respectively).
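A minimal sketch of that comparison, assuming Python's standard `unicodedata` module; the helper name `same_identifier` is hypothetical and not taken from any MessageFormat implementation:

```python
import unicodedata


def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


nfc = "t\u00e4m\u00e4"    # precomposed LATIN SMALL LETTER A WITH DIAERESIS (U+00E4)
nfd = "ta\u0308ma\u0308"  # "a" followed by COMBINING DIAERESIS (U+0308)

print(nfc == nfd)                 # False: the code point sequences differ
print(same_identifier(nfc, nfd))  # True: both normalize to the same NFC string
```

Only the comparison is normalized here; the identifiers themselves are left as written in the message.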

I'm not completely certain whether further changes would be needed for other conformance requirements.

aphillips commented 1 month ago

Normalization can be tricky here. I think we want to avoid requiring that a message be in a specific normalization form. That would prevent patterns from using a denormalized sequence. This isn't what R4 is about: it's only about identifiers. In our case, this would be variable, function, and option names and include namespaces. It might or might not include keys (I'd have to think about it and would probably leave it up to the selector function).

Requiring NFC for namespace, variable, function, and option names seems reasonable. In practice we're mostly talking about variable names.

macchiati commented 1 month ago

One way to do that while not specifying the format of the message itself is to specify that when evaluating a message, e.g., the following behave identically when their NFC forms are identical.

  1. two names
  2. two namespaces
  3. two literals

I think we should at least have a SHOULD for using IDType=RECOMMENDED for names and namespaces.

aphillips commented 1 month ago

Names and namespaces, yes. Literals we might need to put SHOULD on. The only place where literal matching takes place in our syntax is between keys and selectors, where we let the selector function decide. In that case, it might use different quality matches to differently encoded (but canonically identical) strings or it might be purely code point based (managing variants that depend on different encoded sequences would be an adventure, since the strings appear identical).

macchiati commented 1 month ago

I think a message with two selection keys that are canonical equivalents would be a serious mistake, and also break interoperability. We cannot stop a message from being canonicalized with data exchange, and that could result in two identical keys. I'm not sure about other uses of literals — they might just be a SHOULD.

I just wrote that, it appears (a surprise to me!) that there is no constraint on identical keys, so one could have:

.input {$count :number}
.match {$count}
2 {{You have a pair of notifications.}} 
2 {{You have both notifications.}}
one {{You have {$count} notification.}}
*   {{You have {$count} notifications.}}

aphillips commented 1 month ago

I think a message with two selection keys that are canonical equivalents would be a serious mistake, and also break interoperability.

Totally--a mistake on the part of the message author, although I can barely visualize separating a few cases e.g. singleton mappings (Ω vs.Ω) or perhaps Hangul syllables vs. jamo (가 vs.가), perhaps in writing a normalization demo? A more likely use would be a literal with only a combining mark or combining sequence with no base character in it. Really, though, we can permit and even encourage selectors to do the normalization, notably by requiring it in the :string function's selector. I'm more concerned with not requiring implementers to spend a lot of code checking literals and keys for what might best be described as a quirk best handled in the selector itself.
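The two cases mentioned here can be checked directly. This is only a sketch using Python's standard `unicodedata` module, with the code points spelled out as escapes because the strings look identical when rendered:

```python
import unicodedata

pairs = {
    # OHM SIGN has a singleton canonical mapping to GREEK CAPITAL LETTER OMEGA
    "singleton": ("\u2126", "\u03a9"),
    # HANGUL SYLLABLE GA vs. the jamo sequence CHOSEONG KIYEOK + JUNGSEONG A
    "hangul": ("\uac00", "\u1100\u1161"),
}

results = {}
for name, (x, y) in pairs.items():
    same_code_points = x == y
    same_after_nfc = unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)
    results[name] = (same_code_points, same_after_nfc)
    print(name, same_code_points, same_after_nfc)
```

In both pairs the raw strings differ and the NFC forms match, so a purely code point based selector would treat them as different keys.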

I just wrote that, it appears (a surprise to me!) that there is no constraint on identical keys

There isn't a constraint on identical keys, in part because of the need for them with multiple selectors:

.match {$a :number} {$b :number}
0   0   {{...satisfy yourself that each selector has duplicate keys in isolation...}}
0   one {{...}}
0   *   {{...}}
one 0   {{...}}
one one {{...}}
one *   {{...}}
*   0   {{...}}
*   one {{...}}
*   *   {{...}}
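The property of the table above can be checked mechanically. In this sketch the variants are modeled as plain tuples of key strings, which is an illustrative assumption rather than the data model of any implementation:

```python
# Key lists of the nine variants above, in source order
variants = [
    ("0", "0"), ("0", "one"), ("0", "*"),
    ("one", "0"), ("one", "one"), ("one", "*"),
    ("*", "0"), ("*", "one"), ("*", "*"),
]

first_column = [keys[0] for keys in variants]
second_column = [keys[1] for keys in variants]

# Each selector sees repeated keys when looked at in isolation...
columns_repeat = all(len(set(col)) < len(col) for col in (first_column, second_column))
# ...but no complete key list occurs twice.
key_lists_unique = len(set(variants)) == len(variants)

print(columns_repeat, key_lists_unique)
```

This is why a uniqueness rule has to apply to whole key lists rather than to the keys of an individual selector.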

Actual selection in such cases as your example might be somewhat arbitrary, since the MatchSelectorKeys function is implementation defined. Presumably one of the "equal" most-preferred cases will be first (and thus selected). I think it might be an error for the fallback case to be duplicated, though (the all-* case).

macchiati commented 1 month ago

It would, however, be a mistake (and I as a message format composer would want to be informed of it!) if two key lists (= key *(s key) in the ABNF) were identical (and we should at least recommend warning if they are identical after NFC).
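A sketch of such a warning check, assuming each variant's key list is available as a tuple of strings; the function name `nfc_duplicate_key_lists` is hypothetical:

```python
import unicodedata
from collections import defaultdict


def nfc_duplicate_key_lists(key_lists):
    """Return groups of variant indexes whose key lists are equal after NFC."""
    groups = defaultdict(list)
    for index, keys in enumerate(key_lists):
        normalized = tuple(unicodedata.normalize("NFC", key) for key in keys)
        groups[normalized].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


key_lists = [
    ("t\u00e4m\u00e4", "one"),    # precomposed form
    ("ta\u0308ma\u0308", "one"),  # decomposed form of the same text
    ("*", "*"),
]
print(nfc_duplicate_key_lists(key_lists))  # [[0, 1]]
```

Whether a non-empty result is reported as a warning or as an error is the policy question under discussion; the check itself is the same either way.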

BTW, I think it would be a useful bit of structure (and associated terminology) to change

variant           = key *(s key) [s] quoted-pattern

to

variant           = key-list [s] quoted-pattern
key-list          = key *(s key)

eemeli commented 1 month ago

Forbidding variants with exactly the same keys would also be in line with the validity requirement we impose on options: https://github.com/unicode-org/message-format-wg/blob/4514e880a8690ba2dd78a70696b4c89db93697ba/spec/syntax.md?plain=1#L580-L581

Regarding which, we probably need language somewhere noting that "duplicate" covers all option identifiers that normalize to the same value.

aphillips commented 1 month ago

We have to be careful about identity here. Two keys can evaluate to the "same value" for a given selector, even without string normalization. I agree that it would be good to inform users of duplicate key lists, but the evaluation of duplication can depend on the selector function, not just the value.

We don't currently provide any date/time selection, but it provides a simple example of non-identical keys that evaluate "identically":

.local $date = {|2024-07-31T23:33:33-10:00| :datetime}  <- this is 2024-08-01T09:33:33Z
.match {$date :before}
|2024-08-01|                {{August 1 floating date value}}
|2024-08-01T12:00:00Z|      {{August 1 in some places}}
|2024-08-01T22:00:00+10:00| {{August 1 in the same places as Z}}
|2024-08-01T02:00:00-10:00| {{August 1 in the same places as Z}}
* {{ ... }}

eemeli commented 1 month ago

@aphillips Could you clarify whether you think we should or should not add a message validity requirement about identical variant keys? As in, do you think that the following message should be considered valid, or should formatting this message produce a data model error?

.match {$x :string} {$y :string}
a b {{first}}
a b {{second}}
* * {{other}}

aphillips commented 1 month ago

@eemeli I think it should be an error, as it's more helpful to users to warn them that they've specified the same thing twice, probably, as @macchiati suggests, by making two identical key lists an error.

We should have the additional proviso that the keys are not required to be "normalized" (I do not mean Unicode character normalization here) when the implementation checks them for this level of identity. Note too that existing language about quoted vs. unquoted, escaping, etc. applies. That is, this is also an error:

.match {$x :string} {$y :string}
a b     {{first}}
|a| |b| {{second}}
* *     {{other}}

In other words, yes, identical key lists is a data-model-error. However, users should be warned that there is no data model error for different key lists that produce identical match results.
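A rough sketch of that identity check, assuming keys arrive as source tokens, that quoted literals are delimited by `|` with backslash escapes, and that a bare `*` is the catch-all key. The names `key_value`, `find_duplicate` and `CATCH_ALL` are hypothetical:

```python
import re

CATCH_ALL = object()  # sentinel: the bare * key is not the literal string "*"


def key_value(source: str):
    """Return the value a key denotes, independent of how it was written."""
    if source == "*":
        return CATCH_ALL
    if len(source) >= 2 and source.startswith("|") and source.endswith("|"):
        # Drop the delimiters, then resolve backslash escapes such as \| and \\
        return re.sub(r"\\(.)", r"\1", source[1:-1])
    return source


def find_duplicate(variants):
    """Return the indexes of the first two variants with the same key values, or None."""
    seen = {}
    for index, keys in enumerate(variants):
        values = tuple(key_value(key) for key in keys)
        if values in seen:
            return (seen[values], index)
        seen[values] = index
    return None


variants = [("a", "b"), ("|a|", "|b|"), ("*", "*")]
print(find_duplicate(variants))  # (0, 1)
```

The comparison is on literal values only. It says nothing about key lists that differ as strings but that a selector function happens to match identically.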

macchiati commented 1 month ago

Re: We have to be careful about identity here.

Clearly a particular selector-list for a particular locale can evaluate two key-lists as identical that are not (as you say). But that is not a good reason to allow either precisely identical key-lists, or canonically equivalent key-lists.
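The difference between the two kinds of identity can be sketched with three of the timestamp keys from the earlier example. This assumes a selector that compares instants, which is an illustration only, since no such date/time selector currently exists:

```python
from datetime import datetime

keys = [
    "2024-08-01T12:00:00Z",
    "2024-08-01T22:00:00+10:00",
    "2024-08-01T02:00:00-10:00",
]


def instant(text: str) -> datetime:
    # datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+,
    # so the UTC offset is spelled out explicitly here.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


distinct_strings = len(set(keys))
distinct_instants = len({instant(key) for key in keys})
print(distinct_strings, distinct_instants)  # 3 1
```

A check on key lists can catch the first kind of duplication (identical or canonically equivalent strings) but not the second, which depends on what the selector does with the values.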

macchiati commented 1 month ago

In other words, yes, identical key lists is a data-model-error. However, users should be warned that there is no data model error for different key lists that produce identical match results.

Agreed.

aphillips commented 3 days ago

I agree that we need to address this, but not that #869, as currently written, is the fix. Based on the discussion here and in that PR, @eemeli do you want to fix #869 or replace it? Or should I make a PR? (let's discuss in the call in an hour)