unicode-org / message-format-wg

Developing a standard for localizable message strings

Well-formed vs valid #935

Open macchiati opened 2 weeks ago

macchiati commented 2 weeks ago

Added text 2024-11-24


I think we need to be careful about our usage of the terms 'well-formed' and 'valid'. The following is not fully fleshed out; it is more of a discussion of the issue and some ideas for the future.

We often reference other sources for identifiers, and want them to be interpreted according to that source. Sources that change over time should (and typically do) distinguish between well-formed and valid. For example, 'ge:manic' is not a well-formed locale identifier, and 'de-Flub' is not a valid locale identifier. However, 'de-Flub' could (conceivably) become valid in the future, if a script is given the code 'Flub'. Good sources also never remove identifiers, or make material changes in the meaning, but may deprecate them: those are still treated as valid.
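To make the distinction concrete, here is a minimal sketch in Python. The shape check is deliberately simplified (language, optional script, optional region only) and the subtag sets are hypothetical stand-ins for a snapshot of the real registry; none of these names come from the spec.

```python
import re

# Simplified shape: 2-3 letter language, optional 4-letter script,
# optional region (2 letters or 3 digits). Real locale syntax has more parts.
_LOCALE_SHAPE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)

# Hypothetical sample data, standing in for one version of the source.
KNOWN_LANGUAGES = {"de", "en", "fr"}
KNOWN_SCRIPTS = {"latn", "cyrl"}


def is_well_formed(tag: str) -> bool:
    """Syntax only: does the tag have the right shape?"""
    return _LOCALE_SHAPE.fullmatch(tag) is not None


def is_valid(tag: str) -> bool:
    """Shape plus membership in the (sample) registry snapshot.

    The region subtag is not checked in this sketch.
    """
    match = _LOCALE_SHAPE.fullmatch(tag)
    if match is None:
        return False
    if match["language"].lower() not in KNOWN_LANGUAGES:
        return False
    script = match["script"]
    return script is None or script.lower() in KNOWN_SCRIPTS
```

With this sketch, 'ge:manic' fails the shape check, while 'de-Flub' passes it and only fails the registry lookup. Adding 'flub' to the script set would make it valid without touching the syntax rule, which is the point of keeping the two checks separate.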

When we reference such sources in message format, such as with option values, we have a few goals.


This is also true for our own enums. We have in registry.md:

Implementations MAY accept additional option values for options defined here. However, such values might become defined with a different meaning in the future, including with a different, incompatible name or using an incompatible value space. Supporting implementation-specific option values for standard or optional functions is NOT RECOMMENDED.

We also have BNF:

option = identifier o "=" o (literal / variable)

The implications are that a conformant implementation can interpret any of:

{$x :currency compactDisplay=short}
{$x :currency compactDisplay=medium}
{$x :currency compactDisplay=μικρός}
{$x :currency compactDisplay=|🐭|}
{$x :currency compactDisplay=$myDisplay}

It can also interpret:

{$x :currency currency=CAD}
{$x :currency currency=MyCurrency}
{$x :currency currency=δολάριοΚαναδά}
{$x :currency currency=|¥|}
{$x :currency currency=|🐭|}
{$x :currency currency=$myCurrency}

It could also interpret compactDisplay=short by formatting a long form, and compactDisplay=long by formatting a short form. Or a value of CAD as being GBP, etc.

This level of freedom seems counterproductive for interoperability.


So I propose that we have a general rule, something like the following, for cases where option values are defined by reference to an external source:

Ignore means that the expression is interpreted as if the option were not there. (I won't talk here about what signals to the caller are associated with that.)


I think we could apply that to our standard enum option values, such as the following in https://github.com/unicode-org/message-format-wg/blob/main/spec/registry.md#options-1, so that |@!$| could be recognized as ill-formed.

That is, perhaps we can have a rule in the registry for our functions, something like: the default well-formedness criteria for standard function option values match the constraints on function option identifiers in README.md. Thus |$abc| would be ill-formed for useGrouping. Any function option that had different criteria for well-formedness of its values would simply have an explicit well-formedness statement.
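As a rough illustration of such a default rule, the sketch below checks an already-unquoted option value against an identifier-like pattern. The pattern is only an approximation of the MF2 name production (the real ABNF lists explicit code point ranges), and the function name is hypothetical.

```python
import re

# Approximation of an MF2-style name: a letter or underscore first,
# then letters, digits, "_", "-" or ".". Not the exact ABNF.
_NAME_LIKE = re.compile(r"[^\W\d][\w.\-]*")


def is_well_formed_enum_value(value: str) -> bool:
    """Return True if the unquoted option value looks like an identifier."""
    return _NAME_LIKE.fullmatch(value) is not None
```

Under this approximation the contents of |@!$| and |$abc| are rejected as ill-formed, while short and μικρός are both well-formed. Whether μικρός is a valid value for compactDisplay is then a separate question for the function to answer.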


aphillips commented 2 weeks ago

A few notes:

Note that we have text about option resolution in the spec which does indeed drop bad options on the floor. But for options whose interpretation is inside the function handler, the dropping-on-the-floor part is up to the function itself. This is why there is a resolved options section in each function: it defines which options are visible downstream (functions don't currently eat any of their options)

Are there specific changes you want in the spec? I'd advise a careful look at u-namespace.md and registry.md as well as option resolution in formatting.md.

macchiati commented 2 weeks ago

I was struck by the fact that we are requiring valid for some identifiers (eg timezones), but only well-formed for currencies. Those feel like very similar cases, so if well-formed is right for currency, that term should also be right for timezones (or the inverse).

we don't "MUST ignore" options whose values are ill-formed for some of the reference sources because we allow for implementation defined values

But a straightforward reading of registry.md means that we don't allow that in many cases, namely whenever we say well-formed (as for currencies) or valid, as in:

timeZone (default is system default time zone or UTC)

But that means I can't use implementation-defined identifiers like "$California Time"

aphillips commented 2 weeks ago

No, you're correct about this. We should be well-formed for acceptance but permit checking for validity. And we should fix values to permit implementation-specific gorp (mainly for platform-specific values that aren't the sanctioned identifiers)
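One way to read "well-formed for acceptance, validity check permitted" is sketched below for currency codes. It assumes that a well-formed code is three ASCII letters; the function name and the optional known_codes parameter are hypothetical and not taken from registry.md.

```python
import re
from typing import Optional, Set

# Assumption: a well-formed currency code is exactly three ASCII letters.
_CURRENCY_SHAPE = re.compile(r"[A-Za-z]{3}")


def accept_currency(value: str, known_codes: Optional[Set[str]] = None) -> Optional[str]:
    """Return the normalized code, or None if the value is rejected.

    Well-formedness is always required. The validity check runs only
    when the caller supplies a set of known codes.
    """
    if _CURRENCY_SHAPE.fullmatch(value) is None:
        return None
    code = value.upper()
    if known_codes is not None and code not in known_codes:
        return None
    return code
```

Without a known_codes set, any three-letter value such as CAD is accepted and MyCurrency or ¥ is rejected. Platform-specific values that do not fit the three-letter shape would need a separate, explicitly documented escape hatch.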

aphillips commented 2 weeks ago

I think what we should do is: merge #911 and #922 and then do a cleanup edit on registry.md in a new PR

macchiati commented 2 weeks ago

makes perfect sense

duerst commented 2 weeks ago

Mark (@macchiati) wrote:

  • An implementation MUST ignore any option with an option value that is ill-formed according to its source.

    • [It must ignore the option locale=|ge:manic|]
  • An implementation MUST ignore any option with an option value that isn't valid according to any version of the source. [At the time of this writing, must ignore locale=|dab|]
  • An implementation SHOULD (but need not) ignore an option with an option value that is valid according to some version of the source. [An implementation might not support Dezfuli, and thus ignore locale=|def|; it may also ignore all deprecated language identifiers, and thus ignore locale=|daf|.]

I think the SHOULD in this paragraph should be a MAY, for obvious reasons.
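The three tiers in the quoted rule can be sketched as a filter over a function's options, where "ignore" means the option is dropped as if it had never been written. Everything here is illustrative: the Status names, resolve_options, and the sample classifier are hypothetical, and the sample table simply reuses the locale examples quoted above.

```python
from enum import Enum


class Status(Enum):
    ILL_FORMED = "ill-formed"
    NEVER_VALID = "well-formed, not valid in any version"
    VALID_UNSUPPORTED = "valid in some version, unsupported here"
    SUPPORTED = "valid and supported"


def resolve_options(options, classify):
    """Return only the options the function will act on.

    `classify(name, value)` must return a Status.
    """
    kept = {}
    for name, value in options.items():
        status = classify(name, value)
        if status in (Status.ILL_FORMED, Status.NEVER_VALID):
            continue  # first two tiers: MUST ignore
        if status is Status.VALID_UNSUPPORTED:
            continue  # third tier: ignoring is permitted; this sketch does so
        kept[name] = value
    return kept


# Hypothetical classifier built from the examples in the quoted text.
_SAMPLE = {
    ("locale", "ge:manic"): Status.ILL_FORMED,
    ("locale", "dab"): Status.NEVER_VALID,
    ("locale", "def"): Status.VALID_UNSUPPORTED,
}


def sample_classify(name, value):
    return _SAMPLE.get((name, value), Status.SUPPORTED)
```

Only the third branch is a choice. An implementation that does support a given valid value would keep the option there, and that branch is where the SHOULD versus MAY wording matters.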

aphillips commented 1 week ago

This was discussed in the 2024-11-18 call. We resolved to use valid in most cases, but with careful phrasing in the boilerplate. I believe this is now addressed?

macchiati commented 1 week ago

I elaborated a bit. I would like to discuss further, after 46.1

aphillips commented 1 week ago

I see your elaboration. One callout:

"implementation" has to be used carefully. In most cases in our spec it refers to the MessageFormat framework/executable/host environment itself, e.g. in ICU4J the actual MessageFormatter class. And it is true that the ABNF and well-formed/validity rules at the message level are quite permissive about option values.

At the function set level, there is a different layer of "implementation", specifically what we call the function handler. This is what a lot of the normative language in the current registry.md is about. In general, the function handler is some code that maps option values to local API-specific representations. So for "digit size options", it parses the option value. If it's a positive integer, great. Otherwise it's not valid.
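A minimal sketch of that handler-level step for a digit size option is shown below. It follows the wording of this comment (positive integer, otherwise not valid); the function name is hypothetical, and whether zero is accepted is left to the actual spec text.

```python
import re
from typing import Optional

_DIGITS = re.compile(r"[0-9]+")


def parse_digit_size(value: str) -> Optional[int]:
    """Map an option value to an int, or None when it is not valid."""
    if _DIGITS.fullmatch(value) is None:
        return None  # not a plain run of ASCII digits
    number = int(value)
    return number if number > 0 else None
```

The message-level parser never sees this check. It accepts any syntactically correct literal and hands the string to the handler, which then decides what the local API can do with it.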

We definitely want to impose standards on options and their values, to ensure interoperability. But the MF2-level implementation has no role in this (once the message is syntactically correct). Instead, the specific function handler, such as the one for :integer, is involved. Thus the wording needs to be precise about where the "implementation" is taking place. And it needs to not impose such restrictions as would limit extensibility or prevent the correct level in the code from receiving the information.

macchiati commented 1 week ago

I agree that there are important distinctions to be made, and in any final text we should make it clear. What I'm specifically talking about are the implementations of the standard functions defined in the registry.md. Whatever we do, it should be clear what kinds of results we can expect to have, and what kinds of errors we can expect to see raised (which might be different for ill-formed vs well-formed+invalid vs well-formed+valid+unsupported vs well-formed+valid+supported).

Some of that could apply to implementation-defined functions, but I didn't want to talk about that in this issue.