unicode-org / message-format-wg

Developing a standard for localizable message strings

Decide on formatting to something other than text #272

Closed mihnita closed 9 months ago

mihnita commented 2 years ago

As far as I know, all parties agree that we need to format to "something that is not text".

We need to decide what that is, or decide that it is not for the Tech Preview.

Something like format-to-parts, or the "paradigm change" that Zibi advocates, or the (Fluent inspired?) "binding" of formatting functions with variables (I think it was called "formatable"?)

mihnita commented 2 years ago

Some thoughts on this topic.

When implementing support for MF2 in a commonly used library like ICU, we should consider that ICU is used in many, many places. It is now the base for i18n support in Windows, macOS, iOS, Android, several Linux UI frameworks, browsers, and many applications.

So the functionality we expose should allow ALL of these customers to implement what they need on top of this ICU implementation.

And some of these users have their own ways to model rich UIs that might not map to a linear view of the world (sequence of parts), or a tree view of the world (DOM). We cannot force one single paradigm on them, we need something flexible.


For example, text-to-speech (TTS) support often requires the generation of parallel "tracks" of text:

Text UI: "Your credit card expires on Aug. 21, 2022"
TTS: "Your credit card expires on August 21st, 2022"

Depending on the expressiveness of TTS available one might do that by tagging a sequence of text with semantic info:

tts_info : {
   type: date
   value: Date {year: 2022 month: 8 day: 21 }
   fields: year, month, day
}

Or the TTS engine might be very basic, and you might want ICU to add an "explicit spellout text stream":

tts_info : {
   type: alternate_reading
   value: "August twenty-first, two thousand twenty-two"
}

The second approach is especially handy when you have custom formatters and you know the underlying TTS engine is not rich enough to know how to read it. Think people names: "Mr. Johnson" / "Mister Johnson", measurement units: "2m/s" / "2 meters per second", currency: "3.2 $US" / "3.2 American dollars", IP addresses, etc.
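To make the two "tracks" concrete, here is a minimal sketch of a formatted result that carries both the display text and the TTS information. The `Part` class and its field names are hypothetical, chosen only for illustration; they are not an ICU or MF2 API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Part:
    """One piece of a formatted message (hypothetical shape, not an ICU/MF2 type)."""
    text: str                                # what a text UI displays
    kind: str = "literal"                    # semantic type, e.g. "date"
    value: object = None                     # underlying value, for a rich TTS engine
    alternate_reading: Optional[str] = None  # explicit spell-out, for a basic TTS engine


def display_text(parts):
    """Join the visual track."""
    return "".join(p.text for p in parts)


def spoken_text(parts):
    """Join the spoken track, falling back to the display text."""
    return "".join(p.alternate_reading or p.text for p in parts)


parts = [
    Part("Your credit card expires on "),
    Part("Aug. 21, 2022", kind="date", value=date(2022, 8, 21),
         alternate_reading="August 21st, 2022"),
]

print(display_text(parts))
print(spoken_text(parts))
```

A rich TTS engine could ignore `alternate_reading` and work from `kind` and `value` directly, while a basic one would only read the spoken track.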


Another form of "overlapping parts"

Imagine you format an interval: "Schedule a vacation between {$vacation :interval year:numeric month:abbreviated day:numeric}."

The result might be plain text ("...between Aug 27-Sep 9, 2022.")

Or you might want it to look like the one above, but with clicks in various areas invoking various pickers. Click on "Aug" and show a drop-down list of month names; click on "2022" and show a drop-down list of years. Another implementation might want two date pickers, for start and end: click on "Aug 27" and you get a start date picker, click on "Sep 9, 2022" and you get an end date picker. Yet another might want to react to a click anywhere in the full range and show a date-range picker (https://www.daterangepicker.com/)

So a library should be able to convey the information that various ranges represent different concepts, and that they might overlap.

This means that a simple placeholder ({$vacation :interval skeleton:yMMMd}) can result in many overlapping parts. How many there are, and what those parts are, can only be determined after invoking the formatting function (in the example above there is no explicit starting year).
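One way to picture overlapping parts is as annotated ranges over the formatted text, rather than a flat sequence. The sketch below is only an illustration: the `Span` type, the span names, and `spans_at` are assumptions made for this example, not something proposed in this thread or provided by ICU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A named range over the formatted text (hypothetical shape)."""
    kind: str
    start: int
    end: int  # exclusive


TEXT = "Aug 27-Sep 9, 2022"


def span(kind, fragment):
    """Build a span covering the first occurrence of `fragment` in TEXT."""
    start = TEXT.index(fragment)
    return Span(kind, start, start + len(fragment))


# Ranges overlap: the whole interval, each date, and each field inside a date.
SPANS = [
    span("interval", TEXT),
    span("start-date", "Aug 27"),
    span("end-date", "Sep 9, 2022"),
    span("start-month", "Aug"),
    span("start-day", "27"),
    span("end-month", "Sep"),
    span("end-day", "9"),
    span("end-year", "2022"),
]


def spans_at(offset):
    """Return every span covering a click offset, innermost (shortest) first."""
    hits = [s for s in SPANS if s.start <= offset < s.end]
    return sorted(hits, key=lambda s: s.end - s.start)


print([s.kind for s in spans_at(1)])   # a click inside "Aug"
print([s.kind for s in spans_at(15)])  # a click inside "2022"
```

A UI that wants per-field pickers would use the innermost hit, one with start/end date pickers would use the middle one, and a date-range picker would use the outermost. Note that there is no "start-year" span, which is only known after formatting.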


Some existing frameworks that don't use HTML for formatting, but still format things in UI:


TLDR: what should a library return to support ALL of these use cases?

zbraniecki commented 2 years ago

That's a great in-depth analysis! I agree with your question, and your summary captures cases that I see emerging. Thank you for writing it down.

zbraniecki commented 2 years ago

One addition is that I am coming to the conclusion that there will be a significant use case for a system that is composed of two parts: a template engine to generate a partially resolved message, and then something I dubbed a Grammatical Correctness Engine (GCE) that will take it as input and use rules+ML to formulate the final sentence.

The GCE is something George has advocated for quite a while, and I see it emerging out of Alexa TTS and Amazon Retail needs, both as a way to resolve complex phrases without combinatorial explosion and as a way to allow for message variations (a common requirement in VA systems) without losing grammatical correctness.

If MF2 were to aspire to be a fit for the templating part of such a system (which I hope it would), then the schema of the output must be semantically meaningful for the GCE to reason about. Currently, in several VA systems I've encountered, such a GCE is hidden within TTS, and the bulk of what it does is reverse engineer semantic information out of a plain string. My hope is that with MF2 we can avoid the template>tostring>destring>GCE>TTS pipeline and instead have template>GCE, and then the output of that may be consumed by GUI bindings or by TTS (which also wants to preserve annotations on how to pronounce parts).

eemeli commented 2 years ago

My suspicion is that the exact shape of the output should be an implementation question, rather than one answered by the MF2 spec. For instance, the current JS Intl.MessageFormat proposal provides a resolveMessage() method that returns a sequence of MessageValue. The shape of these values is rather implementation-dependent, as e.g. the option baskets make reference to other parts of the ECMA-402 spec.

It is rather likely of course that this JS API will internally rely on an ICU implementation, but I would still think that even its corresponding interface should be outside the scope of MF2 itself. So rather than deciding what such non-string interfaces ought to look like, could it be sufficient at the MF2 spec level to ensure that the needs of these "consumers" of the spec are satisfied, and that a conformant implementation is able to define its own "parts" output?

mihnita commented 2 years ago

Then I would expect that the format returned by MF2 is something that can be transformed into the form specified by Intl.MessageFormat.

Nobody is asking for MF2 to return a Spanned, or an AnnotatedString. Similarly to what can be done (and was done) with formatToParts today (but hopefully better)
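As a rough illustration of "can be transformed into", the sketch below flattens a set of overlapping ranges into a linear sequence of parts by cutting the text at every range boundary. The tuple layout and the `to_linear_parts` helper are assumptions for this example only; the output is not the shape defined by Intl.MessageFormat or by any formatToParts API.

```python
def to_linear_parts(text, spans):
    """Flatten overlapping spans into consecutive, non-overlapping parts.

    `spans` is a list of (kind, start, end) tuples with `end` exclusive.
    Each output part lists every kind whose range fully covers it.
    """
    # Every span boundary becomes a cut point in the text.
    cuts = sorted({0, len(text)}
                  | {start for _, start, _ in spans}
                  | {end for _, _, end in spans})
    parts = []
    for start, end in zip(cuts, cuts[1:]):
        kinds = [k for k, s, e in spans if s <= start and end <= e]
        parts.append({"text": text[start:end], "kinds": kinds})
    return parts


text = "Aug 27-Sep 9, 2022"
spans = [
    ("interval", 0, 18),
    ("start-date", 0, 6),
    ("end-date", 7, 18),
    ("month", 0, 3),
    ("month", 7, 10),
]

linear = to_linear_parts(text, spans)
for part in linear:
    print(part)
```

Going in this direction is straightforward; going from a linear sequence back to the overlapping ranges is not, which is the argument for the library returning the richer form.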

If we don't do that, the only other options I can think of:


About Intl.MessageFormat, I am not involved, and I don't want to be involved. But I would recommend reading several times my explanation why a linear sequence of anything is limiting, and why the ability to represent overlapping "things" would be better. (if you say that Intl.MessageFormat can do that, then maybe a word other than "sequence" would be better)


Anyway, this issue is not even about deciding on a formatToSomething, and what that would be. It is about deciding whether designing that is in scope for the Tech Preview.

romulocintra commented 2 years ago

Consensus: It's not a blocker for the Tech Preview; it can be added in a different phase.

eemeli commented 1 year ago

@mihnita: As far as I know, all parties agree that we need to format to "something that is not text".

We need to decide what that is, or decide that it is not for the Tech Preview.

Following on from the discussions in #28 and #315, I realise that I should note that my preferences on this topic are not compatible with the premise presented above. While I do agree that we should fully support the formatting of messages to non-string targets and that we should prototype this in the JS and ICU4J implementations, I do not think that such a non-string target should be explicitly specified in the MF2 spec.

Rather, we should develop potentially multiple implementations of such non-string formatters and through that exercise help ensure that the text of the underlying spec supports them. So I would support for example the ICU4J Tech Preview experimenting with and implementing a formatted-parts API, but I do not see why the specification of that instance's implementation should be encoded in the MF2 spec.

eemeli commented 1 year ago

Replying to @zbraniecki in https://github.com/unicode-org/message-format-wg/issues/28#issuecomment-1344276332:

@eemeli: I continue to think that the shape of an MF2 formatted parts result should not be defined by the core MF2 spec, but by the spec/implementation layers building on top of it.

How do you envision it affecting the building of binding layers on top of MF2 for various frameworks, if ICU4C, ICU4J, ICU4X, ECMA-402 and maybe even SpiderMonkey vs V8 have different shapes of parts and encode information differently? Inevitably, some implementations will provide information allowing bindings to do things that other implementations will not provide sufficient information for.

I think we need to build some of these layers in practice in order to get an idea of what they should really look like and how they could support the features we need of them. As far as I know, the only such attempts so far are the ECMA-402 proposal (spec, polyfill) and unicode-org/icu4x#2272.

This interface will need to be well-specified at least in ECMA-402. There, the interface will need to be able to support a variety of different architectures and implementations built on top of it while aligning with existing JS formatToParts() APIs. Those constraints do not necessarily make sense to encode in detail at the MF2 spec level, but they're definitely required in ECMA-402.

Hence my suspicion that we could more efficiently reach alignment on this by working on the implementations, rather than pre-emptively trying to figure out one right answer that satisfies everyone.

aphillips commented 1 year ago

Assuming we land #463, will that address the need for this?