tc39 / proposal-intl-messageformat

TC39 Proposal for Intl.MessageFormat
https://tc39.es/proposal-intl-messageformat
MIT License

This proposal is stuck #49

Open eemeli opened 7 months ago

eemeli commented 7 months ago

This proposal is currently stuck, and unlikely to advance for some number of years.

As originally proposed, this proposal is about introducing Intl.MessageFormat as a native parser and formatter for MessageFormat 2.0 messages.

Leading up to and during the 2024 February TC39 meeting (presentation and discussion, first continuation, second continuation), concerns were raised by some committee members about introducing a parser for a new domain-specific language into JavaScript, to the extent that the committee as a whole was not comfortable with advancing this proposal until:

To standardize the syntax of a DSL, it would be meaningful/persuasive to see around a dozen organizations of various sizes, including ones which were not involved in MF2 development, make significant use in production of MF2 syntax across their stack (engaging application developers, translators, infrastructure developers, …). This will likely be required for Stage 2.7. It remains to be defined whether an intermediate, lower amount of experience would be sufficient for Stage 2.

During the meeting, I raised the possibility of leaving out the syntax parser from the proposal (#47), and initially only supporting a data model representation of messages. Support for the MF2 data model has always been a core part of this proposal, as it's able to represent messages in all other current localization formats, enabling their users to use the formatting runtime. This received some tentative support from TG1, but with a request and expectation for further discussion in TG2 before bringing the matter back to TG1 for advancement to Stage 2.

To validate the approach, I've written and published an MF2 syntax → data model parser that minifies and compresses to about 2.2kB of JavaScript.
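For readers less familiar with the distinction between syntax and data model, here is a minimal sketch of what "a data model representation of messages" can look like in practice. The object shape is a simplified assumption loosely modelled on the MF2 data model, and `formatPattern` is a hypothetical helper; neither is taken from this proposal's spec text or from the published parser.

```javascript
// Hypothetical, simplified data model for the MF2 source: Hello {$name}!
// Field names are illustrative assumptions, not the normative MF2 data model.
const helloMessage = {
  type: 'message',
  pattern: [
    'Hello ',
    { type: 'expression', arg: { type: 'variable', name: 'name' } },
    '!'
  ]
};

// Minimal formatter: walks the pattern and substitutes variable values.
// It ignores functions, selectors, and locale handling entirely.
function formatPattern(message, values) {
  return message.pattern
    .map((part) => {
      if (typeof part === 'string') return part;
      if (part.type === 'expression' && part.arg?.type === 'variable') {
        const value = values[part.arg.name];
        // Fall back to a {$name} placeholder when no value was supplied.
        return value === undefined ? `{$${part.arg.name}}` : String(value);
      }
      throw new Error(`Unsupported pattern part: ${JSON.stringify(part)}`);
    })
    .join('');
}

const greeting = formatPattern(helloMessage, { name: 'World' });
console.log(greeting);
```

A parser for any other localization format could, in principle, emit the same kind of structure, which is the "on-ramp" idea discussed in this thread.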

In TG2, this was discussed during the 2024-02-22 and 2024-03-28 (notes not yet published) calls, during which the Google internationalization team raised concerns about the proposed approach (quoting @sffc):

  1. First, the primary deliverable artifact for the WG is the syntax, not the data model. The syntax is subject to a significant amount of design work, and is the primary deliverable.
  2. Second, we believe Intl.MessageFormat is only useful with a serialized form. The lack of this would result in users inventing non-standard competing serialized forms.
  3. Third, having the data model alone lends itself to JSON serialization. The verbosity of JSON form is an impediment to authoring, a concern we spent considerable time trying to prevent.
  4. Fourth, the processing of the data model format into a string is relatively easy, which doesn't meet the bar for an ECMA-402 proposal.
  5. Fifth, we are still expecting changes to the data model based on user feedback on the technical preview implementations in ICU; we have already received feedback from the ICU-TC from an initial cursory review. Stabilizing it in ECMA-402 is premature at this point in time.

While some of these positions could be argued about (and I'll do so below in a separate comment), overall this does leave the proposal as it currently stands unable to proceed in TC39 until there's support for a syntax parser.

Rather unfortunately, this experience also leaves me personally somewhat disillusioned about TC39 being the right forum for advancing Web localization, and I'll need to reconsider where to spend my energies going forward.

eemeli commented 7 months ago

Replying here to the concerns raised by @sffc and @FrankYFTang during the call:

First, the primary deliverable artifact for the WG is the syntax, not the data model. The syntax is subject to a significant amount of design work, and is the primary deliverable.

To be precise, the data model is literally the first deliverable of the Unicode MessageFormat WG. It is also the part of the specification on which we've historically by far spent the most time, and which has been a significant driver for the syntax design.

It is of course valid to note that the syntax is an essential part of the whole package, but as noted in the WG's deliverables, it is a "formal definition of the canonical syntax for representing the data model", and it is effectively worthless without a data model that assigns meaning to its parts.

Second, we believe Intl.MessageFormat is only useful with a serialized form. The lack of this would result in users inventing non-standard competing serialized forms.

This is the current web reality, where there is no standardized serialized form for messages, and so everyone is already inventing their own. To account for that, message data model support has been a part of this proposal from the beginning, to provide an on-ramp for users of all current localization formats. Initially dropping the syntax parser would mean that we only provide that "on-ramp", so that standardization can start from the formatting runtime, and tools building on top of that.

I would also challenge the implicit assertion that "inventing non-standard competing serialized forms" is a bad idea, if and as a common data model would ensure that they are all compatible with each other. A core assertion made in TG1 is that it's not certain that the current MF2 syntax is the best possible, and will be universally adopted. If that's true, then forcing its adoption will ultimately lead to a sub-optimal end result.

Third, having the data model alone lends itself to JSON serialization. The verbosity of JSON form is an impediment to authoring, a concern we spent considerable time trying to prevent.

The data model is also serializable as MF2 syntax, which would make much more sense for editing. It is of course possible to also serialize it as JSON, but why should this be expected, in particular as noted that this would be an impediment to authoring?

Serializing the data model to JSON for e.g. network transmission does absolutely make sense (in which case its repetitive verbosity is effectively compressed away), and may make sense for compiled data, but why should anyone use the JSON serialized form as their source of truth and directly work with it?

Fourth, the processing of the data model format into a string is relatively easy, which doesn't meet the bar for an ECMA-402 proposal.

Frankly, I'm not sure that I understand this objection. Having written a polyfill for the proposal, I'd like to note that the parser is about 1/3 of the total size, and the formatter is the remaining 2/3. The spec text that's included in this proposal is currently 57kB, 1161 lines. The PR dropping parsing from the proposal drops 6 lines, and modifies 4 others.

This proposal not only deals with formatting into a string, but also parts, and defines how users may define and use custom functions within the messages. Getting all of this right, while also correctly accounting e.g. for bidirectional isolation (#30) is challenging. The formatting parts of this specification (i.e. the vast majority of it) seek to provide a simple, user-friendly solution that works for users at all levels, and helps ensure that they do not make early mistakes that they'll need to pay for later.
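As a rough illustration of why formatting to parts with bidirectional isolation involves more than string concatenation, the sketch below wraps each placeholder value in U+2068 FIRST STRONG ISOLATE and U+2069 POP DIRECTIONAL ISOLATE. The pattern shape, the part type names, and the `toParts` helper are all assumptions made for this example; they are not the API defined by the proposal.

```javascript
const FSI = '\u2068'; // FIRST STRONG ISOLATE
const PDI = '\u2069'; // POP DIRECTIONAL ISOLATE

// Hypothetical pattern: literal strings mixed with { name } placeholders.
const pattern = ['Hello ', { name: 'user' }, '!'];

// Hypothetical helper: produces a flat list of parts, isolating each
// placeholder value so its text direction cannot affect its surroundings.
function toParts(patternElements, values) {
  const parts = [];
  for (const el of patternElements) {
    if (typeof el === 'string') {
      parts.push({ type: 'literal', value: el });
    } else {
      parts.push({ type: 'bidiIsolation', value: FSI });
      parts.push({
        type: 'string',
        source: `$${el.name}`,
        value: String(values[el.name])
      });
      parts.push({ type: 'bidiIsolation', value: PDI });
    }
  }
  return parts;
}

const parts = toParts(pattern, { user: 'World' });
// Joining the part values gives the plain formatted string.
const formatted = parts.map((part) => part.value).join('');
console.log(parts.length, JSON.stringify(formatted));
```

A real implementation also has to handle custom functions, non-string values, and messages whose base direction is known in advance, none of which this sketch attempts.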

Fifth, we are still expecting changes to the data model based on user feedback on the technical preview implementations in ICU; we have already received feedback from the ICU-TC from an initial cursory review. Stabilizing it in ECMA-402 is premature at this point in time.

At no point has it been proposed that the "tech preview" version of the syntax or data model is stabilized in ECMA-402. The intent and plan at all times has been to stabilize with a version that has been announced as final by Unicode, and to which the spec's stability policy applies.

Jack-Works commented 7 months ago

I wonder: if MF2 gets into the Unicode standard (?), will it be in the language? Regardless, thanks for your work on this proposal. I have supported it (although without taking any action) from the early days, and I hope a really good format can take the lead.

sffc commented 7 months ago

Hi @eemeli, I just want to emphasize up front that I'm excited about the prospect of MessageFormat 2.0 in the Web Platform. It's just that the removal of the string syntax was concerning to my team.

To respond to your responses:

First, the primary deliverable artifact for the WG is the syntax, not the data model. The syntax is subject to a significant amount of design work, and is the primary deliverable.

To be precise, the data model is literally the first deliverable of the Unicode MessageFormat WG. It is also the part of the specification on which we've historically by far spent the most time, and which has been a significant driver for the syntax design.

It is of course valid to note that the syntax is an essential part of the whole package, but as noted in the WG's deliverables, it is a "formal definition of the canonical syntax for representing the data model", and it is effectively worthless without a data model that assigns meaning to its parts.

The CLDR-TC resolution on 2022-04 makes clear that the syntax is the primary deliverable. Since then, the majority of time has been spent on syntax, and the data model has been driven by syntax concerns more than the other way around. We acknowledge that the data model is part of the package along with the syntax, but the list of goals cited above is not a ranking.

Second, we believe Intl.MessageFormat is only useful with a serialized form. The lack of this would result in users inventing non-standard competing serialized forms.

This is the current web reality, where there is no standardized serialized form for messages, and so everyone is already inventing their own. To account for that, message data model support has been a part of this proposal from the beginning, to provide an on-ramp for users of all current localization formats. Initially dropping the syntax parser would mean that we only provide that "on-ramp", so that standardization can start from the formatting runtime, and tools building on top of that.

It's not clear to us what the data model formatter brings to the table. A data model alone does not incentivize adoption, because it alone is not a complete working solution. Other existing syntaxes are themselves already part of their own library, and although it's nice that they can map to the data model, it's not clear why users would change their behavior.

I would also challenge the implicit assertion that "inventing non-standard competing serialized forms" is a bad idea, if and as a common data model would ensure that they are all compatible with each other. A core assertion made in TG1 is that it's not certain that the current MF2 syntax is the best possible, and will be universally adopted. If that's true, then forcing its adoption will ultimately lead to a sub-optimal end result.

Having a canonical syntax is crucial for interchange, which is why the MF Working Group has spent so much time designing it. But even so, it's poor motivation if the purpose of the proposal is to open the door for the invention of new serialized forms.

It's an understandable position that TG1 wants to see results before standardizing on the syntax, and I believe there are ways to demonstrate this to TG1.

Third, having the data model alone lends itself to JSON serialization. The verbosity of JSON form is an impediment to authoring, a concern we spent considerable time trying to prevent.

The data model is also serializable as MF2 syntax, which would make much more sense for editing. It is of course possible to also serialize it as JSON, but why should this be expected, in particular as noted that this would be an impediment to authoring?

Serializing the data model to JSON for e.g. network transmission does absolutely make sense (in which case its repetitive verbosity is effectively compressed away), and may make sense for compiled data, but why should anyone use the JSON serialized form as their source of truth and directly work with it?

The JS standard library contains JSON.stringify(), which would become the easiest way to serialize these data models. History has shown that developers opt for the easy solution when not presented with reasonable alternatives. This serialized form is not one that we want to see proliferate.
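To make the verbosity concern concrete, the sketch below compares a one-line message in MF2 syntax with the output of `JSON.stringify()` for a corresponding data model object. The object shape is a simplified assumption for illustration, not the normative MF2 data model.

```javascript
// The message as it would be authored in MF2 syntax.
const syntaxSource = 'Hello {$name}!';

// Hypothetical, simplified data model for the same message.
const dataModel = {
  type: 'message',
  pattern: [
    'Hello ',
    { type: 'expression', arg: { type: 'variable', name: 'name' } },
    '!'
  ]
};

// JSON.stringify() is the path of least resistance for serialization.
const jsonSource = JSON.stringify(dataModel);

// The JSON form round-trips losslessly, but is much longer to read and write.
const roundTripped = JSON.parse(jsonSource);

console.log(syntaxSource);
console.log(jsonSource);
console.log(`syntax: ${syntaxSource.length} chars, JSON: ${jsonSource.length} chars`);
```

The length difference matters mostly for hand authoring. Whether it matters for transmission or compiled data is the point debated in this thread.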

Fourth, the processing of the data model format into a string is relatively easy, which doesn't meet the bar for an ECMA-402 proposal.

Frankly, I'm not sure that I understand this objection. Having written a polyfill for the proposal, I'd like to note that the parser is about 1/3 of the total size, and the formatter is the remaining 2/3. The spec text that's included in this proposal is currently 57kB, 1161 lines. The PR dropping parsing from the proposal drops 6 lines, and modifies 4 others.

This proposal not only deals with formatting into a string, but also parts, and defines how users may define and use custom functions within the messages. Getting all of this right, while also correctly accounting e.g. for bidirectional isolation (#30) is challenging. The formatting parts of this specification (i.e. the vast majority of it) seek to provide a simple, user-friendly solution that works for users at all levels, and helps ensure that they do not make early mistakes that they'll need to pay for later.

What I meant with this fourth item was that the algorithmic nature of this code brings into question the second requirement for Stage 2 advancement of ECMA-402 proposals ("Expensive to Implement in Userland"). Most ECMA-402 proposals are motivated by the fact that they have extensive data dependencies. The lack of a data dependency and the relatively small code size means that the code must be sufficiently complex in order to motivate it. It may be possible to demonstrate this complexity, and I would encourage adding a section to the README explaining why the data model formatter is expensive to implement in userland.

Fifth, we are still expecting changes to the data model based on user feedback on the technical preview implementations in ICU; we have already received feedback from the ICU-TC from an initial cursory review. Stabilizing it in ECMA-402 is premature at this point in time.

At no point has it been proposed that the "tech preview" version of the syntax or data model is stabilized in ECMA-402. The intent and plan at all times has been to stabilize with a version that has been announced as final by Unicode, and to which the spec's stability policy applies.

As I've suggested previously, I would like to see MessageFormat 2.0 reach a "stage 3 equivalent" (basically a final draft ready to be widely implemented) before Intl.MessageFormat reaches stage 2 in TC39. The tech preview comment period just opened, and there has not yet been enough time to engage with users, collect feedback, and respond to feedback. In other words, at this point in time, MessageFormat 2.0 is still in stage 2 and working toward stage 3.

littledan commented 7 months ago

There's a way forward for this proposal. This proposal meets a clear need for JS and web developers. We shouldn't ship it in browsers before it is ready, but development should be possible to continue.

It's really premature to say "several" years. At the previous TC39 meeting, we were beginning to develop experience-based rather than time-based criteria to assess maturity, which would be a better way to encourage continued investment.

The MF 2.0 working group has made a huge amount of progress on developing this new format and programming model based on an effort which started out working towards TC39's longstanding goals and has taken ECMA-402 as an important design point through its whole evolution.

When we see proposal champions burning out and quitting like this, it's a good time for reflection from the rest of us. Is there any way we could be more friendly and open to collaboration to avoid such outcomes, which are harmful to both the individual and the project? This has happened far too many times in TC39 after people work for years doing really excellent work, and we have to figure out how to address it.

sffc commented 7 months ago

I strongly support developing experience-based criteria to assess the maturity of the syntax.

sffc commented 7 months ago

As I've suggested previously, I would like to see MessageFormat 2.0 reach a "stage 3 equivalent" (basically a final draft ready to be widely implemented) before Intl.MessageFormat reaches stage 2 in TC39. The tech preview comment period just opened, and there has not yet been enough time to engage with users, collect feedback, and respond to feedback. In other words, at this point in time, MessageFormat 2.0 is still in stage 2 and working toward stage 3.

To be more specific about this Stage 2 timeline: the tech preview period for Unicode MF2 is open now. I expect it will take into the second half of 2024 to resolve the feedback, perhaps a bit longer depending on the nature of it. Once this happens, my team would be more comfortable with Intl.MF advancing in TC39 from the readiness point of view. I very much see the light at the end of the tunnel here.

One fundamental point on which I expect we differ a bit is that my team sees the Intl.MF proposal as something to come after the Unicode work, whereas I get the sense that @eemeli sees it as an additional opportunity to gather experience while things are still evolving. If this is correct, we could work together with the others in TC39 to craft a clear statement of what we would support advancing, and the things that need to happen to get there. I think developing this verbiage could be fruitful.

eemeli commented 7 months ago

I see ECMA-402 as one of the most significant implementations of MF2, especially as it's the original context and client for which the message format specification work was started in 2019. The current MF2 tech preview is intended primarily to gather feedback from implementers. The planned timeline from the Unicode MFWG point of view is for the tech preview to last one bi-annual CLDR release cycle, concluding with a final release of the MF2 spec next fall.

I would be very interested in TC39 finding a way to communicate to ECMA-402 implementers and other relevant parties that right now would be a Really Good time to look at, assess, and work with the MF2 spec. The capability to address any concerns later will be significantly limited by the MF2 stability policy, which will enter into force once the tech preview period ends, after which no breaking changes may be made.

I had thought that the right way to communicate this would involve advancing this proposal to Stage 2, as that in general tends to communicate a stronger expectation of the proposal eventually being adopted by JavaScript. Without that stage advancement, a strong statement by the committee to the same effect could be the next best thing.

Is there any past precedent for such a statement?

sffc commented 7 months ago

I see. I agree it would be nice to get feedback from browser implementers during this period. It's a pity that the stage advancement signal isn't aligning with the tech preview timeline, but given that two browser implementers are represented in this thread and supportive of the proposal, I feel like we don't need to wait to solicit such feedback.