tc39 / proposal-stable-formatting

A TC-39 proposal to bring stable Intl-inspired formatting options to ECMAScript

Consider a pure ECMA262 approach #12

Closed FrankYFTang closed 1 year ago

FrankYFTang commented 1 year ago

I believe the motivation of this proposal could and should be addressed by a pure ECMA262 solution. Consider the following changes to ECMA262:

Change 21.4.4.41 Date.prototype.toString ( ) https://tc39.es/ecma262/#sec-date.prototype.tostring

21.1.3.6 Number.prototype.toString ( [ radix ] ) https://tc39.es/ecma262/#sec-number.prototype.tostring

21.2.3.3 BigInt.prototype.toString ( [ radix ] ) https://tc39.es/ecma262/#sec-bigint.prototype.tostring

to

21.4.4.41 Date.prototype.toString ( [ options ] )

21.1.3.6 Number.prototype.toString ( [ radix ] [ , options ] )

21.2.3.3 BigInt.prototype.toString ( [ radix ] [ , options ] )

and specify how these three functions should read the options and create the formatted result string differently

The options read and respected by Date.prototype.toString will be only a subset of what toLocaleString accepts. For example, it will NOT read "localeMatcher", "calendar", "numberingSystem", "hour12", "dateStyle", and "timeStyle", but will read "hourCycle" and "timeZone". The proposal could decide which of the components listed in https://tc39.es/ecma402/#table-datetimeformat-components are read.

The options read and respected by Number.prototype.toString and BigInt.prototype.toString will be only a subset of what toLocaleString accepts. For example, they will NOT read "localeMatcher", "numberingSystem", "style", "currency", "currencyDisplay", "currencySign", "unit", and "unitDisplay", but will read the other options listed in https://tc39.es/ecma402/#table-numberformat-resolvedoptions-properties
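
As a rough illustration of "read only a subset", the sketch below filters an options bag against an allow-list. The helper name `pickOptions` and the constant `DATE_OPTIONS_READ` are hypothetical and not part of any spec text. The allow-list holds only the two Date options named above, since the rest is left for the proposal to decide.

```javascript
// Hypothetical allow-list: the two Date options the comment says would be read.
const DATE_OPTIONS_READ = ["hourCycle", "timeZone"];

// Hypothetical helper: copy only the allowed keys, ignoring everything else
// (for example "calendar", "hour12", "dateStyle").
function pickOptions(options, allowed) {
  const picked = {};
  if (options === undefined || options === null) return picked;
  for (const key of allowed) {
    const value = options[key];
    if (value !== undefined) picked[key] = value;
  }
  return picked;
}

const picked = pickOptions(
  { timeZone: "UTC", calendar: "japanese", hour12: true },
  DATE_OPTIONS_READ
);
console.log(picked); // { timeZone: "UTC" }
```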

@zbraniecki @sffc

eemeli commented 1 year ago

At first glance, that looks like a potential solution, but I'm not sure yet whether there are cases that this wouldn't solve, so this will need more consideration.

I'd be open to including more than one possible solution in the explainer, so would welcome a PR adding something like this.

sffc commented 1 year ago

Yes I think this is a valid solution to consider. We should make sure this is added to the explainer and slides in advance of the Tokyo meeting.

zbraniecki commented 1 year ago

I am aligned with Frank's proposal. I think it looks cleaner than the ECMA-402 solution.

sffc commented 1 year ago

Although Frank's solution is one we should certainly consider, I do think we should compare this to the locale-based approach.

Here are some snags with the toString approach:

  1. The defaults for Intl.NumberFormat and Intl.DateTimeFormat are different. For example, Intl.NumberFormat has useGrouping enabled by default.
  2. toString is an extremely old API, so changing it carries a higher web compatibility risk. For example, I wouldn't be surprised if there's some web site somewhere that uses the extra argument for something, and it's been working for 20 years, and this proposal would break it.
  3. The radix argument gets in the way. Ideally it would have been an argument in the options bag.

Compare these two call sites:

(12345).toString(10, { useGrouping: "auto" })
(12345).toLocaleString(null)
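
For reference, here is a small sketch of how the first two snags look in engines today. It uses only existing APIs. The `null` locale is not called here because it is what the proposal would add, and "en" is an arbitrary locale chosen for illustration.

```javascript
// Snag 2: today an extra argument to Number.prototype.toString is silently ignored.
const viaToString = (12345).toString(10, { useGrouping: "auto" });
console.log(viaToString); // "12345"

// Snag 1: Intl.NumberFormat groups digits by default, toString does not.
const viaIntl = new Intl.NumberFormat("en").format(12345);
console.log(viaIntl); // "12,345"

// Grouping has to be switched off explicitly to match toString here.
const ungrouped = new Intl.NumberFormat("en", { useGrouping: false }).format(12345);
console.log(ungrouped); // "12345"
```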

The other potential advantage with the locale-based approach is that I expect it is easier to specify. If we add this stuff to Number.prototype.toString, we need to add all this spec to 262, and I don't want to duplicate logic between 262 and 402 if possible. This would be a hard editorial task. However, adding a null locale is fairly easy to specify: we can write down a specified version of the ILD algorithms throughout 402.

I worry that saying we need to modify Number.prototype.toString would make this proposal infeasible simply because that is a much harder task to specify, test, and align on.

sffc commented 1 year ago

Another thing: we get formatToParts automatically with the locale-based approach.

FrankYFTang commented 1 year ago

Another thing: we get formatToParts automatically with the locale-based approach.

But that is what you do NOT need for the given use case.

FrankYFTang commented 1 year ago

The other potential advantage with the locale-based approach is that I expect it is easier to specify. If we add this stuff to Number.prototype.toString, we need to add all this spec to 262

According to my understanding of this proposal, you want to clearly specify an algorithm that is not implementation-dependent, so such details need to be specified anyway. This approach actually reduces what you need to specify, since it is limited to Number, BigInt, and Date, without the need to specify every other Intl object (ListFormat, DisplayNames, Collator, RelativeTimeFormat, etc.). The surface area affected by a special locale value is much, much bigger than what three toString functions would impact. Therefore, the locale-based approach is much harder to specify if you consider the rest of the Intl objects.

FrankYFTang commented 1 year ago

The radix argument gets in the way. Ideally it would have been an argument in the options bag.

Actually, you can just change it to

21.1.3.6 Number.prototype.toString ( [ options ] )
21.2.3.3 BigInt.prototype.toString (  [  options ])

and first check the type of options. If the type of options is a number, then treat it as radix; if options is an object, then take a radix from the options and read the other values from it as well. This way, radix is not in the way.
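
A minimal sketch of that type check, written as a standalone function so that nothing built in is modified. The name `numberToStringSketch` is hypothetical. Passing any non-object argument through as a radix, exactly as today, is an assumption made here to keep existing calls working, not something the comment specifies.

```javascript
// Hypothetical stand-in for Number.prototype.toString ( [ options ] ).
function numberToStringSketch(value, optionsOrRadix) {
  if (typeof optionsOrRadix === "object" && optionsOrRadix !== null) {
    // Options bag: take the radix from it, defaulting to 10.
    const radix = optionsOrRadix.radix === undefined ? 10 : optionsOrRadix.radix;
    // Other formatting options would be read from the bag here.
    return Number.prototype.toString.call(value, radix);
  }
  // Anything else (including undefined) is handled as a radix, as it is today.
  return Number.prototype.toString.call(value, optionsOrRadix);
}

console.log(numberToStringSketch(255, 16));           // "ff"
console.log(numberToStringSketch(255, { radix: 2 })); // "11111111"
console.log(numberToStringSketch(255, {}));           // "255"
```

One compatibility detail: today the radix argument is coerced to a number, so `(255).toString("16")` already returns "ff", and an object with a `valueOf` method is also accepted as a radix. A real design would have to decide how such calls behave.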

FrankYFTang commented 1 year ago

2. uses the extra argument for something

I do not understand what this means. If a caller passes an extra argument to Date.prototype.toString(), Number.prototype.toString(), or BigInt.prototype.toString(), what do you think happens today?

If TC39 is concerned that this is risky, then you can just invent three new methods that do the same thing instead, such as

 Date.prototype.toFormattedString ( [options] )
 Number.prototype.toFormattedString ( [ options ] )
 BigInt.prototype.toFormattedString ( [ options ] )

FrankYFTang commented 1 year ago

@anba

zbraniecki commented 1 year ago

The other potential advantage with the locale-based approach is that I expect it is easier to specify.

I would argue that this should absolutely not be part of the evaluation. We do not build standard API around creeping on existing API for our convenience. Implementability is definitely an important factor, but choosing to piggyback on ECMA-402 because it's easier for us to specify, in light of concerns raised about which API surface this need should be filled by, seems like a dangerous mental model.

sffc commented 1 year ago

Another thing: we get formatToParts automatically with the locale-based approach.

But that is what you do NOT need for the given use case.

I'd argue that formatToParts is absolutely useful for the use cases (programmatic processing of formatted strings).
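
As a sketch of what "programmatic processing" can look like, the example below uses the existing `Intl.NumberFormat.prototype.formatToParts` with the "en" locale, chosen only for illustration. The proposed locale-independent mode is not shown, since it does not exist yet.

```javascript
// Each part carries a type, so callers can process the output
// without parsing the formatted string.
const parts = new Intl.NumberFormat("en").formatToParts(12345.5);

// Collect the integer digits while skipping group separators.
const integerDigits = parts
  .filter((part) => part.type === "integer")
  .map((part) => part.value)
  .join("");

const fractionDigits = parts
  .filter((part) => part.type === "fraction")
  .map((part) => part.value)
  .join("");

console.log(integerDigits);  // "12345"
console.log(fractionDigits); // "5"
```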

this approach actually reduces what you need to specify, since it is limited to Number, BigInt, and Date, without the need to specify every other Intl object (ListFormat, DisplayNames, Collator, RelativeTimeFormat, etc.)

I would think that RelativeTimeFormat and maybe ListFormat are in scope. Certainly DurationFormat is in scope, but we have a much nicer Temporal.Duration.prototype.toString API already.

first check the type of options, if the type of options is a number, then treat it as radix

Sounds dangerous for web compat but if it works then that seems like a decent option

If TC39 is concerned that this is risky, then you can just invent three new methods that do the same thing instead, such as Date.prototype.toFormattedString ( [options] )

I like this a lot as a third option to list out in the presentation.

I would argue that this should absolutely not be part of the evaluation. We do not build standard API around creeping on existing API for our convenience. Implementability is definitely an important factor, but choosing to piggyback on ECMA-402 because it's easier for us to specify, in light of concerns raised about which API surface this need should be filled by, seems like a dangerous mental model.

Acknowledged, although I think the ease of specification is correlated with the ease of comprehension. Developers already familiar with Intl develop their own mental model of the specification, so adding something like a null parameter leverages both existing spec text and mental models of it. Adding a big new chunk of specification means that much more for devs to learn. In other words, I would state that complexity of specification is inversely correlated with learnability.

Also, we know far too well from experience with Temporal that there is a very real implementation cost for spec complexity. Engines are less likely to implement something that requires a lot of spec text (in addition to it being harder to find someone to write that spec text).

graphemecluster commented 1 year ago

This solution doesn’t cover Intl.Collator and Intl.Segmenter, unless APIs are going to be added to the String prototype. There is already a note mentioning the presence of a generic algorithm, but as stated in #13, the way to only trigger the generic algorithm has not been defined yet. Grapheme segmentation is so common that Swift even made String work on grapheme clusters instead of by code unit. Thus, I do think it’s worthwhile to include a String.prototype.segment method. However, for completeness and symmetry, I would expect new Intl.Segmenter(null).segment to behave the same way.

zbraniecki commented 1 year ago

What would be the purpose of non-localizable collation? I understand date formatting, or even number formatting, because we have formats such as ISO.

What does it even mean to sort in a locale independent way? How do you segment a script without taking into account what locale it is in?

sffc commented 1 year ago

One can already use plain lexicographic sorting for a stable sort that is not locale dependent (useful for algorithmic applications like a binary search tree).

Agree that there may be a use case for grapheme segmentation in 262, although it would still be dependent on the Unicode version. Other types of segmentation are not well defined without a language.

graphemecluster commented 1 year ago

At least for a site with unknown language, double-clicking and triple-clicking any text still selects the current word or sentence (and the user-select CSS property changes the behavior).

One can already use plain lexicographic sorting for a stable sort that is not locale dependent (useful for algorithmic applications like a binary search tree).

ECMAScript still has no method to sort strings by code point instead of code unit. See this comment for an example.
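
To illustrate the difference, here is a sketch of a code point comparator next to the default sort, which compares UTF-16 code units. The function name `compareByCodePoint` is hypothetical. The sample strings are U+FF5E (one code unit) and U+1F600 (a surrogate pair beginning with 0xD83D).

```javascript
// Hypothetical comparator: walk both strings code point by code point.
function compareByCodePoint(a, b) {
  const iteratorA = a[Symbol.iterator]();
  const iteratorB = b[Symbol.iterator]();
  while (true) {
    const stepA = iteratorA.next();
    const stepB = iteratorB.next();
    if (stepA.done && stepB.done) return 0;
    if (stepA.done) return -1; // a is a prefix of b
    if (stepB.done) return 1;  // b is a prefix of a
    const difference = stepA.value.codePointAt(0) - stepB.value.codePointAt(0);
    if (difference !== 0) return difference;
  }
}

const fullwidthTilde = "\uFF5E";
const grinningFace = "\u{1F600}";

// Default sort compares code units: 0xD83D sorts before 0xFF5E.
const byCodeUnit = [fullwidthTilde, grinningFace].sort();
// The comparator compares code points: 0xFF5E sorts before 0x1F600.
const byCodePoint = [grinningFace, fullwidthTilde].sort(compareByCodePoint);

console.log(byCodeUnit[0] === grinningFace);    // true
console.log(byCodePoint[0] === fullwidthTilde); // true
```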

hsivonen commented 1 year ago

(This would better fit under issue #13, but replying here, since the questions are here.)

What would be the purpose of non localizable collation?

Primarily, my thinking is: It's there, and currently, to invoke it, you need to know which specific languages reach it. It would make sense to be able to reach it explicitly instead of ECMA-402 implementations taking steps to suppress the ability to use und for the underlying back end. So, similar to avoiding the kind of thing like "to get ISO date formatting, use Swedish", it would be logical to have a way to explicitly reach the root collation instead of "to get the root collation, use English". It would be weird not to be able to reach the root collation explicitly if we're enabling non-locale-specific instantiation of the rest of the Intl objects.

It is certainly imaginable that in a sufficiently multilingual context it would be reasonable to use a sort order that's more human-oriented than lexicographic but not specific to any locale.

To the extent people care about the emoji order (and I'm not sure how much the Web really cares), I think it's bad to give Web developers the incorrect idea that -u-co-emoji combines with any language when it actually combines only with languages that use the root as-is (English, French, non-phonebook German, Italian, etc.). Teaching that you reach the emoji order via und-u-co-emoji wouldn't imply that you can combine it with any language whereas teaching en-u-co-emoji is suggestive of it combining with any language.

Likewise for eor, though it's even less clear to me what the level of demand for eor is compared to emoji.

How do you segment a script without taking into account what locale it is in?

For grapheme cluster segmentation, it seems very clear to me that untailored extended grapheme clusters are the answer. Swift was mentioned above. I think it's entirely reasonable to think that a future arrangement of compiling Swift to Wasm could want the Web Platform to provide a segmentation mode that matches Swift's string iteration mode, which means that the Web Platform should guarantee a way to instantiate the grapheme segmenter without tailorings.
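
As a sketch of what grapheme segmentation gives today, the example below uses the existing `Intl.Segmenter`. The "en" locale is passed only because no explicitly generic instantiation is defined yet, and the example assumes "en" carries no grapheme tailorings.

```javascript
// Grapheme segmentation with an explicit locale, as is required today.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// "e" + combining acute accent, followed by a regional-indicator flag pair.
const text = "e\u0301\u{1F1EF}\u{1F1F5}";
const clusters = Array.from(segmenter.segment(text), (entry) => entry.segment);

console.log(text.length);     // 6 (UTF-16 code units)
console.log(clusters.length); // 2 (extended grapheme clusters)
```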

It would be particularly bad to get a situation where first, for some years, no locale gets grapheme segmentation tailorings and the Web comes to rely on instantiating the grapheme segmenter matching Swift, and then at a later date some locale gains tailorings and the Web breaks for users for whom that locale is the host locale. (Realistically, though, we may be headed so far in that direction already that if any locale ever gains grapheme segmentation tailorings, enabling those tailorings for the host locale becomes impractical, and the segmenter will need to require a new argument or a new locale matcher type to enable grapheme segmentation tailorings regardless of whether an explicitly generic instantiation is specced now.)

Realistically, one can trust en never to gain grapheme segmentation tailorings, but I think having an explicit generic mode would be better than relying on the notion that en is generic, considering that date formatting is already a counter-example to en always being the generic locale.

For the other segmentation modes, it's remarkable how close they are to being able to handle all of them at the same time without explicit locale.