Data-Driven API - Githubissues

sffc commented 6 years ago

It can be said that the challenge of providing i18n services can be split into two concepts:

Data: One needs access to a database of locale data.
Logic: Once the data is provided, there needs to be a way to process it.

In the i18n world, as well as in software in general, people like to be able to design their own logic. There are already dozens of wrappers over Ecma 402. It is not hard to find examples of clients who reverse-engineer i18n libraries to "extract" the data out of them; I can provide some examples.

Right now, the Ecma 402 APIs are all "logic" APIs. I suggest that we consider breaking the APIs into the two concepts: data and logic. The existing APIs need not change; I suggest simply adding a new data API, and redefining the spec for the logic functions to be in terms of the data. The data format can be defined by the Unicode specification UTS 35, which is supported by another standards body.

The advantages of doing this include:

Clients can write their own i18n logic on top of Ecma 402's data, without needing to reverse-engineer the built-in logic APIs.
We can make it easy for clients to swap in their own data source to replace the Ecma 402 data.
The specs can be more clear, since the logic API can be a relatively straightforward definition on top of the data API, and the data API can refer to parts of UTS 35.

The API can be as simple as something like Intl.Data.getNumberPattern(locale) or Intl.Data.getDateTimePattern(locale, skeleton). The methods can return a promise or take a callback, so that the user can supply an asynchronous drop-in replacement.
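As a rough illustration of the shape such a data API might take, here is a minimal user-land sketch. `IntlData`, its method, and the pattern table are hypothetical stand-ins (no `Intl.Data` namespace exists in Ecma 402), and the right-to-left subtag stripping is a deliberately naive substitute for real locale fallback. The patterns use UTS 35 number-pattern syntax and are included for illustration only.

```javascript
// Hypothetical stand-in for the proposed Intl.Data namespace; not a real Ecma 402 API.
// Illustrative pattern table in UTS 35 number-pattern syntax.
const NUMBER_PATTERNS = new Map([
  ["en", "#,##0.###"],
  ["hi", "#,##,##0.###"],
]);

const IntlData = {
  // Async so that a replacement implementation could fetch its data lazily.
  async getNumberPattern(locale) {
    // Naive fallback: strip subtags from the right until a known locale is found.
    let tag = locale;
    while (tag) {
      if (NUMBER_PATTERNS.has(tag)) {
        return { locale: tag, pattern: NUMBER_PATTERNS.get(tag) };
      }
      const cut = tag.lastIndexOf("-");
      tag = cut === -1 ? "" : tag.slice(0, cut);
    }
    return null; // no data available for this locale
  },
};
```

Calling `IntlData.getNumberPattern("en-GB")` resolves to the `en` entry in this sketch, and an unknown language resolves to `null`.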

rxaviers commented 6 years ago

The theory sounds good, but the practical benefits aren't clear to me. Are you suggesting exposing all CLDR data through this API, or only a subset? If a subset, which one? Could you cite examples or use cases where this would be useful, please?

rxaviers commented 6 years ago

Clarification: I can see value in exposing some data, such as display names. My confusion is basically the scope.

caridy commented 6 years ago

The real problem here is backward compatibility. I don't think backward compatibility (forever) is in the charter of UTS 35 or any other i18n data provider, while it is in the DNA of JavaScript and the Web. Instead, we are aiming for a set of low-level APIs that can help you build abstractions that rely on the data you mentioned, but without exposing the data directly. Yes, it is more complicated and less flexible, but it has two very nice effects:

it is always backward compatible
it promotes the usage of good patterns for the web

sffc commented 6 years ago

CLDR has a lot of data, and it often has messy fallback rules. I was thinking that our API would be "CLDR++", where we only expose a subset of data useful for JavaScript users and take care of locale fallbacks and other intricacies of CLDR data loading under the hood. And of course if you wanted to use a data source that isn't CLDR, you're welcome to do so as long as you expose the same API.

For stability, if UTS 35 doesn't suffice, I don't see anything necessarily wrong with re-specifying the format of the subset of UTS 35 data that we provide through Ecma 402.

msaboff commented 5 years ago

(seven image attachments; contents not recoverable from the text)

sffc commented 5 years ago

@indexzero

indexzero commented 5 years ago

Thanks for including me @sffc – would love to help get involved on this issue.

I will admit that I am coming at this from a pragmatic point of view:

We use react-intl extensively. It is king in its small, framework-bound domain (see: npmtrends).
react-intl expects localeData, as do some of its key dependencies:

The intl-{message,relative}format libraries are ponyfills that state their intention to remain up-to-date with ECMA-402 along with some additional features. Whether or not those additional features are good or bad features they illustrate the value of exposing the data in a more granular fashion. That is, there will inevitably be features built on top of Intl APIs that need to access data not currently available.

By empowering that goal we make i18n easier for applications and developers. I have seen an enormous amount of time spent bikeshedding on the most optimal way to deliver CLDR data into browsers to initialize react-intl. It would be interesting to hear from other ecosystem projects which may have similar concerns.

In what ways these ecosystem libraries will need data access remains a question for me. The data access by react-intl and its dependencies is sparse for certain edge cases, yet the library forces consumers to provide all of the CLDR data.

Perhaps reaching out to some of the folks who maintain these libraries is a good next step? Forgive me if you folks have / are already chatting with them.

sffc commented 5 years ago

Some more ideas I had.

There are cases where the user wants to provide their own data but use the browser's built-in logic, and vice-versa. If we can define a stable data language, similar to what's provided by LDML, then we can decouple that in JavaScript.

Here's an example of how a programmer could use their own data with the browser's algorithm. They give their data provider to a factory that asynchronously constructs an Intl.NumberFormat using that data provider instead of the browser's default data provider:

const dataProvider = new MyDataProvider(); // any user-land object implementing the data provider interface
const factory = new Intl.Data.Factory(dataProvider);
const fmt = await factory.createNumberFormat("ml", { style: "percent" });

The data provider interface could be as simple as: async get(localeList, xpath) returns the data at the specified xpath and the best matching locale. We would define the space of valid xpaths, which could be similar to LDML. The browser could expose this API:

const { locale, data } = await Intl.Data.defaultProvider.get(
    ["ff", "ar"], "/numbers/decimalFormats@numberSystem=latn/pattern");

If the user wants to provide their own data only when the browser doesn't have the data for that locale, they could write something along the lines of,

class MyDataProvider {
  async get(localeList, xpath) {
    const browserResult = await Intl.Data.defaultProvider.get(localeList, xpath);
    const requested = (typeof localeList === "string") ? localeList : localeList[0];
    if (browserResult.locale !== requested) {
      // call custom data service and return that result
    } else {
      return browserResult;
    }
  }
}
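
To try this fallback pattern outside a browser, a self-contained sketch might look like the following. Everything here is an assumption made for illustration: `defaultProvider` stands in for the hypothetical `Intl.Data.defaultProvider`, `customData` stands in for a custom data service, and the data tables are made up.

```javascript
const XPATH = "/numbers/decimalFormats@numberSystem=latn/pattern";

// Stand-in for the hypothetical Intl.Data.defaultProvider; it only has data for "ar".
const defaultProvider = {
  data: new Map([["ar", new Map([[XPATH, "#,##0.###"]])]]),
  async get(localeList, xpath) {
    const locales = typeof localeList === "string" ? [localeList] : localeList;
    // Return the first requested locale for which data exists.
    for (const locale of locales) {
      const entry = this.data.get(locale);
      if (entry && entry.has(xpath)) return { locale, data: entry.get(xpath) };
    }
    return { locale: "und", data: undefined };
  },
};

// Stand-in for a custom data service that happens to cover "ff" (made-up data).
const customData = new Map([["ff", new Map([[XPATH, "#,##0.###"]])]]);

class FallbackProvider {
  async get(localeList, xpath) {
    const browserResult = await defaultProvider.get(localeList, xpath);
    const requested = typeof localeList === "string" ? localeList : localeList[0];
    if (browserResult.locale === requested) return browserResult;
    // The built-in data missed the first-choice locale, so consult the custom source.
    const custom = customData.get(requested);
    if (custom && custom.has(xpath)) {
      return { locale: requested, data: custom.get(xpath) };
    }
    return browserResult; // neither source covers the first choice
  }
}
```

With these tables, requesting `["ff", "ar"]` is answered from the custom data as `ff`, while requesting `"ar"` is answered by the stand-in default provider.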

longlho commented 4 years ago

Thanks @sffc for redirecting me here. Since @indexzero mentioned react-intl, which I happen to maintain (and which Dropbox also uses), I'd like to provide some context here:

formatjs polyfills are still used even on browsers that natively support the features, just to load CLDR data since browsers don't come with all the locales, same thing with currency.
As @indexzero mentioned, we spend a significant amount of effort merging data for the same language, deduplicating it based on the parent locale hierarchy, and packing it. The polyfills we wrote then know how to unpack the data.
Packing and unpacking CLDR data is crucial to the distribution pipeline and is common practice, similar to how momentjs packs and unpacks IANA data.
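
The deduplication step described above can be sketched roughly as follows. This is not formatjs's actual packing format; the `PARENTS` table, the flat key/value data shape, and the sample values are assumptions for illustration, and the sketch assumes a child locale never removes a key that its parent defines.

```javascript
// Hypothetical parent table, listed parents-first so iteration order is safe.
const PARENTS = new Map([
  ["es", null],
  ["es-419", "es"],
  ["es-MX", "es-419"],
]);

// Resolve a locale's full data by layering its own entries over its ancestors'.
function resolve(data, locale) {
  const parent = PARENTS.get(locale);
  const inherited = parent ? resolve(data, parent) : {};
  return { ...inherited, ...(data[locale] || {}) };
}

// Pack: keep only the values that differ from what the parent chain already supplies.
function pack(full) {
  const packed = {};
  for (const locale of PARENTS.keys()) {
    const parent = PARENTS.get(locale);
    const inherited = parent ? resolve(packed, parent) : {};
    packed[locale] = {};
    for (const [key, value] of Object.entries(full[locale] || {})) {
      if (inherited[key] !== value) packed[locale][key] = value;
    }
  }
  return packed;
}

// Unpack: rebuild the full per-locale data from the packed form.
function unpack(packed) {
  const full = {};
  for (const locale of PARENTS.keys()) full[locale] = resolve(packed, locale);
  return full;
}

// Illustrative sample data only.
const sample = {
  es: { decimal: ",", group: "." },
  "es-419": { decimal: ".", group: "," },
  "es-MX": { decimal: ".", group: "," },
};
const packedSample = pack(sample);
```

In this sample, `es-MX` packs down to an empty object because everything it needs is inherited from `es-419`, and `unpack(packedSample)` restores the original values.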

I think at a high level what could help the workflow above is:

Expose locale negotiation, so we don't have to bundle things like legacy alias and parent locale (zh-CN -> zh-Hans-CN -> zh-Hans -> zh). This allows us to locate at least the correct language.
Ability to load CLDR data per language (not per locale).
Nice to have: packed data format.
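
Part of the chain in the first item can be approximated today with the existing `Intl.Locale` API, whose `maximize()` method adds likely subtags. The helper below is a sketch under that assumption: `fallbackChain` is a hypothetical name, and simple truncation does not reproduce CLDR's explicit parent-locale or legacy-alias data, which is exactly the part that still has to be bundled.

```javascript
// Build a lookup chain: requested tag, maximized tag, language-script, language.
// Truncation only; CLDR's explicit parentLocales and alias data are not consulted.
function fallbackChain(tag) {
  const requested = new Intl.Locale(tag);
  const max = requested.maximize(); // e.g. zh-CN becomes zh-Hans-CN
  const chain = [requested.toString(), max.toString()];
  if (max.script) chain.push(`${max.language}-${max.script}`);
  chain.push(max.language);
  return [...new Set(chain)]; // drop duplicates, keep first occurrence
}

console.log(fallbackChain("zh-CN").join(" -> "));
// prints "zh-CN -> zh-Hans-CN -> zh-Hans -> zh"
```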

sffc commented 4 years ago

See #87 for some discussion on your first bullet ("locale negotiation").

sffc commented 4 years ago

My feelings on this issue are going back and forth.

On the one hand, it is nice to give app developers the power to add more data when the browser provides insufficient feature or locale coverage. On the other hand, the design of Intl is for it to be "best-effort" and easy to use (hard to abuse), and this thread has raised several good points that injecting data into Intl at runtime adds a significant amount of complexity.

I know that Chrome is working long-term on dynamically adding data for new locales. I think Firefox has a similar effort. By keeping the data exchange in the browser engine, Intl's handling of CLDR data remains transparent to the user, which seems like a desirable property.

ljharb commented 4 years ago

Without the ability to inject the data, polyfilling new data requires replacing almost every single Intl method; with that ability, all the methods may be correct already and just need new backing data.

sffc commented 4 years ago

Is it possible to have a function detect whether it is being called in a sync or async context? For example, could await Intl.DateTimeFormat() have different behavior than Intl.DateTimeFormat()? @ljharb

I'm just trying to think of unobtrusive ways to add data loading to the API. It would be nice if you could do the following, but it's not clear whether that is possible without breaking the web.

let dtf = await Intl.DateTimeFormat();
console.log(dtf.format(x));

One option @ljharb suggested was something like the following. It doesn't require changing the constructor, but it would give the otherwise immutable Intl.DateTimeFormat object two "states", one where data is present and one where it is not.

let dtf = new Intl.DateTimeFormat();
await dtf.load();
console.log(dtf.format(x));

We could add a new namespace for the async-enabled constructors, like Intl.Async. The new namespace would have all of the same constructors as the Intl namespace, except that they return promises that resolve to "normal" objects.

let dtf = await Intl.Async.DateTimeFormat();
console.log(dtf.format(x));
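
A user-land approximation of this namespace idea could look like the sketch below. `IntlAsync` and `loadLocaleData` are hypothetical names; the loader is a placeholder that does no real work, and the formatter it resolves to is an ordinary `Intl.DateTimeFormat`.

```javascript
// Hypothetical data loader; a real one might fetch locale data over the network.
async function loadLocaleData(locales) {
  await Promise.resolve(); // placeholder for asynchronous work
}

// User-land stand-in for the proposed Intl.Async namespace (not a real API).
const IntlAsync = {
  async DateTimeFormat(locales, options) {
    await loadLocaleData(locales);
    // Once the data is available, hand back a normal, synchronous formatter.
    return new Intl.DateTimeFormat(locales, options);
  },
};

async function demoAsyncNamespace() {
  const dtf = await IntlAsync.DateTimeFormat("en-US", { timeZone: "UTC", year: "numeric" });
  return dtf.format(new Date(Date.UTC(2020, 0, 1)));
}
```

Code that receives the resolved object can keep calling `.format()` synchronously, which is the main attraction of this option.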

Or, we could put data loading into the terminal format method. The downside here is that you put async operations into a function that was never async before, so it might be harder to use as a drop-in replacement. For example, if you have to pass your object as an argument to some other function, that function needs to know whether to use the async version of the terminal method.

let dtf = new Intl.DateTimeFormat();
console.log(await dtf.asyncFormat(x));

// problem if you have to pass dtf to a function like this
function doStuffWithDateTimeFormat(dtf) {
  // should this function use .format() or .asyncFormat() ?
}

ljharb commented 4 years ago

You can't usefully detect that, no, and if you could it would break use cases where people don't await immediately but still do something with the promise.

If a constructor returns a promise, then instanceof will fail until it's awaited, which would be confusing.
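
The instanceof concern can be reproduced with a small stand-alone class. `AsyncFormatter` is a hypothetical example class, not a proposed API; it relies only on the standard rule that an object returned from a constructor replaces `this` as the result of `new`.

```javascript
class AsyncFormatter {
  constructor() {
    // Returning an object from a constructor overrides the default `this` result.
    return Promise.resolve(this);
  }
}

const pending = new AsyncFormatter();
const beforeAwait = pending instanceof AsyncFormatter; // false: `pending` is a Promise
const isPromise = pending instanceof Promise;          // true

async function checkAfterAwait() {
  const formatter = await pending;
  return formatter instanceof AsyncFormatter;          // true once awaited
}
```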

tc39 / ecma402

Data-Driven API #210