Fixes #66 Manage CLDR as peer dependency

rxaviers commented 9 years ago

Ongoing summary:

[x] Manage CLDR data as peer dependency
[ ] Avoid duplicating CLDR data / avoid mirroring CLDR data in here
- [x] #73 -- @rxaviers
- [ ] Fix failing tests -- @JCEmmons
- [ ] Have CLDR to include supplemental/aliases.json or supplemental/localeAliases.json via http://unicode.org/cldr/trac/ticket/8040 -- @JCEmmons
- [ ] post-install hook to filter all the fields required by a given library (cldr/config now lives in cldr-extra/config)
- [ ] define the filtering syntax. (needs review)
- [ ] implement it.
- [ ] There is also (at least) one hardcoded list of locales in ecma402 tests that may also need to be changed.

Follow up:

[ ] Related question, json-full.zip brings duplicate identical data, e.g., en and en-US. They are identical given likelySubtags. Also, removing en-US likely subtags gives en. @JCEmmons, please correct me if I'm wrong. Why are they included? Shouldn't en suffice? This also applies to pt vs. pt-BR and probably a bunch of others.

Manage CLDR data as peer dependency

Since ibm-js/ecma402 already suggests bower usage to manage and install dependencies, the first change this PR does is to also manage cldr-data using the same tool.

// bower.json
{
  "dependencies": {
    "cldr-data": ">= 25",
    ...
  }
}

At this point, user is able to spot conflicting cldr-data peer dependency between this and other i18n libraries that leverage cldr-data (for example jquery/globalize, which has parsing methods that complements ecma402).

Avoid duplicating CLDR data everywhere

The second step this PR could take, this is optional, is to rely on the above bower setup to avoid having CLDR data embedded in this library.

The benefits of this change are:

Never again have the trouble to mirror the CLDR JSON data yourself. Bower can be used for that. By you (developing it), or by users (using it).
Avoid having CLDR data duplicated on each i18n library. Using the same example above (end application that depends on both ecma-402 and globalize), having both libraries to use the same peer dependency avoids having the same data duplicated on both libraries. It's avoided to have the same CLDR data on the two libraries. Yeap, I'm being repetitive. You see, duplication is awful :P. Note it's not only a matter of saving disk space during development. But, it's also about flattening the dependencies when serving to the client.

The changes involved here are:

rm -Rf cldr

//.bowerrc
{
  "scripts": {
    "preinstall": "npm install cldr-data-downloader",
    "postinstall": "./node_modules/cldr-data-downloader/bin/download.js -i bower_components/cldr-data/index.json -o bower_components/cldr-data/"
  }
}

require.config({
  path: {
    cldr: "bower_components/cldr-data"
  }
});

Changes of the form:

- "requirejs-text/text!./cldr/supplemental/timeData.json",
+ "requirejs-text/text!cldr/supplemental/timeData.json",

Avoid duplicating CLDR logic (and save runtime memory)

This is about a step further and it's about our both projects using the same foundation low level CLDR traverser cldrjs. But, I leave this for a next iteration after we talk about the above.

Obviously, my PR isn't complete. I want to discuss before implementing. Fixes #66

cjolif commented 9 years ago

@rxaviers this is definitely going in an interesting direction. However if I'm not mistaken @JCEmmons is post-processing the CDLR for ecma402 use (see https://github.com/ibm-js/ecma402/tree/master/cldr/config), so how do you envision this to work here? I suppose this would require yet another post-install hook?

rxaviers commented 9 years ago

The post-processing described in cldr/config has the following goal, as I've interpreted it: avoid user to download unnecessary CLDR fields. Please, correct me if I'm wrong.

Given the above concern, I assume we're all in the same page about going against data duplication, or we would end up counting coffee sugar calories, but eating a whole cake later. :P

The fine tuning is indeed a valid goal. Not only for ibm-js/ecma402, but for all i18n projects. So, why don't we extend this for all of them? We could establish a declarative way to define the "field's filter" and have the post-install hook to handle it as well.

To the fine tuning debate, I add one more question. Can we fine tune based on usage? E.g., a user formatting hours (not dates) new Intl.DateTimeFormat("en", {hour: "2-digit", minute: "2-digit"}) perhaps doesn't need a bunch of CLDR fields. Going with the post-install hook approach, the default could be picking all fields the library needs. But, user could customize that filter for the pieces he actually needs if he will.

Please, just let me know on any questions.

JCEmmons commented 9 years ago

I'm willing to take a look at this, but to be honest, I haven't had much time to do so yet. Now that CLDR 26 is out the door, perhaps I can carve out some time to look at better ways to do this, as Rafael suggests. For the time being, I have updated all of our data here to the 26 release, and updated a few testcases accordingly, so we should be all current for the next 6 months or so.

cjolif commented 9 years ago

The post-processing described in cldr/config has the following goal, as I've interpreted it: avoid user to download unnecessary CLDR fields. Please, correct me if I'm wrong.

Right we also ignore some of the .json files entirely.

The fine tuning is indeed a valid goal. Not only for ibm-js/ecma402, but for all i18n projects. So, why don't we extend this for all of them? We could establish a declarative way to define the "field's filter" and have the post-install hook to handle it as well.

That's definitely a good idea.

To the fine tuning debate, I add one more question. Can we fine tune based on usage? E.g., a user formatting hours (not dates) new Intl.DateTimeFormat("en", {hour: "2-digit", minute: "2-digit"}) perhaps doesn't need a bunch of CLDR fields. Going with the post-install hook approach, the default could be picking all fields the library needs. But, user could customize that filter for the pieces he actually needs if he will.

I suppose that would be a second step. I guess the first step is to make the post-install hook works with all the fields required by a given library. A second step could be to offer fine grain tuning on what the user is exactly leveraging the library.

One concern I have though is, if each lib as different post install hook, then on the client we will end up with duplicating the data again? Or should the various post install hooks be able to work collaboratively and come up with a version that covers all the libraries used in a given application? (not sure how to achieve that).

rxaviers commented 9 years ago

I suppose that would be a second step. I guess the first step is to make the post-install hook works with all the fields required by a given library. A second step could be to offer fine grain tuning on what the user is exactly leveraging the library.

Exactly.

One concern I have though is, if each lib has different post install hook, then on the client we will end up with duplicating the data again? Or should the various post install hooks be able to work collaboratively and come up with a version that covers all the libraries used in a given application? (not sure how to achieve that).

The post install hook is set on a given application. Each library only declares its CLDR needs. Exemplifying: https://gist.github.com/rxaviers/dbd2e56840c5c7a508e8

cjolif commented 9 years ago

Thanks for the sample.

I guess cldr.json vs bowser.js with cldr key is an implementation detail and we can sort out everything and decide on that later.

The important point is making sure the filtering syntax is expressive enough to cover current use-cases like the following https://github.com/ibm-js/ecma402/blob/master/cldr/config/ecma402_cldr_ca_buddhist_config.txt as well as making sure the syntax makes the merge operation as easy as possible.

rxaviers commented 9 years ago

The important point is making sure the filtering syntax is expressive enough to cover current use-cases like the following https://github.com/ibm-js/ecma402/blob/master/cldr/config/ecma402_cldr_ca_buddhist_config.txt as well as making sure the syntax makes the merge operation as easy as possible.

:+1:

rxaviers commented 9 years ago

Considering we are all ok with the idea of "Avoid duplicating CLDR data" according to our last meeting, I will give this PR a next round of changes and will let you know.

rxaviers commented 9 years ago

Two supplemental files cannot be find in any CLDR v26 json zip files:

supplemental/aliases.json
supplemental/localeAliases.json

Am I missing something, are they available somewhere? Should we update http://unicode.org/cldr/trac/ticket/7968 with that request?

rxaviers commented 9 years ago

I've pushed a new commit that addresses the CLDR data duplication. It implements what's described in "Avoid duplicating CLDR data everywhere" above. It's not finished though. These items need to be addressed:

Fix failing tests, :sos: (I need someone to take a look at this, see below)
Find supplemental/aliases.json or supplemental/localeAliases.json. :sos: (They're not present on any json zip files, where can we find them?)
post-install hook to filter all the fields required by a given library (cldr/config now lives in cldr-extra/config)
- define the filtering syntax.
- implement it.

I will bring an initial definition for the filtering syntax above. I need a help on the two :sos: items above.

Failing tests:

Warning: Test main - testAbstractOperations - matcherFunctions FAILED on firefox 32.0.3 on LINUX:
AssertionError: BestFitMatcher() should return the correct locale for language tag "en-US": expected 'en-US' to equal 'en'
No stack
Test main - 9.1 - Test_9_1_b FAILED on firefox 32.0.3 on LINUX:
AssertionError: Locale zh-Hans-CN is supported, but fallback zh-CN isn't.: expected -1 to not equal -1 (Testing with NumberFormat.)
No stack Use --force to continue.

rxaviers commented 9 years ago

@JCEmmons, where can I find supplemental/aliases.json or supplemental/localeAliases.json in the CLDR json zip files please?

rxaviers commented 9 years ago

I will bring an initial definition for the filtering syntax above

I've extended https://gist.github.com/rxaviers/dbd2e56840c5c7a508e8 to include a proposal for the fields filtering (see it at the end of the gist). Is there any currently-in-use-configuration not covered?

clmath commented 9 years ago

Needs #73 (done, needs to land on master)

See https://github.com/ibm-js/ecma402/issues/73#issuecomment-61272293.

rxaviers commented 9 years ago

See #73 (comment).

It could be skipped then.

@JCEmmons, I still need your answer to the question above.

@clmath, I need your help to sort out the failing tests above.

clmath commented 9 years ago

Hi Rafael,

After my testing of your fork, I am pretty sure the problem with the tests is that ecma402 is relying on a custom version of the cldr while cldr-data install the official version. So some locales are in ecma402/cldr but not in cldr-data and vice versa which lead to different locale fallback causing the tests to fail. There is also (at least) one hardcoded list of locales in ecma402 tests that may also need to be changed.

I have no idea of why ecma402 is not using the official cldr data and I don't have enough knowledge in locales fallback to correct the tests. So I think we need your help @JCEmmons.

rxaviers commented 9 years ago

Thank you @clmath.

JCEmmons commented 9 years ago

The supplemental/aliases.json and supplemental/localeAliases.json are alias files that aren't in the official CLDR distribution. But they probably need to be added to it. For the time being, you can pull them from https://github.com/JCEmmons/cldr-ecma402 until they are added to CLDR's configuration.

JCEmmons commented 9 years ago

In response to @clmath 's comment "I have no idea of why ecma402 is not using the official cldr data and I don't have enough knowledge in locales fallback to correct the tests. " - Is that the "official cldr data" has a lot of fields that ECMA-402 doesn't need. So when I generate the customized data for ECMA-402, I strip out the stuff that we don't use. Note that the program that generates the official data and the one that generates the custom data are the same program, so the JSON keys and fields should be compatible ( although to be honest, I haven't fully tested this yet ).

I'm in the process of seeing what it would take in order to make ECMA-402 work properly with Rafael's "cldr-data". We give up some performance by loading more data, but the amount of additional data may be such that we don't care ( or can be filtered in some other fashion ).

rxaviers commented 9 years ago

The supplemental/aliases.json and supplemental/localeAliases.json are alias files that aren't in the official CLDR distribution. But they probably need to be added to it

@JCEmmons can you handle that? Do you need help from anyone in order to get that done?

clmath commented 9 years ago

So when I generate the customized data for ECMA-402, I strip out the stuff that we don't use.

I think there is more than that as the list of available locales is different between cldr-data and cldr-ecma402 (for example cldr-data provides zh-Hans-SG while cldr-ecma402 provides zh-SG). If this is relevant we have to make sure that the filtering mecanism from cldr-data take this into account.

rxaviers commented 9 years ago

It seems to me like a different content. If this is the case, I wouldn't encourage cldr-data to change what's offered by Unicode. Let's have it fixed upstream instead (i.e., having zh-SG instead of zh-Hans-SG on json-full.zip).

Related question, json-full.zip brings duplicate identical data, e.g., en and en-US. They are identical given likelySubtags. Also, removing en-US likely subtags gives en. @JCEmmons, please correct me if I'm wrong. Why are they included? Shouldn't en suffice? This also applies to pt vs. pt-BR and probably a bunch of others.

rxaviers commented 9 years ago

@clmath

There is also (at least) one hardcoded list of locales in ecma402 tests that may also need to be changed.

Is this change needed to get this working or is it an improvement that could be tackled in a separate issue?

clmath commented 9 years ago

@rxaviers I think it should be tracked in this issue as it is related to the list of locales available in ecma402. We will need to udpate it to match the final list of available locale so they are all tested.

rxaviers commented 9 years ago

Ok, I added this item up to the description as well then.

rxaviers commented 9 years ago

About the failing tests, @clmath has already pointed out some customizations made by ecma402 which are not present (or different) in the official CLDR, which may be causing the issue.

We need to list those differences that is causing problems. So, we can either: (a) have CLDR changed accordingly, or (b) have ecma402 code changed to conform with official CLDR.

@clmath @JCEmmons ideas?

clmath commented 9 years ago

I think that is @JCEmmons call.

On a side note I updated the description to keep the link to the locale list.

rxaviers commented 9 years ago

On a side note I updated the description to keep the link to the locale list.

:+1:

JCEmmons commented 9 years ago

Yes, I can handle it. See http://unicode.org/cldr/trac/ticket/8040

Regards,

John C. Emmons Globalization Architect & Unicode CLDR TC Chairman IBM Software Group Internet: emmo@us.ibm.com

From: Rafael Xavier de Souza notifications@github.com To: ibm-js/ecma402 ecma402@noreply.github.com Cc: John Emmons/Austin/IBM@IBMUS Date: 11/18/2014 06:30 AM Subject: Re: [ecma402] Fixes #66 (WIP) (#68)

  The supplemental/aliases.json and supplemental/localeAliases.json are
  alias files that aren't in the official CLDR distribution. But they
  probably need to be added to it

@JCEmmons can you handle that? Do you need help from anyone in order to get that done?

— Reply to this email directly or view it on GitHub.

rxaviers commented 9 years ago

Yes, I can handle it. See http://unicode.org/cldr/trac/ticket/8040

:+1:

JCEmmons commented 9 years ago

In answer to your question about en-US vs. en, pt-BR vs. pt, etc. - yes they can and should be identical. And if your software is smart enough to do removal of likely subtags, you don't need en-US, fr-FR, etc. When we put together the original zip files, we tried not to make any assumptions about what people should and shouldn't do with the data. We didn't want people to have to implement the inheritance mechanism as defined by tr35 just to use the locale data.

ibm-js / ecma402