nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

Adopt globalize for i18n #1494

Closed mikeal closed 9 years ago

mikeal commented 9 years ago

The jQuery team, along with several other people working on i18n and JS standards, has put together a new library for i18n.

https://github.com/jquery/globalize

We've had a lot of conversations in the TC about how taking on ICU is "too big" for core, and that we would prefer a more modular approach that lets us load language support modularly, but nobody has written this yet. Globalize would appear to be at least part of this solution.

Thoughts?

jasnell commented 9 years ago

@srl295

jasnell commented 9 years ago

My first question would be why? While globalize is a great project, there is already an EcmaScript standard Intl interface that is supported in V8 based on ICU and the Intl stuff has already been switched on in Node v0.12.x. @srl295 has gone to significant lengths to minimize the default footprint of ICU and to modularize the data to make it possible to use npm to install the additional CLDR/ICU data files. Globalize is fantastic to supplement the functionality currently not supported by the EcmaScript Intl API but developers can already make use of that without io.js or node.js having to do anything in core.
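For context, here is a minimal sketch of the standard Intl API referred to above, assuming a runtime built with Intl support. The sample values and the `en-US` locale are illustrative only; other locales format correctly only when the matching ICU data is available to the process.

```javascript
// Sketch: formatting with the ECMAScript Intl API (ECMA-402).
// Assumes the runtime was built with Intl support; the values are illustrative.
const amount = 1234567.891;

// Number formatting with locale-specific grouping and decimal separators.
const enNumber = new Intl.NumberFormat('en-US').format(amount);

// Date formatting; the time zone is pinned so the result does not depend on the host.
const date = new Date(Date.UTC(2015, 3, 22));
const enDate = new Intl.DateTimeFormat('en-US', {
  year: 'numeric',
  month: 'long',
  day: 'numeric',
  timeZone: 'UTC',
}).format(date);

console.log(enNumber);
console.log(enDate);
```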

piscisaureus commented 9 years ago

:+1:

Some remarks:

piscisaureus commented 9 years ago

there is already an EcmaScript standard Intl interface that is supported in V8 based on ICU and the Intl stuff has already been switched on in Node v0.12.x.

But it has deliberately not been turned on in io.js, because we weren't happy with the solution.

@srl295 has gone to significant lengths to minimize the default footprint of ICU and to modularize the data to make it possible to use npm to install the additional CLDR/ICU data files.

Although it is now possible to fetch those data files with npm, it hasn't really been modularized. Node needs to be started up with particular command line arguments (or with an environment variable set), by which the ICU data that will be used globally is specified. It's not possible for a module that needs ICU data to load it on demand.
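To illustrate the limitation, a rough sketch of what a module can do today: it can inspect the ICU data the process was started with, but it cannot add to it. `localeIsAvailable` is a hypothetical helper written for this example, and `NODE_ICU_DATA` is assumed to be the environment variable in question.

```javascript
// Sketch: a module can only inspect the ICU data the process was started with.
// `localeIsAvailable` is a hypothetical helper, not part of any library.
function localeIsAvailable(locale) {
  if (typeof Intl === 'undefined') return false; // runtime built without Intl
  // supportedLocalesOf returns the subset of the requested locales
  // that the runtime has data for.
  return Intl.DateTimeFormat.supportedLocalesOf([locale]).length > 0;
}

console.log('ICU version:', process.versions.icu);        // undefined without ICU
console.log('NODE_ICU_DATA:', process.env.NODE_ICU_DATA); // set before startup, if at all
console.log('en available:', localeIsAvailable('en'));
console.log('de available:', localeIsAvailable('de'));
```

If a locale turns out to be missing, the module has no way to load it at that point; the data has to be supplied when the process starts.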

jasnell commented 9 years ago

Ok, that's fine. Additional modularization is something that can be explored, but let's not discount the work that's been done so far. Incremental improvement is A Good Thing.

So, right now, for very important performance reasons, ICU does a one-time initialization of its data files and memory maps everything. The downside, as you point out, is that the data files have to be specified at startup time, with modules getting whatever they get from the environment.

So let's explore what this "load on demand" would mean...

Globalize currently depends on 'cldr-data'. When you npm install cldr-data, it goes out and downloads all of the CLDR data:

bash-3.2$ npm install cldr-data
npm WARN package.json a@ No description
npm WARN package.json a@ No repository field.
npm WARN package.json a@ No README data
npm WARN package.json http-problem@0.0.1 No repository field.
\
> cldr-data@27.0.3 install /Users/james/tmp/node_modules/cldr-data
> node install.js

GET `https://github.com/unicode-cldr/cldr-core/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-dates-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-cal-buddhist-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-cal-chinese-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-cal-coptic-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-cal-dangi-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-cal-ethiopic-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-cal-hebrew-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-cal-indian-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-cal-islamic-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-cal-japanese-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-cal-persian-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-cal-roc-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-localenames-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-misc-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-numbers-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-segments-modern/archive/27.0.3.zip`
GET `https://github.com/unicode-cldr/cldr-units-modern/archive/27.0.3.zip`
  [========================================] 17271320/17264454 100% 0.0s
Received 28753K total.
Unpacking it into `./`
cldr-data@27.0.3 node_modules/cldr-data
└── cldr-data-downloader@0.2.2 (progress@1.1.8, q@1.0.1, adm-zip@0.4.4, request-progress@0.3.1, nopt@3.0.1, mkdirp@0.5.0, npmconf@2.0.9, request@2.53.0)
bash-3.2$ 

Only once everything is downloaded can you load it in a "modular" way, by cherry-picking exactly which downloaded files to load into memory. Regardless of what you end up pulling in using require, you have to download everything. (btw, running du -hs node_modules/cldr-data shows 239M)

jasnell commented 9 years ago

Further, it's not exactly clear how the "load on demand" model would actually work here. Globalize is written the way it is in order to avoid loading the entire CLDR dataset over a client-side connection. However, when used on the server side in node, we end up downloading the entire set anyway as part of the cldr-data installation (the ecma402 shim is no different in this regard). So it's not exactly clear what the advantage is on the server side. Perhaps you could take a few minutes to draw out how the load on demand model would / should work?

rxaviers commented 9 years ago

Globalize currently depends on 'cldr-data'

Nope. Globalize uses whatever CLDR source you provide, not necessarily from cldr-data (note it's listed as a peer dependency, not a direct dependency). Therefore, although you can use cldr-data for convenience, you don't need to. For example, one could use https://github.com/unicode-cldr/ as source.

rxaviers commented 9 years ago

Further, it's not exactly clear how the "load on demand" model would actually work here

Globalize needs CLDR content to function properly, although it doesn't embed or host such content. Instead, Globalize empowers developers to load CLDR data the way they want. Vanilla CLDR in its official JSON format (no pre-processing) is expected to be provided (via Globalize.load(<json>)). Developers can use up-to-date CLDR data directly from Unicode as soon as it's released, without having to wait for any pipeline on our side.
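As a rough illustration of this "bring your own CLDR data" pattern, here is a self-contained sketch. `CldrStore` and `merge` are hypothetical names and are not Globalize's API; the inline objects are small stand-ins shaped like CLDR's JSON numbers data, which an application would normally read from its own copy of the CLDR files.

```javascript
// Sketch of the "bring your own CLDR data" pattern. `CldrStore` is hypothetical
// and is NOT Globalize's API; it only illustrates loading data on demand.
class CldrStore {
  constructor() {
    this.data = {};
  }

  // Deep-merge a chunk of CLDR-shaped JSON into the store.
  load(json) {
    merge(this.data, json);
    return this;
  }

  // Look up a value by path, e.g. ['main', 'de', 'numbers', ...].
  get(path) {
    return path.reduce((node, key) => (node == null ? undefined : node[key]), this.data);
  }
}

function merge(target, source) {
  for (const [key, value] of Object.entries(source)) {
    if (value && typeof value === 'object' && !Array.isArray(value)) {
      if (!target[key] || typeof target[key] !== 'object') target[key] = {};
      merge(target[key], value);
    } else {
      target[key] = value;
    }
  }
}

// Inline stand-ins for per-locale CLDR JSON files.
const symbols = 'symbols-numberSystem-latn';
const enNumbers = { main: { en: { numbers: { [symbols]: { decimal: '.', group: ',' } } } } };
const deNumbers = { main: { de: { numbers: { [symbols]: { decimal: ',', group: '.' } } } } };

const store = new CldrStore().load(enNumbers); // only English is loaded up front
store.load(deNumbers);                         // German is added later, on demand

console.log(store.get(['main', 'de', 'numbers', symbols, 'decimal']));
```

The point of the design is that the application, not the library, decides which locale files are read and when.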

I'm happy to answer any Globalize questions. Please just let me know if I can help with something.

jasnell commented 9 years ago

Ok, that's fair (and thanks for that reminder @rxaviers ). Like I said, better modularization in ICU is definitely something that can be worked on. I'm just not exactly clear what the overall benefit would be by having io.js "adopt" globalize vs. incrementally improving the icu based solution, particularly given the V8 support that already exists and given that there's absolutely nothing stopping developers from already using globalize if they want. In other words, why would io.js need to do anything in core with regards to globalize?

jasnell commented 9 years ago

(btw, my apologies for misspeaking... I'd actually forgotten that cldr-data was a peer-dependency)

piscisaureus commented 9 years ago

Like I said, better modularization in ICU is definitely something that can be worked on.

It seems that it would require some serious work to libicu such that multiple "instances" can be constructed (as opposed to there being one singleton instance).

Are there any plans to that end? A quick search of the website/mailing list didn't turn up anything (there was some discussion around ICU4J but not for the c++ implementation).

jasnell commented 9 years ago

@srl295 would be able to say for certain how much work would be involved, but the "fix" would be to allow multiple core data files, one for each locale. The --icu-data-dir mechanism already allows multiple paths to be specified; the challenge is that the ICU data loader stops on the first core file found. That's the change we'd need to make. It's definitely something I could look into doing.

On Apr 22, 2015 12:55 PM, "Bert Belder" notifications@github.com wrote:

Like I said, better modularization in ICU is definitely something that can be worked on.

It seems that it would require some serious work to libicu such that multiple "instances" can be constructed (as opposed to there being one singleton instance).

Are there any plans to that end? A quick search of the website/mailing list didn't turn up anything (some discussion around ICU4J but not for the c++ implementation).

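The difference between "stop on the first core file found" and "use every core file found" can be sketched as follows. This is not ICU's loader; the directory names, the placeholder file name, and the `exists` callback are all hypothetical and exist only to contrast the two lookup strategies.

```javascript
// Sketch: "stop at the first data file" versus "collect every data file".
// The directories, the file name and the `exists` callback are hypothetical;
// this is not ICU's loader, only an illustration of the two strategies.
const DATA_FILE = 'icudt.dat'; // placeholder name

// Behaviour as described above: the first directory containing the file wins.
function findFirst(dirs, exists) {
  for (const dir of dirs) {
    const candidate = `${dir}/${DATA_FILE}`;
    if (exists(candidate)) return [candidate];
  }
  return [];
}

// Proposed behaviour: keep every match so per-locale files could be combined.
function findAll(dirs, exists) {
  return dirs.map((dir) => `${dir}/${DATA_FILE}`).filter((p) => exists(p));
}

// An in-memory stand-in for the file system.
const present = new Set(['/data/en/icudt.dat', '/data/de/icudt.dat']);
const exists = (p) => present.has(p);
const dirs = ['/data/en', '/data/missing', '/data/de'];

console.log(findFirst(dirs, exists));
console.log(findAll(dirs, exists));
```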

mikeal commented 9 years ago

It looks like there's a lot of ecosystem work going on around i18n.

With so much going on I think it's a bad idea to "pick one." It would probably be best to find a way for developers to bind the library of their choice to Intl in userland. I don't know how doable this is but maybe it's time we ping the v8 team about this.

rxaviers commented 9 years ago

Definitely, there are. A little more about that farm in https://github.com/rxaviers/javascript-globalization/

@mikeal, could you please describe in a little more detail which i18n support io.js needs?

Fishrock123 commented 9 years ago

Converging is going to require us to turn on some sort of Intl by default, and since we already have most of that (just not on by default) and that's where the work is going to be, I'm going to close this out and defer to https://github.com/nodejs/node/issues/26. Re-open if necessary, though.