I vote for full-icu to be the default. It seems to be how every other JS environment operates, and not having it just causes issues with Node. Tests run everywhere, but then fail on the CI service because of this.
+1 for full-icu by default.
+1 for full-icu by default.
This topic came up during the i18n session at the OpenJS summit, and as a follow-up I put together this alternative to packaging the full ICU data into the initial binary, while still retaining access to all locales:
https://github.com/eemeli/proposal-intl-loadlocales
So the idea here is to make it possible to lazy-load locale data into the Intl object. The actual loading process and data structure is completely hidden from the application-level API, allowing it to be configured and cached separately.
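To make that concrete, here is a hypothetical sketch of how an application might use such an API. `Intl.loadLocales` is an assumed name derived from the proposal's title, not a finalized API; `supportedLocalesOf` is the existing standard negotiation method:

```js
// Hypothetical usage sketch; Intl.loadLocales is an assumed entry point.
async function getCurrencyFormatter(locale) {
  if (Intl.NumberFormat.supportedLocalesOf([locale]).length === 0) {
    // Lazily fetch and register the missing locale data. Where the data
    // comes from and how it is cached would be configured by the host.
    await Intl.loadLocales([locale]);
  }
  return new Intl.NumberFormat(locale, { style: 'currency', currency: 'CAD' });
}
```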
Given the way icu initializes data, it's not clear if this is possible. What information will be lazy loaded and what is the source?
Is there any major disagreement to doing this, and is anything blocking it?
Also: can anyone tell me where I can find a list of the subset of locales that are included in Node by default?
@srl295 can you comment on the default locales?
The default subset is: English
What's blocking is consensus to increase the repo size when there are already workarounds.
> The default subset is: English
It seems like it's just U.S. English (en-US). Taking this example:
```js
const currencyFormatter = new Intl.NumberFormat('en-CA', {
  style: 'currency',
  currency: 'CAD',
  minimumFractionDigits: 0,
  maximumFractionDigits: 2,
});
console.log(currencyFormatter.format(0.02));
```
In Chrome (while in Canada) I get the expected result:
$0.02
But with Node I get:
CA$0.02
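The difference comes from locale fallback: a small-icu build ships only `en` data, so the `en-CA` request silently resolves to `en`. You can observe this directly:

```js
// Prints 'en' on a small-icu build, 'en-CA' when full ICU data is present.
console.log(new Intl.NumberFormat('en-CA').resolvedOptions().locale);
```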
It would be a lot nicer for everyone outside of the U.S. if Node included full-icu by default.
@nwoltman You are correct. It is `en`, which is effectively `en-US`.
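For anyone wanting to check at runtime which flavour they are on, a detection snippet along the lines of the one suggested in the Node.js docs works: format a date in a non-English locale and see whether real locale data was used.

```js
// true when full ICU data is available: with small-icu, the 'es' formatter
// falls back to English and prints 'January' instead of 'enero'.
const hasFullICU = (() => {
  try {
    const january = new Date(9e8); // a date in January 1970
    const spanish = new Intl.DateTimeFormat('es', { month: 'long' });
    return spanish.format(january) === 'enero';
  } catch (err) {
    return false;
  }
})();
```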
> It would be a lot nicer for everyone outside of the U.S. if Node included full-icu by default.
Hey, it could be nicer for people inside the US also!
So there are a couple of issues, depending on what you mean by 'default'. (some of these are discussed above, and in the original tickets going back to 2013)
1. Does `./configure` default to `--with-intl=full-icu --download=all`? What needs something downloaded?
2. Including the full data means the git repo is larger, and grows by a larger amount every year… and possibly means turning on git-lfs.
3. Do you get the full icu dataset when you download node?

These 3 are somewhat independent, and there are downsides:
- adding a full icu download is less nice for the people who run the download site.
Counterpoint: Not having a full icu download is less nice for people who need it.
- making there be only a full icu download is less nice if people are bandwidth constrained.
Is there any data to verify this point? Who would be "bandwidth-constrained" enough that they couldn't download a few more MBs?
- making a larger git repo is less nice for devs.
This seems to be one of the biggest sticking points on this issue. It's definitely a tradeoff. Can anyone estimate how much "less nice" the repo will be for devs if the full icu data were added to it (both with and without git-lfs)?
Also, would making things less nice for Node devs perhaps be worth it to make things more nice for app devs?
- adding complexity to the installer could be less nice, especially towards download speeds.
Again, how much "less nice" would this actually be? Is it possible to estimate how much slower downloads will be? For me, adding 15 MB slows things down by 5 seconds, which definitely isn't a big deal (especially when compared to running `npm install`).
Overall, my preference would be for the default Node.js binary to come with full icu (which would be great since that would make Node more consistent with browsers) and I have no opinion on installers/packagers providing a small-icu version (since I would never use it).
> Is there any data to verify this point?
I did an analysis last year in https://github.com/nodejs/node/issues/19214#issuecomment-374216527. You can probably add a few percent because ICU gets bigger, not smaller, as time goes on.
Thanks @bnoordhuis, that data is awesome!
Based on that information, it looks like we can conclude the following:
> A new git checkout would be about 11M bigger but it's already on the order of ~450M.
So it sounds like git-lfs would not be needed, and the increase to the repo size wouldn't have a noticeable impact.
> negligible on platforms that support the `.incbin` assembler directive... Probably even a little faster than it is now because there's no post-processing. On Windows we have to do a bin2c conversion that produces a ~60M source file and takes 15-30 seconds to compile.
So compile times would not change or might improve for people on Unix systems. The slowdown on Windows would be unfortunate, but my guess is that most people would be compiling Node on a Unix-like system.
> Source tarballs would be about 5.5M or about 30% bigger than they are now.
5.5M isn't very much. This probably wouldn't have an impact on most downloaders.
Based on that data, it sounds like the impact of switching to full-icu by default would be mostly negligible (with the only noticeable downside being slower compile times on Windows).
Nathan, how do you usually install node?
Bandwidth constraints are one; small devices are another.
The bandwidth bill for https://nodejs.org/ might be another issue.
> The bandwidth bill for https://nodejs.org/ might be another issue.
use cloudflare
> Nathan, how do you usually install node?
Either downloading from the website or with the `apt` package manager (or sometimes relying on the Node.js Docker image). I understand that the increase in bandwidth won't affect me personally, so I am probably unaware of how it would affect other people as well as the services that need to distribute the binary.
The main question is, who exactly would be affected and by how much? With that answered, the negatives of switching to full-icu by default could be compared to the positives, and then things could move forward (or at least this issue could be closed).
@zuohuadong That's an interesting point. https://nodejs.org/ actually does use Cloudflare. Although, I'm not seeing a `Cache-Control` header when downloading the binaries, so I'm not sure if every binary download request is actually going to the nodejs.org servers.
I don't think we use CloudFlare for the downloads because we've not had time to do the work that it would take to continue to capture the download metrics. @rvagg can correct me if I'm wrong.
@nodejs/tsc I'm going to add this to the agenda so we can get some feedback on what people think about potentially increasing the download size. If we get enough feedback on the issue before the next meeting we can remove it from the agenda.
We don't use Cloudflare for /download/, but if this goes through it would be a good incentive to get around to sorting out the blockers for that.
Hi there, just want to share my POV (primarily from working experience) about the shortcomings of distributing Node w/o full-icu:
1. `full-icu` has to dynamically download icu4c data based on OS, and a lot of corporate CIs don't allow Internet access and have everything vendored, either via a mirror or checked in. This creates an issue in the build pipeline for products using Node that support multiple OSes & versions (see the workaround sketch below).
2. The same goes for build systems like `bazel` (Google & DBX happen to use it). This also becomes very hard to scale, as devs have to do so for their local dev envs as well.
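For reference, the no-download workaround available today is to vendor the data file and point Node at it via the documented `NODE_ICU_DATA` environment variable or `--icu-data-dir` flag; the paths and data file name below are illustrative, not fixed:

```sh
# Check the ICU data file into the repo (its name varies by ICU version),
# then point Node at it so nothing needs to be fetched at install/CI time.
export NODE_ICU_DATA=./vendor/icu/icudt64l.dat
node app.js

# Equivalent per-invocation form:
node --icu-data-dir=./vendor/icu app.js
```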
I do want to echo the concern mentioned earlier about increasing the binary size by 40%. As others have stated, this may have negative effects on cold start times for containers, which would affect serverless environments.
I've reached out to some of the engineers internally at google to get their take on how drastic a change like this could be.
Thanks for the context! I'd love to push for distributing a full-icu version in addition to the default. I understand that this creates extra work in the distribution phase but would be curious to learn what the LOE on that would be.
> I do want to echo the concern mentioned earlier about increasing the binary size by 40%. As others have stated, this may have negative effects on cold start times for containers, which would affect serverless environments.
I don't expect that to be an issue. It's static read-only data that gets paged in on demand. You pay for what you use.
@bnoordhuis the design is that you only pay for what you use, and that it is demand paged. In a multitenant environment, it's probably better to share a larger `node` binary among multiple tenants, rather than have each tenant have their own downloaded copy of the `full-icu` data, which I would guess won't be shareable even if they are identical copies. I don't know how this translates to serverless, but I would imagine the larger `node` binary ought to be shareable. The data in question is marked as RO in the segment (pure text) for this reason.
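A rough way to observe the pay-for-what-you-use behaviour on a full-icu build (a sketch; the numbers are machine- and build-dependent):

```js
// rss should stay near baseline until non-default locale data is actually
// touched, and then grow only by roughly the pages that get used.
const before = process.memoryUsage().rss;
const fmt = new Intl.DateTimeFormat('ja-JP', { month: 'long' });
fmt.format(new Date());
const after = process.memoryUsage().rss;
console.log(`rss delta: ${((after - before) / 1024).toFixed(0)} KiB`);
```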
We did discuss this in the TSC meeting today. From the minutes:
> Building with full-icu by default #19214
> - Some concerns expressed last week.
> - Michael to take action to send email on TSC email list to ask if there are objections to suggesting a PR be opened to make full-icu the default, and to complete the conversation there.
>
> One thing that was mentioned was that good numbers for the impact on the following will be needed:
> - binary size
> - code size
@mhdawson Is https://github.com/nodejs/node/issues/19214#issuecomment-374216527 sufficient? It's from last year but the numbers won't have changed dramatically in that time.
@bnoordhuis thanks. From https://github.com/nodejs/node/issues/19214#issuecomment-374216527, the bottom line was:

> A new git checkout would be about 11M bigger but it's already on the order of ~450M. Source tarballs would be about 5.5M or about 30% bigger than they are now.
The remaining numbers that I think would be good to have in the summary are the additional size on disk once extracted, and the in-memory impact if you don't use any additional languages, which from some of the discussion will not be the same as the additional size on disk.
Numbers I got today (Linux x64):
| ICU | node binary size | loaded memory | execution time |
|---|---|---|---|
| small | 43MB | ~28MB | ~0.02s |
| full | 66MB | ~28MB | ~0.02s |
To measure loaded memory, I ran `./node -p 'process.memoryUsage().rss / 1024 / 1024'`.
To measure execution time, I ran `time ./node -e '1+1'`.
@targos thanks. I guess the other number is the size of the release tarball (which I assume is smaller than the node binary size since it's compressed), as what @bnoordhuis provided was the source tarball sizes.
Does full icu need to be in the repo? Could it be the default for downloads but not for builds from source?
Given those numbers, I'm in favor of just bundling full-icu by default.
FWIW ICU releases are generally on a 9 month cycle. v8 is pretty aggressive about picking up new versions due to new ecma402/ecma262 requirements.
sgtm for full icu.
If full ICU is in the node repo, would the plan be to move to git-lfs? Or just check in the ~26M file as a binary blob `icudt*.dat` without lfs? (~6M with xz, ~10M with gzip, ~9.5M with bz2)
I strongly prefer that we simply check in the (compressed) blob. git-lfs == extra friction.
We can use a python script to decompress it if we don't want a dependency on gunzip(1) or bunzip2(1).
There's https://docs.python.org/3/library/lzma.html; can we use it?
Unfortunately no. `lzma` is Python 3 only.
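For illustration, a minimal sketch of such a build-time helper, assuming a gzip-compressed blob and sticking to modules available in both Python 2 and 3 (file names are examples):

```python
# decompress_icu.py: sketch of a build-time helper. gzip and shutil exist
# in both Python 2 and 3, unlike lzma.
import gzip
import shutil
import sys

def extract(src, dst):
    # e.g. extract('deps/icu/icudt.dat.gz', 'deps/icu/icudt.dat')
    with gzip.open(src, 'rb') as fin, open(dst, 'wb') as fout:
        shutil.copyfileobj(fin, fout)

if __name__ == '__main__':
    extract(sys.argv[1], sys.argv[2])
```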
Does building with full-icu impact heapsnapshot sizes, or memory overhead while taking one?
It shouldn't. ICU data is memory mapped and not allocated. It may bump rss up but should have no impact on heap or generation of heapsnapshots
Maybe this is too radical, but you could consider turning off .gz offerings and forcing .xz only from the point you start offering full-icu onward, that would save on storage and a little release build time, while encouraging users toward a bandwidth saving option too.
Discussed briefly on the TSC call today. The arguments for or against enabling full-icu by default have not really changed much. If we ignore the binary download size issue for just a minute, one of the more significant issues is the amount of reserved runtime memory required by memory mapping the full ICU data set. One compromise approach we could take would be:
- Bundle the full-icu data with the node.js distribution such that it is always available but not used by default.
- Enable the use of full-icu with a command line switch.
This approach would be nearly identical to the existing approach using the full-icu npm distribution but would bundle the full-icu data set rather than having it installed manually. The node.js binary would also Just Know™️ where to look for the data set rather than the user having to provide the full path.
Longer term, what would really help advance this along would be making changes internally to ICU that allow incremental on-demand loading of ICU data subsets. It's been talked about for quite some time but has not yet been implemented. Maybe (ahem, google) some company (ahem, microsoft) could dedicate (ahem, ibm) some resources to making those changes long term?
@jasnell
> the amount of reserved runtime memory required by memory mapping the full ICU data set
This does affect the address space, but with 'full ICU by default' the full ICU data is just a single symbol (pure text) managed by the DLL loader… unless your address space is 30M from being full, what is the impact? There should be no change to the actual active memory area unless you actually use the locales.
> Bundle the full-icu data with the node.js distribution such that it is always available but not used by default
I'm not clear on the benefit here. What does this impact?
> The node.js binary would also Just Know™️ where to look for the data set

> incremental on-demand loading of ICU data subsets
loading from disk or from network/etc?
Just to repeat from discussion on IRC:
if you build ICU with full-icu, there's no mmap() involved. The data is just a large symbol loaded from the read-only pure text segment.
Ah, right, thank you for the reminder on that. It's been a minute since I've thought about ICU related things.
The concern on the memory is specific to container and serverless environments. @MylesBorins could likely expand on that a bit more.
@jasnell I wrote to @MylesBorins on IRC… it should only impact disk/network download size (and repo size). It should not impact memory. But perhaps someone in those environments could help test?
@jasnell
> Enable the use of full-icu with a command line switch.
Perhaps an opt-out for systems where e.g. rss increase could be a significant concern (or where there are other concerns) would be a better option than an opt-in for everyone? Or perhaps we could introduce that in a staged way, with a runtime icu selection flag, and just update the default value in a minor release as we go?
So, testing the docker image https://github.com/srl295/docker-node/commit/e99930011d698295c172f6919d93f4251081ad9e:
```console
$ docker images
REPOSITORY   TAG                   …   SIZE
node         12-alpine-small-icu   …   80.3 MB
node         12-alpine-full-icu    …   105 MB
$ docker run -it --rm node:12-alpine-small-icu -p 'fs.statSync("/usr/local/bin/node").size'
45633344
$ docker run -it --rm node:12-alpine-full-icu -p 'fs.statSync("/usr/local/bin/node").size'
70282992
```
@ChALkeR

> Perhaps an opt-out for systems where e.g. rss increase could be a significant concern

There are two existing options: `configure … --with-intl=none` or `… --with-intl=small-icu`. I don't think we should remove these options. Or do you mean a runtime flag to select `small-icu`?

@srl295, no, I mean a run-time command line flag similar to the `--with-intl` configure option. Would that be viable?
Currently, users cannot rely on full i18n support to be present cross-platform and even cross-distribution, mainly because different package maintainers use different configurations for ICU, and if Node.js was built with `system-icu` one still has to have `libicu` installed. Browsers, on the other hand, generally do support full i18n out of the box.

There is the option to use the `full-icu` package, but it is somewhat awkward to use as it requires an environment variable or command-line switch to work.

Building with `full-icu` currently means a ~40% increase in binary size (on macOS, it goes from 35M to 49M). Is this an acceptable tradeoff? I'm thinking that if we build with it, ICU data should be moved in-tree so the build does not rely on external downloads.

cc: @nodejs/intl