I vote for full-icu to be the default. It seems to be how every other JS environment operates, and not having it just causes issues with Node. Tests run everywhere, but then fail on the CI service because of this.
+1 for full-icu by default.
+1 for full-icu by default.
This topic came up during the i18n session at the OpenJS summit, and as a follow-up I put together this alternative to packaging the full ICU data into the initial binary, while still retaining access to all locales:
https://github.com/eemeli/proposal-intl-loadlocales
So the idea here is to make it possible to lazy-load locale data into the Intl object. The actual loading process and data structure is completely hidden from the application-level API, allowing it to be configured and cached separately.
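To make that concrete, here is a hypothetical sketch of how an application might use such an API. `Intl.loadLocales` is an assumed name derived from the proposal's title, not a finalized API; `supportedLocalesOf` is the existing standard negotiation method:

```js
// Hypothetical usage sketch; Intl.loadLocales is an assumed entry point.
async function getCurrencyFormatter(locale) {
  if (Intl.NumberFormat.supportedLocalesOf([locale]).length === 0) {
    // Lazily fetch and register the missing locale data. Where the data
    // comes from and how it is cached would be configured by the host.
    await Intl.loadLocales([locale]);
  }
  return new Intl.NumberFormat(locale, { style: 'currency', currency: 'CAD' });
}
```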
Given the way icu initializes data, it's not clear if this is possible. What information will be lazy loaded and what is the source?
Is there any major disagreement to doing this, and is anything blocking it?
Also: can anyone tell me where I can find a list of the subset of locales that are included in Node by default?
@srl295 can you comment on the default locales?
The default subset is: English
What's blocking is consensus to increase the repo size when there are already workarounds.
> The default subset is: English
It seems like it's just U.S. English (en-US). Taking this example:
```js
const currencyFormatter = new Intl.NumberFormat('en-CA', {
  style: 'currency',
  currency: 'CAD',
  minimumFractionDigits: 0,
  maximumFractionDigits: 2,
});
console.log(currencyFormatter.format(0.02));
```
In Chrome (while in Canada) I get the expected result:
$0.02
But with Node I get:
CA$0.02
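The difference comes from locale fallback: a small-icu build ships only `en` data, so the `en-CA` request silently resolves to `en`. You can observe this directly:

```js
// Prints 'en' on a small-icu build, 'en-CA' when full ICU data is present.
console.log(new Intl.NumberFormat('en-CA').resolvedOptions().locale);
```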
It would be a lot nicer for everyone outside of the U.S. if Node included full-icu by default.
@nwoltman You are correct. It is `en`, which is effectively `en-US`.
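For anyone wanting to check at runtime which flavour they are on, a detection snippet along the lines of the one suggested in the Node.js docs works: format a date in a non-English locale and see whether real locale data was used.

```js
// true when full ICU data is available: with small-icu, the 'es' formatter
// falls back to English and prints 'January' instead of 'enero'.
const hasFullICU = (() => {
  try {
    const january = new Date(9e8); // a date in January 1970
    const spanish = new Intl.DateTimeFormat('es', { month: 'long' });
    return spanish.format(january) === 'enero';
  } catch (err) {
    return false;
  }
})();
```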
> It would be a lot nicer for everyone outside of the U.S. if Node included full-icu by default.
Hey, it could be nicer for people inside the US also!
So there are a couple of issues, depending on what you mean by 'default'. (some of these are discussed above, and in the original tickets going back to 2013)
1. Does `./configure` default to `--with-intl=full-icu --download=all`? What needs something downloaded?
2. Including the full data means the git repo is larger, and grows by a larger amount every year… and possibly means turning on git-lfs.
3. Do you get the full icu dataset when you download node?

These 3 are somewhat independent, and there are downsides:
- adding a full icu download is less nice for the people who run the download site.
Counterpoint: Not having a full icu download is less nice for people who need it.
- making there be only a full icu download is less nice if people are bandwidth constrained.
Is there any data to verify this point? Who would be "bandwidth-constrained" enough that they couldn't download a few more MBs?
- making a larger git repo is less nice for devs.
This seems to be one of the biggest sticking points on this issue. It's definitely a tradeoff. Can anyone estimate how much "less nice" the repo will be for devs if the full icu data were added to it (both with and without git-lfs)?
Also, would making things less nice for Node devs perhaps be worth it to make things more nice for app devs?
- adding complexity to the installer could be less nice, especially towards download speeds.
Again, how much "less nice" would this actually be? Is it possible to estimate how much slower downloads will be? For me, adding 15 MB slows things down by 5 seconds, which definitely isn't a big deal (especially when compared to running `npm install`).
Overall, my preference would be for the default Node.js binary to come with full icu (which would be great since that would make Node more consistent with browsers) and I have no opinion on installers/packagers providing a small-icu version (since I would never use it).
> Is there any data to verify this point?
I did an analysis last year in https://github.com/nodejs/node/issues/19214#issuecomment-374216527. You can probably add a few percent because ICU gets bigger, not smaller, as time goes on.
Thanks @bnoordhuis, that data is awesome!
Based on that information, it looks like we can conclude the following:
> A new git checkout would be about 11M bigger but it's already on the order of ~450M.
So it sounds like git-lfs would not be needed, and the increase to the repo size wouldn't have a noticeable impact.
> negligible on platforms that support the `.incbin` assembler directive... Probably even a little faster than it is now because there's no post-processing. On Windows we have to do a bin2c conversion that produces a ~60M source file and takes 15-30 seconds to compile.
So compile times would not change or might improve for people on Unix systems. The slowdown on Windows would be unfortunate, but my guess is that most people would be compiling Node on a Unix-like system.
> Source tarballs would be about 5.5M or about 30% bigger than they are now.
5.5M isn't very much. This probably wouldn't have an impact on most downloaders.
Based on that data, it sounds like the impact of switching to full-icu by default would be mostly negligible (with the only noticeable downside being slower compile times on Windows).
Nathan, how do you usually install node?
Bandwidth constraints are one; small devices are another.
The bandwidth bill for https://nodejs.org/ might be another issue.
> The bandwidth bill for https://nodejs.org/ might be another issue.
use cloudflare
> Nathan, how do you usually install node?
Either downloading from the website or with the `apt` package manager (or sometimes relying on the Node.js Docker image). I understand that the increase in bandwidth won't affect me personally, so I am probably unaware of how it would affect other people as well as the services that need to distribute the binary.
The main question is, who exactly would be affected and by how much? With that answered, the negatives of switching to full-icu by default could be compared to the positives, and then things could move forward (or at least this issue could be closed).
@zuohuadong That's an interesting point. https://nodejs.org/ actually does use Cloudflare. Although, I'm not seeing a `Cache-Control` header when downloading the binaries, so I'm not sure if every binary download request is actually going to the nodejs.org servers.
I don't think we use CloudFlare for the downloads because we've not had time to do the work that it would take to continue to capture the download metrics. @rvagg can correct me if I'm wrong.
@nodejs/tsc I'm going to add this to the agenda so we can get some feedback on what people think about potentially increasing the download size. If we get enough feedback on the issue before the next meeting we can remove it from the agenda.
We don't use Cloudflare for /download/, but if this goes through it would be a good incentive to get around to sorting out the blockers for that.
Hi there, just want to share my POV (primarily from working experience) about the shortcomings of distributing Node w/o full-icu:
1. `full-icu` has to dynamically download icu4c data based on OS, and a lot of corporate CIs don't allow Internet access and have everything vendored, either via a mirror or checked in. This creates an issue in the build pipeline for products using Node that support multiple OSes & versions (see the workaround sketch below).
2. The same goes for build systems like `bazel` (Google & DBX happen to use it). This also becomes very hard to scale, as devs have to do so for their local dev envs as well.
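For reference, the no-download workaround available today is to vendor the data file and point Node at it via the documented `NODE_ICU_DATA` environment variable or `--icu-data-dir` flag; the paths and data file name below are illustrative, not fixed:

```sh
# Check the ICU data file into the repo (its name varies by ICU version),
# then point Node at it so nothing needs to be fetched at install/CI time.
export NODE_ICU_DATA=./vendor/icu/icudt64l.dat
node app.js

# Equivalent per-invocation form:
node --icu-data-dir=./vendor/icu app.js
```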
I do want to echo the concern mentioned earlier about increasing the binary size by 40%. As others have stated, this may have negative effects on cold start times for containers, which would affect serverless environments.
I've reached out to some of the engineers internally at google to get their take on how drastic a change like this could be.
Thanks for the context! I'd love to push for distributing a full-icu version in addition to the default. I understand that this creates extra work in the distribution phase but would be curious to learn what the LOE on that would be.
> I do want to echo the concern mentioned earlier about increasing the binary size by 40%. As others have stated, this may have negative effects on cold start times for containers, which would affect serverless environments.
I don't expect that to be an issue. It's static read-only data that gets paged in on demand. You pay for what you use.
@bnoordhuis the design is that you only pay for what you use, and that it is demand paged. In a multitenant environment, it's probably better to share a larger `node` binary among multiple tenants, rather than have each tenant have their own downloaded copy of the `full-icu` data, which I would guess won't be shareable even if they are identical copies. I don't know how this translates to serverless, but I would imagine the larger `node` binary ought to be shareable. The data in question is marked as RO in the segment (pure text) for this reason.
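A rough way to observe the pay-for-what-you-use behaviour on a full-icu build (a sketch; the numbers are machine- and build-dependent):

```js
// rss should stay near baseline until non-default locale data is actually
// touched, and then grow only by roughly the pages that get used.
const before = process.memoryUsage().rss;
const fmt = new Intl.DateTimeFormat('ja-JP', { month: 'long' });
fmt.format(new Date());
const after = process.memoryUsage().rss;
console.log(`rss delta: ${((after - before) / 1024).toFixed(0)} KiB`);
```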
We did discuss this in the TSC meeting today. From the minutes:
> Building with full-icu by default #19214
> - Some concerns expressed last week.
> - Michael to take action to send email on TSC email list to ask if there are objections to suggesting a PR be opened to make full-icu the default, and to complete the conversation there.
>
> One thing that was mentioned was that good numbers for the impact on the following will be needed:
> - binary size
> - code size
@mhdawson Is https://github.com/nodejs/node/issues/19214#issuecomment-374216527 sufficient? It's from last year but the numbers won't have changed dramatically in that time.
@bnoordhuis thanks. From https://github.com/nodejs/node/issues/19214#issuecomment-374216527, the bottom line was:

> A new git checkout would be about 11M bigger but it's already on the order of ~450M. Source tarballs would be about 5.5M or about 30% bigger than they are now.
The remaining numbers that I think would be good to have in the summary are the additional size on disk once extracted, and the in-memory impact if you don't use any additional languages, which from some of the discussion will not be the same as the additional size on disk.
Numbers I got today (Linux x64):
| ICU | node binary size | loaded memory | execution time |
|---|---|---|---|
| small | 43MB | ~28MB | ~0.02s |
| full | 66MB | ~28MB | ~0.02s |
To measure loaded memory, I ran `./node -p 'process.memoryUsage().rss / 1024 / 1024'`.
To measure execution time, I ran `time ./node -e '1+1'`.
@targos thanks. I guess the other number is the size of the release tarball (which I assume is smaller than the node binary size since it's compressed), as what @bnoordhuis provided was the source tarball sizes.
Does full icu need to be in the repo? Could it be the default for downloads but not for builds from source?
Given those numbers, I'm in favor of just bundling full-icu by default.
FWIW ICU releases are generally on a 9 month cycle. v8 is pretty aggressive about picking up new versions due to new ecma402/ecma262 requirements.
sgtm for full icu.
If full ICU is in the node repo, would the plan be to move to git-lfs? Or just check in the ~26M file as a binary blob `icudt*.dat` without lfs? (~6M with xz, ~10M with gzip, ~9.5M with bz2)
I strongly prefer that we simply check in the (compressed) blob. git-lfs == extra friction.
We can use a python script to decompress it if we don't want a dependency on gunzip(1) or bunzip2(1).
There's https://docs.python.org/3/library/lzma.html; can we use it?
Unfortunately no. `lzma` is Python 3 only.
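For illustration, a minimal sketch of such a build-time helper, assuming a gzip-compressed blob and sticking to modules available in both Python 2 and 3 (file names are examples):

```python
# decompress_icu.py: sketch of a build-time helper. gzip and shutil exist
# in both Python 2 and 3, unlike lzma.
import gzip
import shutil
import sys

def extract(src, dst):
    # e.g. extract('deps/icu/icudt.dat.gz', 'deps/icu/icudt.dat')
    with gzip.open(src, 'rb') as fin, open(dst, 'wb') as fout:
        shutil.copyfileobj(fin, fout)

if __name__ == '__main__':
    extract(sys.argv[1], sys.argv[2])
```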
Does building with full-icu impact heapsnapshot sizes, or memory overhead while taking one?
It shouldn't. ICU data is memory mapped and not allocated. It may bump rss up but should have no impact on heap or generation of heapsnapshots
Maybe this is too radical, but you could consider turning off .gz offerings and forcing .xz only from the point you start offering full-icu onward, that would save on storage and a little release build time, while encouraging users toward a bandwidth saving option too.
Discussed briefly on the TSC call today. The arguments for or against enabling full-icu by default have not really changed much. If we ignore the binary download size issue for just a minute, one of the more significant issues is the amount of reserved runtime memory required by memory mapping the full ICU data set. One compromise approach we could take would be:
- Bundle the full-icu data with the node.js distribution such that it is always available but not used by default.
- Enable the use of full-icu with a command line switch.
This approach would be nearly identical to the existing approach using the full-icu npm distribution but would bundle the full-icu data set rather than having it installed manually. The node.js binary would also Just Know™️ where to look for the data set rather than the user having to provide the full path.
Longer term, what would really help advance this along would be making changes internally to ICU that allow incremental on-demand loading of ICU data subsets. It's been talked about for quite some time but has not yet been implemented. Maybe (ahem, google) some company (ahem, microsoft) could dedicate (ahem, ibm) some resources to making those changes long term?
@jasnell
> the amount of reserved runtime memory required by memory mapping the full ICU data set
This does affect the address space, but with 'full ICU by default' the full ICU data is just a single symbol (pure text) managed by the DLL loader… unless your address space is 30M from being full, what is the impact? There should be no change to the actual active memory area unless you actually use the locales.
> Bundle the full-icu data with the node.js distribution such that it is always available but not used by default
I'm not clear on the benefit here. What does this impact?
> The node.js binary would also Just Know™️ where to look for the data set

> incremental on-demand loading of ICU data subsets
loading from disk or from network/etc?
Just to repeat from discussion on IRC:
if you build ICU with full-icu, there's no mmap() involved. The data is just a large symbol loaded from the read-only pure text segment.
Ah, right, thank you for the reminder on that. It's been a minute since I've thought about ICU related things.
The concern on the memory is specific to container and serverless environments. @MylesBorins could likely expand on that a bit more.
@jasnell I wrote to @MylesBorins on IRC… it should only impact disk/network download size (and repo size). It should not impact memory. But perhaps someone in those environments could help test?
@jasnell
> Enable the use of full-icu with a command line switch.
Perhaps an opt-out for systems where e.g. rss increase could be a significant concern (or where there are other concerns) would be a better option than an opt-in for everyone? Or perhaps we could introduce that in a staged way, with a runtime icu selection flag, and just update the default value in a minor release as we go?
So, testing the docker image https://github.com/srl295/docker-node/commit/e99930011d698295c172f6919d93f4251081ad9e:
```console
$ docker images
REPOSITORY   TAG                   …   SIZE
node         12-alpine-small-icu   …   80.3 MB
node         12-alpine-full-icu    …   105 MB
$ docker run -it --rm node:12-alpine-small-icu -p 'fs.statSync("/usr/local/bin/node").size'
45633344
$ docker run -it --rm node:12-alpine-full-icu -p 'fs.statSync("/usr/local/bin/node").size'
70282992
```
@ChALkeR

> Perhaps an opt-out for systems where e.g. rss increase could be a significant concern

There are two existing options: `configure … --with-intl=none` or `… --with-intl=small-icu`. I don't think we should remove these options. Or do you mean a runtime flag to select `small-icu`?

@srl295, no, I mean a run-time command line flag similar to the `--with-intl` configure option. Would that be viable?
Currently, users cannot rely on full i18n support to be present cross-platform and even cross-distribution, mainly because different package maintainers use different configurations for ICU, and if Node.js was built with `system-icu` one still has to have `libicu` installed. Browsers, on the other hand, generally do support full i18n out of the box.

There is the option to use the `full-icu` package, but it is somewhat awkward to use as it requires an environment variable or command-line switch to work.

Building with `full-icu` currently means a ~40% increase in binary size (on macOS, it goes from 35M to 49M). Is this an acceptable tradeoff? I'm thinking that if we build with it, ICU data should be moved in-tree so the build does not rely on external downloads.

cc: @nodejs/intl