nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
106.43k stars 29.01k forks source link

Building with full-icu by default #19214

Closed silverwind closed 4 years ago

silverwind commented 6 years ago

Currently, users cannot rely on full i18n support to be present cross-platform and even cross-distribution mainly because different package maintainers use different configurations for ICU and if Node.js was built with system-icu one still has to have libicu installed. Browsers on the other hand generally do support full i18n out of the box.

There is the option to use the full-icu package but it is somewhat awkward to use as it requires a environment variable or commandline switch to work.

Building with full-icu is currently a ~40% increase binary size (on macOS, it goes from 35M to 49M). Is this an acceptable tradeoff? I'm thinking that if we build with it, ICU data should be moved in-tree so the build does not rely on external downloads.

cc: @nodejs/intl

jasnell commented 6 years ago

The binary size has always been the key issue with ICU, particularly on resource constrained devices where Node.js is expected to run. As long as we still have the ability to produce small-icu builds, my personal preference would be to build with full-icu by default.

devsnek commented 6 years ago

+1 to this, 35M to 49M is totally worth full i18n

mscdex commented 6 years ago

-1

devsnek commented 6 years ago

@mscdex can you elaborate? i assume you're -1 because of binary size?

srl295 commented 6 years ago

One option is to get the full-icu package working without options. I need help getting the platform part to work.

srl295 commented 6 years ago

https://github.com/nodejs/node/issues/3460

mscdex commented 6 years ago

@devsnek Yes.

TimothyGu commented 6 years ago

I'm also -1 on this for the same reason as @mscdex. I'd rather find a way to fix #3460 instead.

bnoordhuis commented 6 years ago

I was strongly against full-icu when ICU was introduced but I've since made a 180 flip. Ease of use and being able to rely on it just being there outweigh the downside of a bigger binary.

The binary size doesn't particularly trouble me anymore. If that was really an issue we would build without debug symbols or make them available as a separate download, but we don't.

Bigger runtime memory footprint is something that should be checked. ICU is pretty parsimonious though, what I know of it.

silverwind commented 6 years ago

As long as we still have the ability to produce small-icu builds, my personal preference would be to build with full-icu by default.

Yes, building the smaller variants should still be an option. Thought, I don't see platforms where a few MBs more are an issue as the primary consumers for Node.js.

One option is to get the full-icu package working without options.

That would be an improvement, but it's still an issue that users first have to discover this package, probably after wondering why i18n APIs don't work like they do in browsers.

As a package author, I'm reluctant to include full-icu as dependency because of application size concerns (which must be carried by every application as opposed to once if it were built into Node.js).

srl295 commented 6 years ago

@silverwind

users first have to discover this package

it could be a checkbox in the installer to install… or a suggested package within ubuntu, etc

which must be carried by every application

couldn't it be deduped via npm?

I was strongly against full-icu when ICU was introduced but I've since made a 180 flip. Ease of use and being able to rely on it just being there outweigh the downside of a bigger binary.

I wasn't trying to do a bait and switch :) full by default I like for many reasons, simplicity, cultural-linguistic correctness, dropping the words 'English only' from the vocabulary , etc.. But I do understand that the size issues are real. Not insurmountable but real.

devsnek commented 6 years ago

the way i see this, node has been working with negative size constraints because it was always missing full-icu. i build with full-icu everywhere anyway and like @silverwind said the primary consumers of node are not the people who can't download 40mb binaries. that being said, we can still keep our small-icu builds and just add full-icu builds

srl295 commented 6 years ago

ICU should be moved in-tree so the build does not rely on external downloads.

One option as far as download could be to have two sets of downloads (small and full). That means complexity for the build and website project, though.

Some statistics: full-icu gets 47k downloads a month, the underlying data package icu4c-data 41k a month. In Github, 112 repos and 45 packages depend on full-icu.

silverwind commented 6 years ago

ICU is already in-tree, but is specially trimmed to just have small data. Part of the size issue was repo size. So, one step could be include full data in the repo

Yes, that's what I meant, the actual dataset. I don't think we'll need LFS or other nonstandard git extensions. Just need make sure that there are no unneccesary files in the tree.

mhart commented 6 years ago

Two use cases for smaller binaries: docker containers, and serverless bundles (some ppl package the node binary in their Lambda so they don't need to rely on AWS' outdated versions – even more would if the size was smaller, and bundle size is a big factor in startup speed)

As the maintainer of https://github.com/mhart/alpine-node I'm going to be in a bit of bind with this one – I could just force the builds to be small-icu, but then there'll be a slew of issues from users who haven't expected it to diverge from the other Linux builds. I could also offer the option for a small and a full build, but then I'm doubling the work that needs to be done each release.

I'm one of the ppl that has despaired as the binary size has crept up from 18MB over the years 😞

srl295 commented 6 years ago

@mhart:

I could also offer the option for a small and a full build, but then I'm doubling the work that needs to be done each release.

This is hopefully less than double the work, but you could force the builds to be small-icu, and then have another layer(/tag) that just adds the full datafile (via npm i [-g] full-icu perhaps) and sets NODE_ICU_DATA before launching node. Then it's at least not 2x the full container size to maintain.

mhart commented 6 years ago

@srl295 ooh, that's an interesting option actually. So will the full-icu builds be basically 100% compatible with someone installing the full-icu npm module? In that case... I'm not sure why we even include it in node core...? 🤔

devsnek commented 6 years ago

i'm still unsure why we have to stop providing small-icu builds if we add full-icu builds. in my mind we can provide both for every release? all this talk about binary size could even get us on track for streamlining "normal" and "small" node releases with several factors such a debug symbols and icu data etc.

srl295 commented 6 years ago

@mhart the full-icu npm module just helps you get the right binary data file. Once that's there, node starts up and uses that data, rather than the embedded English-only content. A --with-intl=full-icu build on the other hand builds ICU normally, and doesn't strip out the non-English content.

srl295 commented 6 years ago

@devsnek

in my mind we can provide both for every release?

that's an option I mentioned in https://github.com/nodejs/node/issues/19214#issuecomment-372331186 — it could certainly be done that way.

mhart commented 6 years ago

@srl295 what I was getting was more: if this is available as a module, then I think it's worth thinking long and hard about adding it to core in the first place – considering that 95-100% of it won't be used by most node applications.

@silverwind are there some compelling arguments besides "browsers have this built-in"? Are there an unusual number of issues that arise from the current level of icu support?

srl295 commented 6 years ago

@mhart There isn't a 'real' module. It's not modularizable. The full-icu npm module by side effect of installing it causes the data file to be put somewhere, and the end user still has to tell node before startup to go look for it. https://github.com/nodejs/node/issues/3460 mitigates that a little bit, by making node look in a well known place (help needed).

considering that 95-100% of it won't be used by most node applications.

Maybe, but more and more functionality is going through ICU. But as I said above and elsewhere, the pain point of package size is also very real.

There's also the less tangible questions such as, how many developers' experience is impacted by lack of support out of the box for their language, such as https://github.com/nodejs/node/issues/13907

Expected result (from Chrome 58.0.3029.110): '25 июня 2017 г.' Actual result:

v8.1.2: '2017 M06 25' v7.9.0: 'June 25, 2017

What if the user just walked away from Node instead of filing a report? some other quotes from the above and linked issues:

Edit: nevermind, from searching around I discovered that the ICU auto-detection thing is sorta bunk and that I need to explicitly point Node at the ICU package.

Thanks for info. But there's a problem with full-icu package: unicode-org/full-icu-npm#6 (comment) Node v8 not auto detects it..

mhart commented 6 years ago

@srl295 guess i'm very late to the party here, and this is probably off-topic, but is there a way for it to be loaded lazily? ie, post startup? Not looking for a solution for the standard distribution of the Node.js binary with this question, but more for my own knowledge to know if users could potentially install a stripped down version of node with no icu (or system icu) and then require a full icu later if they wanted.

srl295 commented 6 years ago

Define 'late' :) (old context: https://github.com/nodejs/node-v0.x-archive/issues/7676 and https://github.com/nodejs/node-v0.x-archive/issues/6371 )

is there a way for it to be loaded lazily? ie, post startup?

no.

that's the difficulty in https://github.com/nodejs/node/issues/3460 - it can't be implemented as a normal module, or using javascript to do the discovery. everything must be setup before ICU is called first. The only alternative is to shut down everything that is using ICU (including all Intl objects) and unload/reload ICU. All intl objects become unusable. That seems untenable.

mhart commented 6 years ago

@srl295 Is there anything internal that calls ICU in the normal Node.js startup though? So long as it's the first thing a user does in their Node.js script, it should be fine... right?

mhdawson commented 6 years ago

I see a couple of different issues being discussed

1) What should be the default? Should full-icu be the default because otherwise, we may lose people because the initial experience is not as good or is it ok to have people do an extra step to download/opt into full icu.

2) Is download size an issue. If download size is not a problem, but the installation size is, then a larger download binary followed by stripping out of the ICU data file might be an option. If download size does matter, then the only way to make full ICU the default but retain the option for smaller downloads is 2 downloads. That would be more work although we might be able to build/test once and simply package twice by creating one package with the ICU data file and one without.

I don't have enough data to argue for one side or the other on either of the 2 questions yet.

Do we think trying to survey Node.js users through an online survey would provide good enough data to help make the decision? (FYI @dshaw, in case we want help from the user-feedback group to do a survey). I so aww @jasnell has a twitter poll to get some data but something broader might also make sense,

srl295 commented 6 years ago

@mhart v8 links against and loads ICU. I don't think there's any such guarantee.

silverwind commented 6 years ago

@silverwind are there some compelling arguments besides "browsers have this built-in"? Are there an unusual number of issues that arise from the current level of icu support?

Not really, I guess. In my case (linked at the top of this issue), I was just noticing inconsistent locale support when formatting numbers/dates across platforms and thought that should be fixed.

I guess number and date formatting would be the most desireable feature that requires full-icu.

bnoordhuis commented 6 years ago

Observation: apt-get install nodejs has better i18n support than our own builds, courtesy of linking against the system ICU.

TimothyGu commented 6 years ago

Hmm, maybe we could consider autodetecting if the host system has icu installed, and use the system icu if so?

bnoordhuis commented 6 years ago

That makes the binary non-distributable to other systems. We don't link to the system openssl for the same reason.

bnoordhuis commented 6 years ago

We know a full-icu binary is about 40% bigger and I've analysed the size impact on the source.


deps/icu-small is currently about 22M. 2.6M of that is the data file, 19.4M is source code.


ICU 60.2 can be trimmed down to 18.5M source code because source/tools is not needed any more.

Its data file is 26M. Git compresses it to about 11M. xz a.k.a lzma compresses it down to 6.2M.

We could check in the .xz and decompress at build time but I don't know: that only affects the size of a git checkout, the source tarballs are already LZMA-compressed. Probably not worth the complexity.


Impact on compile times: negligible on platforms that support the .incbin assembler directive, we can just include the data file verbatim. Probably even a little faster than it is now because there's no post-processing.

On Windows we have to do a bin2c conversion that produces a ~60M source file and takes 15-30 seconds to compile.

As an optimization we could write out a .obj directly instead of going through an intermediate .c file first. I haven't tried that yet.


To summarize:

  1. Source tarballs would be about 5.5M or about 30% bigger than they are now.

  2. A new git checkout would be about 11M bigger but it's already on the order of ~450M.

srl295 commented 6 years ago

@bnoordhuis might be good to coordinate. I have some comments on your approach above anyway.

As an optimization we could write out a .obj directly instead of going through an intermediate .c file first. I haven't tried that yet.

I guess my comment on this got lost. We should already be doing that, where do you see this not being true?

OK, i looked up bin2c and see it's another tool. Actually, ICU already does its own direct conversion to .obj files on windows and should on other platforms also.

srl295 commented 6 years ago

@bnoordhuis I implemented using dependencies.txt to fetch the list of ICU sources: https://github.com/nodejs/node/compare/master...srl295:icu-use-depslist

gfx commented 6 years ago

Any update for nodejs v10?

I need nodejs with full-icu, anyway.

gfx commented 6 years ago

Finally, I have started to use https://github.com/unicode-org/full-icu-npm in my project. It just works.

ex.

NODE_ICU_DATA=`$(npm bin)/node-full-icu-path` node ...
Vincz commented 6 years ago

Any update about this or @srl295 proposal with the auto detection of the full-icu (see here)? It would be great to have the full icu built-in and don't have to worry about it. @bnoordhuis made a good description of the situation.

devsnek commented 6 years ago

i think the next step here would be someone opening a pr and getting more immediate feedback from that. @srl295 or @silverwind are y'all interested in doing that? if not i'll label this as help-wanted

srl295 commented 6 years ago

A pr for full icu sounds good. I'd keep it simple and not reimplement icu data packaging.

El El mié, may. 23, 2018 a las 7:43 PM, Gus Caplan notifications@github.com escribió:

i think the next step here would be someone opening a pr and getting more immediate feedback from that. @srl295 https://github.com/srl295 or @silverwind https://github.com/silverwind are y'all interested in doing that? if not i'll label this as help-wanted

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/nodejs/node/issues/19214#issuecomment-391569648, or mute the thread https://github.com/notifications/unsubscribe-auth/AA0Ms2UPpmW-ghMjebNbnJOnpRzc0Om7ks5t1h5fgaJpZM4ShQ05 .

silverwind commented 6 years ago

I'd prefer if @srl295 does it, but if he's busy, I guess we'd be open to contributions.

bnoordhuis commented 6 years ago

I have a work-in-progress in https://github.com/bnoordhuis/io.js/tree/fix19214 but I don't expect I'll get back to it this month.

srl295 commented 6 years ago

yes.. @bnoordhuis it has some interesting things, but reimplements a lot of machinery and won't work everywhere.

I am interested- would simplify a lot of things. I'll try a more minimal delta and get some comments.

yankouskia commented 6 years ago

Hi everyone! Are there some updates regarding this? It would be really awesome to have full-icu built-in!

srl295 commented 6 years ago

would putting icudt*.dat into LFS make any sense? seems like a good use case, but of course it adds complexity (ICU is deploying LFS for its source)

devsnek commented 6 years ago

using lfs sounds fine to me. would we be able to overwrite the pointer file with the configure step if needed? (system doesn't have lfs installed)

srl295 commented 6 years ago

@devsnek one could always pass a URL to download URL and just redownload ICU if lfs in't installed. Or I guess configure could detect the situation - icudt##l.dat is preposterously small (<10k) - is LFS installed?

srl295 commented 6 years ago

Open issues: (check off as they are decided)

PazzaVlad commented 6 years ago

+1 for making full-icu by default. Also when you install Node via Brew, it already has full-icu.

felixfbecker commented 5 years ago

full-icu doesn't have to be the default, but could Node provide distributions of full-icu builds? i.e.

This allows version managers like brew or nvm to provide a switch --with-full-icu that then downloads the full-icu binary / provides it to the MSI

srl295 commented 5 years ago

@felixfbecker that would be an option that would have no code change here.