node-unicode / node-unicode-data

JavaScript-compatible Unicode data generator. Arrays of code points, arrays of symbols, and regular expressions for every Unicode version’s categories, scripts, blocks, and properties — neatly packaged into a separate npm package per Unicode version.
https://mths.be/node-unicode-data
MIT License

Unicode is big =/ Let's split it up... #14

Open iarna opened 10 years ago

iarna commented 10 years ago

This module is necessarily huge (108MB installed, for 6.3.0!), as it encompasses all of Unicode. In practice, though, I find that I only want a relatively small subset of it at any given time. I would propose splitting the build into separate modules for each of the individually loadable pieces, such that a module could depend on unicode-6.3.0-categories-L-regex. From a usage point of view, this seems nearly exclusively a win. The one possible downside would be a huge proliferation of new modules that might, given the current state of npm search, make finding the right module even more difficult for a new user.

Would there be interest in a pull-request to this effect? I'd be happy to put that together if there was interest in using it.

aredridel commented 10 years ago

:+1: This would be a huge win I think.

mathiasbynens commented 10 years ago

Out of curiosity, are you using these packages as part of build scripts or as run-time dependencies? I’ve only used them for build scripts (in my dev environment) so far, and in such cases the file size doesn’t really matter.

Making hundreds of packages for each Unicode version sounds like overkill IMHO. Would it help to just make separate packages for unicode-6.3.0-categories, unicode-6.3.0-scripts, unicode-6.3.0-blocks, unicode-6.3.0-bidi, unicode-6.3.0-bidi-mirroring, unicode-6.3.0-bidi-brackets?

(Btw, I would be interested to see how you’re using these packages!)

mathiasbynens commented 10 years ago

FWIW, the download size of the tarball for the unicode-6.3.0 module (http://registry.npmjs.org/unicode-6.3.0/-/unicode-6.3.0-0.1.0.tgz) is ~16.49 MB. Not too bad IMHO.

iarna commented 10 years ago

I'm currently using the module as a runtime dependency for a lexer. Specifically, for the category L regex.
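A minimal sketch of that runtime usage (the require path follows the package layout used elsewhere in this thread):

var L = require('unicode-6.3.0/categories/L/regex');
console.log(L.test('é')); // true: U+00E9 is in the Letter category
console.log(L.test('!')); // false: not a letter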

Size-wise, it may be 16.49MB compressed, but unextracted it's 108MB. It's a substantial drag on build/install times, and it hugely clutters up end-user apps that have added their node_modules as part of their release process (e.g. http://www.futurealoof.com/posts/nodemodules-in-git.html et al).

And beyond that, it seems absurd to me to require 108MB of JS just because I need a 5KB regex.

iarna commented 10 years ago

Personally, I'd be ok with exposing only the regex interface this way, as it's substantially faster than the other options, and I'd be hard pressed to name a use case where you'd want the others instead of it.

mathiasbynens commented 10 years ago

Interesting. I never intended the data packages to be used for anything other than build scripts.

Have you considered using regenerate in tandem with one of these packages, as devDependencies (rather than dependencies) as part of a build script (rather than at run time)? E.g. you could have a script named build-regex.js that does something like:

var regenerate = require('regenerate');
var L = require('unicode-6.3.0/categories/L/code-points'); // or `…/symbols`, doesn’t matter
console.log(regenerate(L).toString());
// Then, maybe write this to a file
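// A sketch of the file-writing step (the output filename below is
// hypothetical, not part of the original suggestion):
var fs = require('fs');
fs.writeFileSync(
  'category-L-regex.js',
  'module.exports = /' + regenerate(L).toString() + '/;'
);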

This won’t solve all the problems you outlined but it would solve some. What do you think?

mathiasbynens commented 10 years ago

Personally, I'd be ok with exposing only the regex interface this way, as it's substantially faster than the other options […]

Hmm, yeah, maybe we could move/copy the regular expressions to a separate unicode-6.3.0-regex package. The regex data indeed only takes up a small percentage of the package’s total file size.

and I'd be hard pressed to name a use case where you'd want the others instead of it.

FWIW, here’s a real-world example: https://github.com/ariya/esprima/blob/598d21bf26dfa3dd50b00b1a6c975d5612f4c8ab/tools/generate-identifier-regex.js

I could also imagine a situation where you want to construct a regex that matches all Unicode symbols in, say, the Arabic and Greek scripts.

var regenerate = require('regenerate');
var Arabic = require('unicode-6.3.0/scripts/Arabic/code-points'); // or `…/symbols`, doesn’t matter
var Greek = require('unicode-6.3.0/scripts/Greek/code-points'); // or `…/symbols`, doesn’t matter
var set = regenerate()
  .add(Arabic)
  .add(Greek);
console.log(set.toString());

// Then you might want to use a template like this to write the result to a file, along with any regex flags you might need:
// var regex = /<%= set.toString() %>/gim;
iarna commented 10 years ago

Eep, didn't mean to delete that comment. I wish issues were git repositories too. I wonder if I can recover it from email or RSS or suchlike...

cscott commented 10 years ago

@iarna I think I've got your deleted comment here in my email (thanks, GitHub notifications). It was:

I don't think regenerate, as you outline it, would get me anything in particular; at least, not anything I wouldn't get just by having my build copy that one file out of unicode-6.3.0 directly.

(FWIW my actual use case: https://github.com/iarna/sql-lexer/blob/master/sql92/token-matcher-L0.js#L56)

As far as code points go, yes, I see what you mean: combining different classes together.

Out of curiosity, when would you actually use this as a part of a build script? Just generating use-case specific JS?

Given the current state of npm (specifically, the inability to have submodules), I get not wanting to clutter things up. I'll experiment with how a regex-only version would feel…

iarna commented 10 years ago

Thanks; it wasn't in my email. I guess you only get other people's updates in your own email.

cscott commented 10 years ago

And, responding to your earlier comment: I would use this as part of a build script if I were implementing a spec that referenced a particular Unicode property. For example, the Java Language Specification says that valid identifier characters are members of a particular set of Unicode classes. That set of characters doesn't change often, but it does change from time to time. I would probably use a build script to create the regex for "valid Java identifier character" based on the latest Unicode data, and then just use that regex in my lexer.
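As a rough sketch of that workflow (the category choice below is an approximation of the JLS definition, not the exact spec):

var regenerate = require('regenerate');
// Approximate "valid Java identifier start character" as the union of
// letters (L), letter numbers (Nl), currency symbols (Sc), and
// connector punctuation (Pc); a simplification of the JLS rules.
var set = regenerate()
  .add(require('unicode-6.3.0/categories/L/code-points'))
  .add(require('unicode-6.3.0/categories/Nl/code-points'))
  .add(require('unicode-6.3.0/categories/Sc/code-points'))
  .add(require('unicode-6.3.0/categories/Pc/code-points'));
console.log('var javaIdentifierStart = /' + set.toString() + '/;');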

iarna commented 10 years ago

Ok, so yes, just generating use-case-specific JavaScript files.

So a regex-only distribution would be only 1.4MB. And of course one can combine them, e.g.:

var Arabic = require('unicode-6.3.0/scripts/Arabic/regex');
var Greek = require('unicode-6.3.0/scripts/Greek/regex');
var set = new RegExp('['+Arabic.source+Greek.source+']');
console.log(set.toString());
mathiasbynens commented 10 years ago

There are a few problems with that approach:

P.S. Your sample code has some other bugs: the output is wrapped in an extra set of [] and there should probably be a | between Arabic.source and Greek.source. All the more reason to avoid doing this manually, IMHO.

mathiasbynens commented 10 years ago

Perhaps we should remove the regular expressions from the packages altogether. They’re not intended to be toString()ed and concatenated etc. anyway (https://github.com/mathiasbynens/node-unicode-data/issues/14#issuecomment-43635528). They’re only useful when you need to match an existing Unicode category/script/block/… as-is at run-time, but in that case installing these big packages might be overkill, as @iarna says.

iarna commented 10 years ago

Yeah, I blame early-morning muzziness for the character class mess in the example.

I don't think the overlap from:

var set = new RegExp(Arabic.source+'|'+Greek.source);

is actually worth worrying about.

If .source is useless =( (although I don't see that: the bug shows that toString() is only useful for debug output, but it doesn't show the same for .source), then having character ranges for use in character classes would provide a safer alternative; it might be superior too, as it'd allow concatenation in a single character class, which most regexp engines can optimize pretty well.

mathiasbynens commented 10 years ago

it might be superior too, as it'd allow concatenation in a single character class, which most regexp engines can optimize pretty well

Note that this is what the Regenerate-based approach boils down to. It uses a single character class if the set can compactly be represented that way.
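For example (the exact output formatting may vary between Regenerate versions):

var regenerate = require('regenerate');
var set = regenerate().addRange(0x41, 0x5A).addRange(0x61, 0x7A); // A-Z, a-z
console.log(set.toString()); // '[A-Za-z]': a single compact character class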

Do you think using Regenerate is not an acceptable solution? Why?

iarna commented 10 years ago

Can regenerate work from regex-style ranges? My impression was that it accepted character lists only, and those are the giant data I'm trying to avoid including in the first place. (On a phone, or I'd check myself.)

Edited now that I'm at a computer with internet access: the answer is no, it only takes characters and character codes, though it will accept ranges. Of course, the regexes currently generated by node-unicode-data are, weirdly, not character classes, but rather alternations mixed with character classes.

iarna commented 10 years ago

Ok, given this discussion, this is my proposal:

What I want is a published module per version of unicode that has files that export only strings suitable for embedding in regex character classes. I could either do that as a patch to this module, or as a module that requires this one and is published separately. What's your preference?

mathiasbynens commented 10 years ago

Can regenerate work from regex-style ranges?

I think this is what you mean: https://github.com/mathiasbynens/regenerate#regenerateprototypeaddrangestart-end

Of course, the regexes currently generated by node-unicode-data are, weirdly, not character classes, but rather alternations mixed with character classes.

That is necessary if the set contains astral symbols and the output needs to be compatible with existing ES5 implementations.

mathiasbynens commented 10 years ago

What I want is a published module per version of unicode that has files that export only strings suitable for embedding in regex character classes.

Why do you want the output to fit in a single character class? What would that character class look like if you wanted to match, say, U+010000 to U+10FFFF (or any other range of astral symbols)? It is impossible to do that in a single character class in an ES5-compatible regex.
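To illustrate: in ES5, astral code points only exist as surrogate pairs, so covering that entire range takes a sequence of two character classes rather than one:

var astral = /[\uD800-\uDBFF][\uDC00-\uDFFF]/;
console.log(astral.test('\uD800\uDC00')); // true: U+10000 as a surrogate pair
console.log(astral.test('A'));            // false: BMP symbols don't match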

iarna commented 10 years ago

Regarding ranges, no, that's not what I meant; I meant inputting regexes, but I hadn't read regenerate's docs yet. I was hoping it was more equivalent to Perl's regexp tools, which are a bit more mature, e.g. Regexp::Assemble, Regexp::Optimizer, etc. Of course, Perl's never had to work around a broken/incomplete Unicode implementation.

iarna commented 10 years ago

Insofar as supporting astral characters goes, I'd go back to doing as you're doing now and returning a regexp as it is today.

If JS's regexp wasn't so brain damaged, the advantage of returning character-class compatible data is that you can use it like [\p{Arabic}\p{Greek}] in a unicode aware regexp engine.

mathiasbynens commented 10 years ago

If JS's regexp wasn't so brain damaged, the advantage of returning character-class compatible data is that you can use it like [\p{Arabic}\p{Greek}] in a unicode aware regexp engine.

XRegExp allows you to do something like that, but it’s not actively maintained anymore. (I’m responsible for its Unicode data, but in the HEAD version the Unicode data is outdated. I have a PR for it that has been pending for months).
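For reference, a sketch of what that looks like with XRegExp, assuming its Unicode addons are loaded (and bearing in mind the outdated-data caveat above):

var XRegExp = require('xregexp'); // with the Unicode Scripts addon available
var regex = XRegExp('[\\p{Arabic}\\p{Greek}]');
console.log(regex.test('\u0633')); // true: ARABIC LETTER SEEN
console.log(regex.test('\u03B1')); // true: GREEK SMALL LETTER ALPHA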

I’m trying to understand why Regenerate doesn’t seem to be an option for you. It seems like you’re jumping through hoops just so you wouldn’t have to use it :) Anything I can do about that? Is it because you want to avoid the dependency on the big packages at all costs, even for devDependencies?

iarna commented 10 years ago

Ok, so as far as Regenerate and my specific use case go, 'cp' or equivalent would do the job just as well. So I don't really see the point.

As far as why I don't want it as any kind of dependency... have you actually tried installing it?

$ time npm install unicode-6.3.0
npm http GET https://registry.npmjs.org/unicode-6.3.0
npm http 200 https://registry.npmjs.org/unicode-6.3.0
npm http GET https://registry.npmjs.org/unicode-6.3.0/-/unicode-6.3.0-0.1.3.tgz
npm http 200 https://registry.npmjs.org/unicode-6.3.0/-/unicode-6.3.0-0.1.3.tgz
unicode-6.3.0@0.1.3 node_modules/unicode-6.3.0

real    1m18.246s
user    0m51.086s
sys 0m25.053s

Now, this does seem to point to some sort of npm brokenness, as doing roughly the same steps by hand is MUCH faster:

$ time (wget -q https://registry.npmjs.org/unicode-6.3.0/-/unicode-6.3.0-0.1.3.tgz && tar xzf unicode-6.3.0-0.1.3.tgz && cd package && npm install)

real    0m6.681s
user    0m1.098s
sys 0m0.869s
mathiasbynens commented 10 years ago

I don't want it as any kind of dependency... have you actually tried installing it?

Yes, several of my projects depend on unicode-6.3.0, as do Esprima and JSHint for their identifier regex generation. It looks like I’m not having the issue you’re describing, though:

$ time npm install unicode-6.3.0
info trying registry request attempt 1 at 22:14:45
http GET https://registry.npmjs.org/unicode-6.3.0
http 200 https://registry.npmjs.org/unicode-6.3.0
unicode-6.3.0@0.1.3 ../node_modules/unicode-6.3.0

real    0m10.459s
user    0m7.574s
sys 0m4.078s

Note that I’m on a crappy hotel WiFi network as I write this.

Ok, so as far as Regenerate and my specific use case go, cp or equivalent would do the job just as well. So I don't really see the point.

The point would be moving unicode-6.3.0 from being a dependency to being a devDependency.

Thanks to this thread, I’m starting to think more and more that the regex data should just be removed from the packages, as it doesn’t compose nicely, and it’s not meant to be used at runtime anyway.

iarna commented 10 years ago

As a devDependency it is a little better, but I'm still not fond of modules that can only be meaningfully used via a build step. I find that it substantially increases the friction to using the module. One of the joys of npm is how low the friction is to using (most of) the modules from it.

To me, reducing friction to using a module in an intended way is one of the core things that all modules should aspire to. Adding a build step when there was none substantially increases the cost of using a module, to the point that I'd be inclined to inline the piece I needed by HAND over adding that build step. Adding a 'make'-based build step is easy, but would screw Windows users, so that means using one of the pure-JS build systems, like grunt, and... that adds a whole new layer of weight to what was otherwise "I want to include this 4k of regex".

iarna commented 10 years ago

Yeah, I'm on a business line that gets about 3 MB a second from NPM. It's definitely not the download that makes it slow.

That test above was from my local Mac. HFS+ is a particularly crappy file system, so it's probably related in part to that. From a Linux VM hosted at Linode, it is a little better:

$ time npm install unicode-6.3.0
npm http GET https://registry.npmjs.org/unicode-6.3.0
npm http 304 https://registry.npmjs.org/unicode-6.3.0
unicode-6.3.0@0.1.3 node_modules/unicode-6.3.0

real    0m23.424s
user    0m16.320s
sys 0m10.435s

Although the above is still 6 times slower than the wget/tar/npm-by-hand combination on the same machine.

$ time (wget -q https://registry.npmjs.org/unicode-6.3.0/-/unicode-6.3.0-0.1.3.tgz && tar xzf unicode-6.3.0-0.1.3.tgz && cd package && npm install)

real    0m4.750s
user    0m2.257s
sys 0m1.240s
mathiasbynens commented 10 years ago

reducing friction to using a module in an intended way is one of the core things that all modules should aspire to.

I hear you, and I agree. I guess this is an exception though, since the intended usage for the unicode-data packages is for them to be used as part of build scripts, not at runtime.

Adding a 'make' based build step is easy, but would screw Windows users, so that means using one of the pure JS build systems, like grunt, and... that adds a whole new layer of weight to what was otherwise "I want to include this 4k of regex".

The build script could just be npm run build, which triggers something like node create-regex-and-write-it-to-a-file.js and works on all platforms. See @substack’s http://substack.net/task_automation_with_npm_run.
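A sketch of the relevant package.json bits (version numbers are illustrative):

{
  "scripts": {
    "build": "node create-regex-and-write-it-to-a-file.js"
  },
  "devDependencies": {
    "regenerate": "^1.0.0",
    "unicode-6.3.0": "^0.1.3"
  }
}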

iarna commented 10 years ago

And yet, I have to go write that. So again, friction.

As this package isn't intended for my use case, I'm going to go back to my second suggestion and make one that is, using this one as a data dependency (that is, using regenerate as part of a build step). That way I only have to suffer the friction once, and we get a module that exposes interfaces designed for runtime use.

mathiasbynens commented 10 years ago

@iarna You’ve created https://github.com/iarna/unicode-6.3.0-categories-L-regex, but now that Unicode v7.0.0 is out you probably want to use the newer data. Now you have to create a new project (as you cannot just rename the old one on npm) just to update the data. And next year, when Unicode v8.0.0 is released, the story repeats.

To me, this sounds like way more friction than using the build-script-based approach I described above. That way, you could’ve just updated the devDependency from unicode-6.3.0 to unicode-7.0.0 in your lexer project, run the build script again, and had a new file generated with the regular expression to be used at runtime.

iarna commented 10 years ago

But I can in fact just rename it, because it's not published on npm. ;)

We never disagreed about programmatically generating the module I needed. We just disagreed as to when it should happen: at publish time of unicode-x.x.x, or at publish time of every module that ever needs unicode-x.x.x.

Personally, I still tend toward the former. I'm not keen on making library end users absorb the complexity of a build step. Most of my use cases don't have build steps, so this is a major addition. By contrast, if one lives in a front-end world, one already has a build step, so it's no big deal.

All that said, on reflection, I don't currently favor my original request, due to registry clutter. This is why I didn't go ahead with my own take on this. I'm waiting for the npm registry to get collections (and maybe the ability to hide modules from general search) before reconsidering.

mathiasbynens commented 10 years ago

I'm not keen on making library end users absorb the complexity of a build step.

What makes you think end users would be affected? Only the library developer (i.e. you) would have to run the build script, and then publish the generated file as part of the library’s package.

iarna commented 10 years ago

By end users, I mean "end users of your library". This might be another library (my use case) or it might be an application.

AlexanderZeilmann commented 9 years ago

Is there any plan right now to publish the categories (especially the different letter categories) as independent node modules?

The problem of the unicode-data packages being too big came up in issue #921 for jscs. It increased the size of jscs by a factor of 18...

mathiasbynens commented 9 years ago

@alawatthe Thanks for the heads up. I believe this is not a problem that should be solved in the unicode-data project but rather in node-jscs: https://github.com/jscs-dev/node-jscs/issues/921#issuecomment-69986167 (The same discussion as in this thread, really.)

gagern commented 8 years ago

In the spirit of this thread here: I've toyed with a little compression scheme, and managed to get the L category codepoint list from 690kB to just under 2kB. I'm using ranges internally, so it would be fairer to compare against the regexp, but that's still 6.5kB. I'm squeezing out the rest using relative offsets, string literals and a custom dictionary. See all of this in a Gist as proof of concept.

Would it make sense to add this kind of representation to the modules generated here, so that each codepoint module is essentially something like

module.exports = require('../../util/decompress')([…], "…");

with the dictionary and string literal generated in a way similar to my proof of concept? The outside interface would remain the same, while the download size might be reduced considerably. Simply copying one such file into a project of your own would no longer be an option, unless the decompression function is inlined into each of these files.
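As a toy illustration of the range-based part of the idea (not the actual scheme from the Gist, which additionally uses string literals and a custom dictionary):

// Expand [relativeOffset, length] pairs into a flat code point array.
function decompress(pairs) {
  var result = [];
  var codePoint = 0;
  pairs.forEach(function (pair) {
    codePoint += pair[0]; // jump relative to the end of the previous range
    for (var i = 0; i < pair[1]; i++) {
      result.push(codePoint + i);
    }
    codePoint += pair[1];
  });
  return result;
}

console.log(decompress([[0x41, 3], [2, 2]])); // [65, 66, 67, 70, 71]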

mathiasbynens commented 8 years ago

@gagern That’s pretty cool! Nice work.

If I were to go the compression route, I’d probably export Regenerate sets, similar to what regexpu uses. Then people would have to call set.toArray() to get a proper array, though :( I’d prefer to keep the data uncompressed, for the following reasons: