jtoy / cld

compact language detection in ruby
BSD 3-Clause "New" or "Revised" License

Upgrade to CLD2 #9

Open cbandy opened 10 years ago

cbandy commented 10 years ago

See #8.

Before this is merged, we should update our licensing. The library has changed to the Apache license.

The size of the bundled library has grown significantly. The source itself is over 90 MiB. The gem is now 35 MiB (up from 6 MiB), and once installed it uses 93 MiB (up from 17 MiB). If CLD2 ever releases a tarball, we can stop bundling it here and shrink the installed size to 2 MiB.

There are two possible CLD2 libraries to link against: libcld2.so and libcld2_full.so. The latter can detect twice as many languages and is 4 MiB larger. I arbitrarily chose the former, smaller library in this PR. Which would you prefer to be used by default? In either case, we can also make this configurable during gem install.
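As a rough sketch of what "configurable during gem install" could look like, the snippet below picks the library name from the arguments passed to the extension build. The flag name `--enable-cld2-full` and the helper `cld2_library` are hypothetical and are not part of this PR.

```ruby
# Hypothetical helper for the extension's extconf.rb.
# The flag name below is an assumption made for illustration.
FULL_FLAG = "--enable-cld2-full"

# Returns the name of the library to link against, without the
# "lib" prefix or ".so" suffix. Defaults to the smaller library.
def cld2_library(args = ARGV)
  args.include?(FULL_FLAG) ? "cld2_full" : "cld2"
end
```

Arguments after `--` on the `gem install` command line are passed through to the extension build, so a user could opt in with something like `gem install cld -- --enable-cld2-full`. In a real extconf.rb, mkmf's `enable_config` could do this parsing instead of a hand-written check.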

jtoy commented 10 years ago

wow, that is a very large gem! is there any way we can reduce this? 6mb was already too much.

cbandy commented 10 years ago

I found that some of the CLD2 source files are not necessary to build the libraries. The gem is now 17 MiB and uses 46 MiB when installed. If we commit to just one of libcld2.so or libcld2_full.so, we can reduce this further.

The unavoidable fact is that the source contains large tables of pre-computed n-grams. cld2_generated_quad0122.cc is required to build libcld2_full.so and is 27 MiB. Gems are already compressed, so minimizing the number of these source files in the shipped gem is the only way to save bits.

If CLD2 were to release an archive/tarball, we could ship zero source files and download it before compiling the extension using something like mini_portile.

I looked into downloading bare files from the project repository, but we either need to

  1. depend on more tools (e.g. svn or wget) or
  2. maintain something approaching their complexity or
  3. maintain a list of source files/URLs to download.

cbandy commented 10 years ago

Another option is to ship binary/pre-compiled gems. At first pass, it looks like the smaller gem would be less than 2 MiB and the larger would be less than 5 MiB.

I don't have any experience releasing a binary gem.
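For reference, a minimal sketch of how a pre-compiled gem's specification differs from a source gem: the platform is set to the build machine's platform, the compiled library is listed in `files`, and `extensions` stays empty so nothing is compiled at install time. The version number, file paths, and the helper name `binary_cld_spec` are all hypothetical.

```ruby
require "rubygems"

# Hypothetical specification for a pre-compiled variant of the gem.
# Version and file paths are placeholders, not the project's real values.
def binary_cld_spec
  Gem::Specification.new do |s|
    s.name     = "cld"
    s.version  = "0.0.0"
    s.summary  = "Compact language detection in Ruby (pre-compiled)"
    s.authors  = ["placeholder"]
    # Tag the gem with the platform it was built on.
    s.platform = Gem::Platform::CURRENT
    # Ship the compiled library instead of the CLD2 sources.
    s.files    = ["lib/cld.rb", "lib/cld/libcld2.so"]
    # No extconf.rb listed, so nothing is built during install.
    s.extensions = []
  end
end
```

One gem would need to be built and pushed per supported platform, alongside the plain source gem as a fallback.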

mattdoller commented 9 years ago

Any chance there has been any progress or updates with this? I'd love to help out with this if possible.

adityapatadia commented 9 years ago

I would also like to contribute. Let's solve this issue asap. It has been pending for more than a year just because of the size of CLD.

adityapatadia commented 9 years ago

Here is a similar implementation in JavaScript. We can take cues from that: https://github.com/dachev/node-cld

craig-day commented 9 years ago

@jtoy can we reconsider this? The gem did get larger, but so did the source library. I don't think there is a clean way to avoid this and still allow anyone to use the gem.

mmahalwy commented 9 years ago

any update on this?

grosser commented 9 years ago

@craig-day can you merge and release this ?

craig-day commented 9 years ago

I'll take a look hopefully tomorrow or Monday morning at the latest.

On Oct 10, 2015, at 8:38 PM, Michael Grosser notifications@github.com wrote:

@craig-day can you merge and release this ?

cbandy commented 9 years ago

The CLD2 project has moved to https://github.com/CLD2Owners/cld2/

craig-day commented 8 years ago

@cbandy is this still ready to go? I'd like to merge and release a new major version.

cbandy commented 8 years ago

It has been a long time since I looked at this.

> If CLD2 were to release an archive/tarball...

I still don't see a tarball; at least not one provided by GitHub tags/releases.

> I looked into downloading bare files from the project repository...

Maybe this is more reasonable now that it is hosted in Git? I forget how common it is for gem installers to have git available.

cbandy commented 8 years ago

Should we pull in any changes made to CLD2 since May 2014?

This appears to be the revision/commit that I imported in this PR: https://github.com/CLD2Owners/cld2/commit/d076f5eda223ac568639d6288f2e2d70d908f282

craig-day commented 8 years ago

@cbandy can you update the readme link and pull in any new changes? I'm not sure if the tarball is a concern right now. I'd rather avoid a git dependency because not all places gems get installed have git (like production servers).
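One git-free approach, along the lines of the "maintain a list of source files/URLs" option raised earlier in the thread, might look like the sketch below. It builds raw-file URLs for a pinned commit and fetches them with Ruby's standard library only. The file list and the `internal/` path are assumptions for illustration, as are the helper names. The commit is the one cbandy identified above.

```ruby
require "uri"
require "net/http"
require "fileutils"

CLD2_REPO   = "CLD2Owners/cld2"
CLD2_COMMIT = "d076f5eda223ac568639d6288f2e2d70d908f282"
# Hypothetical list; a real one would name every file needed for the build.
CLD2_SOURCES = ["internal/cld2_generated_quad0122.cc"].freeze

# Builds one raw-file URL per source path, pinned to a single commit.
def cld2_source_urls(files = CLD2_SOURCES, commit: CLD2_COMMIT)
  files.map do |path|
    URI("https://raw.githubusercontent.com/#{CLD2_REPO}/#{commit}/#{path}")
  end
end

# Downloads each file into dest_dir, preserving relative paths.
# Raises if any request does not succeed.
def download_cld2_sources(dest_dir, files = CLD2_SOURCES)
  cld2_source_urls(files).zip(files).each do |uri, path|
    response = Net::HTTP.get_response(uri)
    raise "failed to fetch #{uri}" unless response.is_a?(Net::HTTPSuccess)
    target = File.join(dest_dir, path)
    FileUtils.mkdir_p(File.dirname(target))
    File.binwrite(target, response.body)
  end
end
```

The trade-off is the one already noted: the file list has to be kept in sync with upstream by hand, and installation then requires network access.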

craig-day commented 8 years ago

As far as licensing goes, I think you can copy the Apache license from the CLD2 owners. It looks like our original license was just copied from them anyway.

guilleiguaran commented 6 years ago

@cbandy I don't think this project will be updated. I suggest you release your code as a new cld2 gem.