huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0
8.92k stars 777 forks source link

build(node): Include binaries in NPM packing #1459

Closed aaronclong closed 5 months ago

aaronclong commented 7 months ago

https://docs.npmjs.com/cli/v10/configuring-npm/package-json#files https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/ https://github.com/actions-rs/toolchain https://github.com/huggingface/tokenizers/issues/1403

Background


The native modules don't seem packed and published with the node build. There seem to be missing entries in the package.json file. This PR adds all the permutations from the index.js file. However, this could be cleaned up with future glob patterns, "glob pattern (*, */, and such) will make it so that file is included in the tarball when it's packed".

NPM Pack Results

Using the npm pack command, I could confirm the binaries are packed when running the build locally. Github doesn't allow the sharing of .tgz files in PRs. However, the CLI output has been included below.

PR's Pack Output

npm notice 
npm notice 📦  tokenizers@0.14.0-dev0
npm notice === Tarball Contents === 
npm notice 1.1kB  LICENSE                      
npm notice 2.0kB  README.md                    
npm notice 9.7kB  index.d.ts                   
npm notice 10.9kB index.js                     
npm notice 3.9kB  package.json                 
npm notice 7.7MB  tokenizers.linux-x64-gnu.node
npm notice === Tarball Details === 
npm notice name:          tokenizers                              
npm notice version:       0.14.0-dev0                             
npm notice filename:      tokenizers-0.14.0-dev0.tgz              
npm notice package size:  2.6 MB                                  
npm notice unpacked size: 7.7 MB                                  
npm notice shasum:        bba97857358d036d155ddb44aa1ddd3cd5629c6f
npm notice integrity:     sha512-sEodSqm4VaQXp[...]SoeVMMXnoT83Q==
npm notice total files:   6                                       
npm notice 
tokenizers-0.14.0-dev0.tgz

Main's Pack Output

npm notice 
npm notice 📦  tokenizers@0.14.0-dev0
npm notice === Tarball Contents === 
npm notice 1.1kB  LICENSE     
npm notice 2.0kB  README.md   
npm notice 9.7kB  index.d.ts  
npm notice 10.9kB index.js    
npm notice 3.0kB  package.json
npm notice === Tarball Details === 
npm notice name:          tokenizers                              
npm notice version:       0.14.0-dev0                             
npm notice filename:      tokenizers-0.14.0-dev0.tgz              
npm notice package size:  6.8 kB                                  
npm notice unpacked size: 26.7 kB                                 
npm notice shasum:        c54d1095caf5cb8681d9d06fe0bbe92176de3377
npm notice integrity:     sha512-jiDw2etLo5N9X[...]Bixc8Kgp/1TnA==
npm notice total files:   5                                       
npm notice 
tokenizers-0.14.0-dev0.tgz

Github Action Changes

In addition to all the above, I notice most of your actions for the release are deprecated. Github is switching to node20 for actions, and the legacy version might not fully support it.

One thing I couldn't update was the rust init step. The action, actions-rs/toolchain, being used is deprecated and archived.

aaronclong commented 7 months ago

@ArthurZucker could I be annoying and ask you to review this, please and thank you? I totally understand if this isn't a priority right now though.

aaronclong commented 7 months ago

@Narsil I see you are also an active maintainer in this repository. Could I bother you to take a look at this PR?

I'm sorry if I am being pushy at all, but would love to fix this problem for my own uses-cases

anupamamurthi commented 6 months ago

Wow, I seem to be running into the exact same issue -

 Cannot find module 'tokenizers-linux-x64-gnu' from 'index.js'

      at Resolver.resolveModule (node_modules/jest-resolve/build/index.js:169:17)
      at Object.<anonymous> (node_modules/tokenizers/index.js:178:31)

Can someone help review this PR and merge it?

aaronclong commented 6 months ago

@anupamamurthi I'm going to give it a few days, but I might publish a fork until the maintainers start reviewing PRs again.

I'm wonder if this repo is being actively maintained. I've tried several avenues to get someone to review it and the others pending. If you know of anything @anupamamurthi, I'm all ears.

anupamamurthi commented 6 months ago

hey @aaronclong let me know if u publish a fork, I will be happy to give your node package a try....

anupamamurthi commented 6 months ago

@aaronclong do u have a node package ready/handy?

aaronclong commented 6 months ago

@aaronclong do u have a node package ready/handy?

@anupamamurthi I do not at the moment. I was planning to try and work on that this weekend when time allows

aaronclong commented 6 months ago

Giving this one more shot, @ArthurZucker or @Narsil would it be possible to review this PR? I notice another PR was merged only two days ago. I'm very open to any and all feedback. I just personally want to resolve this issue for my own needs. Thanks again and please let me know.

loretoparisi commented 6 months ago

Getting the same error on both darwin and linux:

Error: Cannot find module 'tokenizers-darwin-arm64'
Require stack:

@Narsil any update on this issue? 🙏🏾

aaronclong commented 6 months ago

@anupamamurthi and @loretoparisi it still requires a lot of work, but I was able to publish a fork!!!

As time allows, I'm going to clean up the node code here. My future plan is to pull the tokenizer rust from cargo, and do the napi-rust binding in this repo. Likewise, I would love to try and do a WASM build.

However, you are welcoming to use it as it stands now. I can't speak to how functional it is because just focused on the build fixes.

https://github.com/turingscript/tokenizers https://www.npmjs.com/package/@turingscript/tokenizers