huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0
8.67k stars 743 forks

feat: add support for pyarrow arrays as input #1535

Open notjedi opened 1 month ago

notjedi commented 1 month ago

closes #1415

notjedi commented 1 month ago

yet to add support for pyarrow.LargeString

notjedi commented 1 month ago

pulling arrow from git because the current stable version (v51.0) links against pyo3 v0.20, while the bindings link against v0.21. there is already a merged pull request (apache/arrow-rs#5566) that migrates from pyo3 v0.20 to v0.21, which i think will be included in the next release, v52.0.

HuggingFaceDocBuilderDev commented 3 weeks ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

notjedi commented 3 weeks ago

I think we can leave the dep as optional wdyt?

makes sense, will get it changed @ArthurZucker

notjedi commented 3 weeks ago

i've added pyarrow as a feature in pyproject.toml. not sure if that is what we want here. let me know if i need to fix anything else. @ArthurZucker

notjedi commented 3 weeks ago

yeah will update that once there is a stable release of arrow (v52.0). not really, i don't have any benchmarks and the only example i have is in the tests. i don't need this personally, just wanted to close an open issue and contribute.

strategy155 commented 3 weeks ago

Good afternoon everyone. What do you think the optimal benchmark should look like? Native pyarrow vs numpy conversion for string arrays?

ArthurZucker commented 2 weeks ago

something like that, yeah. I have no idea in what context pyarrow is used as I have not used it, but the benchmark should cover its optimal usage context
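
A minimal sketch of the kind of benchmark discussed above, under stated assumptions. It times only the two conversion paths that turn a pyarrow string array into a Python list of strings, which is the step a native pyarrow input path would presumably skip. The helper names `bench` and `pyarrow_candidates`, the row count, and the sample strings are all hypothetical and not taken from the PR. The tokenizer call itself is left out because the thread does not show the final API.

```python
import timeit
from typing import Callable, Dict


def bench(candidates: Dict[str, Callable[[], object]],
          repeat: int = 5, number: int = 3) -> Dict[str, float]:
    """Return the best-of-`repeat` wall time (seconds per call) for each candidate."""
    results = {}
    for name, fn in candidates.items():
        timings = timeit.repeat(fn, repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results


def pyarrow_candidates(n_rows: int = 100_000) -> Dict[str, Callable[[], object]]:
    """Build the conversion paths to compare (hypothetical helper).

    Requires pyarrow and numpy; imported lazily since pyarrow is optional.
    """
    import pyarrow as pa

    # Synthetic data: an assumption, real corpora will behave differently.
    arr = pa.array([f"sentence number {i}" for i in range(n_rows)],
                   type=pa.string())
    return {
        # Arrow array -> list[str] directly.
        "to_pylist": lambda: arr.to_pylist(),
        # Arrow array -> numpy object array -> list[str].
        "to_numpy_tolist": lambda: arr.to_numpy(zero_copy_only=False).tolist(),
    }


# Example use (needs pyarrow and numpy installed):
# for name, seconds in bench(pyarrow_candidates()).items():
#     print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

A third candidate that hands the pyarrow array straight to the tokenizer could be added to the same dictionary once the input API from this PR is settled, so that all paths are timed on identical data.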