huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Support PyArrow arrays as tokenizer input #1415

Open mariosasko opened 6 months ago

mariosasko commented 6 months ago

Most data processing libraries (Datasets, Polars, Pandas, DuckDB, etc.) are integrated with PyArrow. Native support for PyArrow arrays as tokenizer input (zero-copy where possible) would be nice, since it would avoid the unnecessary PyArrow-to-Python/NumPy conversion, which is pretty slow for string arrays.

PS: PyArrow has recently added support for the PyCapsule interface, which should help with the implementation.
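
As a rough illustration of why the PyCapsule interface helps: a consumer can detect Arrow-compatible inputs by protocol method instead of importing PyArrow and checking concrete types. The sketch below is only an assumption about how input dispatch could look. `classify_tokenizer_input` and `FakeArrowArray` are hypothetical names, not part of tokenizers or PyArrow; only the `__arrow_c_array__` and `__arrow_c_stream__` method names come from the Arrow PyCapsule interface.

```python
def classify_tokenizer_input(obj):
    """Return a label for the input path an object could take (hypothetical helper)."""
    # Objects exporting a single Arrow array expose __arrow_c_array__.
    if hasattr(obj, "__arrow_c_array__"):
        return "arrow-array"
    # Objects exporting a stream of Arrow arrays expose __arrow_c_stream__.
    if hasattr(obj, "__arrow_c_stream__"):
        return "arrow-stream"
    # Fallback: a plain Python sequence of strings.
    if isinstance(obj, (list, tuple)) and all(isinstance(s, str) for s in obj):
        return "python-strings"
    raise TypeError(f"unsupported tokenizer input: {type(obj).__name__}")


class FakeArrowArray:
    """Stand-in for an Arrow-compatible array; only the method name matters here."""

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("illustration only, no capsules are produced")


print(classify_tokenizer_input(["a", "b"]))
print(classify_tokenizer_input(FakeArrowArray()))
```

Because the check is by method name, the same path would work for any library that implements the protocol, not just PyArrow.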

ArthurZucker commented 6 months ago

Would you like to open a PR for this? 🤗

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

mariosasko commented 5 months ago

Marking this issue as a good first issue. If it doesn't get addressed after a while, I'll take a stab at it.

WenheLI commented 3 months ago

Interested in this. Can someone get it assigned to me?

mariosasko commented 3 months ago

@WenheLI Assigned :)

WenheLI commented 3 months ago

@mariosasko - Sorry, just saw this! Could you guide me on how to get started? I am still new to this project. Thanks a lot for your help!

mariosasko commented 3 months ago

Sure! The idea is to use the Rust arrow crate (e.g., with ArrayData::from_pyarrow) to decode PyArrow StringArrays/LargeStringArrays when they are given as input to the Tokenizer. You can find the relevant code here (maybe this PR can also help, which has done the same thing for NumPy arrays).

To build the project, check this workflow file, in particular the part that installs the dependencies.
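
For anyone new to Arrow, here is a minimal pure-Python sketch of what "decoding" a StringArray involves. It is not the tokenizers implementation, and the helper names `encode_string_array` and `decode_string_array` are made up for illustration. It relies on the Arrow variable-size binary layout: one contiguous UTF-8 data buffer plus an offsets buffer with n + 1 entries, int32 for StringArray and int64 for LargeStringArray. The validity (null) bitmap is ignored here for brevity.

```python
from __future__ import annotations

import struct


def encode_string_array(strings: list[str], large: bool = False) -> tuple[bytes, bytes]:
    """Build (offsets_buffer, data_buffer) in Arrow's string layout (illustrative helper)."""
    code = "q" if large else "i"  # int64 offsets for LargeStringArray, int32 otherwise
    offsets = [0]
    data = bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))  # each offset marks the end of one string
    return struct.pack(f"<{len(offsets)}{code}", *offsets), bytes(data)


def decode_string_array(offsets_buf: bytes, data_buf: bytes, large: bool = False) -> list[str]:
    """Recover the strings from the two buffers (nulls are not handled)."""
    code, width = ("q", 8) if large else ("i", 4)
    count = len(offsets_buf) // width
    offsets = struct.unpack(f"<{count}{code}", offsets_buf)
    # Element i occupies data_buf[offsets[i]:offsets[i + 1]].
    return [data_buf[start:end].decode("utf-8") for start, end in zip(offsets, offsets[1:])]


offsets, data = encode_string_array(["hello", "", "wörld"])
print(decode_string_array(offsets, data))
```

On the Rust side the data buffer would not need to be copied: each string can be viewed as a slice of the shared buffer, which is where the zero-copy benefit comes from.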

shreya-51 commented 2 months ago

Hello! @WenheLI, are you still working on this?

WenheLI commented 2 months ago

@shreya-51 - Hi! Sorry for the late reply. Yes, I am still working on it.
