Open mariosasko opened 6 months ago
Would you like to open a PR for this? 🤗
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Marking this issue as a good first issue. If it doesn't get addressed after a while, I'll take a stab at it.
Interested in this. Can someone get it assigned to me?
@WenheLI Assigned :)
@mariosasko - Sorry, just saw this! Can you guide me on how to get started? I am still new to this project. Thanks a lot for your help!
Sure! The idea is to use the `arrow` crate (e.g., with `ArrayData.from_pyarrow`) to decode PyArrow `StringArray`/`LargeStringArray`s (when they are given as input to the `Tokenizer`). You can find the relevant code here (maybe this PR can also help, which has done the same thing for NumPy arrays).
To build the project, check this workflow file, in particular the part that installs the dependencies.
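For anyone new to the Arrow memory layout, here is a minimal sketch of what "decoding" a string array involves. It is a pure-Python illustration of the layout, not the Rust implementation discussed above. It assumes the standard Arrow layout: a validity bitmap (least-significant bit first), an offsets buffer (little-endian int32 for `StringArray`, int64 for `LargeStringArray`), and one contiguous UTF-8 data buffer. The function name `decode_string_array` and its parameters are hypothetical.

```python
import struct


def decode_string_array(length, offsets_buf, data_buf, validity_buf=None, large=False):
    """Decode `length` strings from Arrow-style buffers (illustrative only).

    offsets_buf  : length + 1 little-endian offsets (int32, or int64 if large=True)
    data_buf     : all string bytes concatenated, UTF-8 encoded
    validity_buf : optional bitmap; bit i set means element i is not null
    """
    fmt = "q" if large else "i"
    # unpack_from tolerates trailing padding, which Arrow buffers often carry.
    offsets = struct.unpack_from(f"<{length + 1}{fmt}", offsets_buf)
    data = memoryview(data_buf)  # slicing a memoryview does not copy

    out = []
    for i in range(length):
        is_valid = True
        if validity_buf is not None:
            is_valid = bool((validity_buf[i // 8] >> (i % 8)) & 1)
        if not is_valid:
            out.append(None)
            continue
        start, end = offsets[i], offsets[i + 1]
        out.append(data[start:end].tobytes().decode("utf-8"))
    return out


# Hand-built buffers for ["hello", None, "wörld"]
data_buf = "hello".encode("utf-8") + "wörld".encode("utf-8")
validity_buf = bytes([0b00000101])  # elements 0 and 2 are valid

small_offsets = struct.pack("<4i", 0, 5, 5, 11)
decoded_small = decode_string_array(3, small_offsets, data_buf, validity_buf)

large_offsets = struct.pack("<4q", 0, 5, 5, 11)
decoded_large = decode_string_array(3, large_offsets, data_buf, validity_buf, large=True)
```

The only difference between the two array types in this sketch is the offset width, so a real implementation can likely share most of the decoding path between them.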
hello! @WenheLI are you still working on this?
@shreya-51 - Hi! Sorry for the late reply. And yes, I am still working on that
Most data processing libraries (Datasets, Polars, Pandas, DuckDB, etc.) integrate with PyArrow, so it would be nice to have native (zero-copy if possible) support for PyArrow arrays as input. This would avoid the unnecessary PyArrow to Python/NumPy conversion, which is pretty slow for string arrays.
PS: PyArrow has recently added support for the PyCapsule interface, which should help with the implementation.
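As a rough sketch of how the PyCapsule interface could help on the Python side: objects that implement it expose an `__arrow_c_array__` method, so input can be detected by duck typing without importing PyArrow. The helper `supports_arrow_capsule` and the stand-in class below are hypothetical and only show the detection step, not the actual capsule handling.

```python
def supports_arrow_capsule(obj):
    """Return True if obj advertises the Arrow PyCapsule array protocol."""
    return callable(getattr(obj, "__arrow_c_array__", None))


class FakeArrowArray:
    """Stand-in for an Arrow-compatible array (for illustration only)."""

    def __arrow_c_array__(self, requested_schema=None):
        # A real producer returns a (schema capsule, array capsule) pair.
        raise NotImplementedError("stand-in only; no capsules are produced")


arrow_like = supports_arrow_capsule(FakeArrowArray())
plain_list = supports_arrow_capsule(["a", "b"])
```

A check like this would let any Arrow-compatible producer be accepted, not just PyArrow, assuming the native side then consumes the capsules.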