huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Enable Fast Filtering using Arrow Dataset #1949

Open gchhablani opened 3 years ago

gchhablani commented 3 years ago

Hi @lhoestq,

As mentioned in Issue #1796, I would love to work on enabling fast filtering/mapping. Could you please share what you expect from this work? It would be great if you could point me to the relevant methods and files, the docs, or an overview of arrow_dataset.py. I ask only because I am having trouble getting started ;-;

Any help would be appreciated.

Thanks, Gunjan

lhoestq commented 3 years ago

Hi @gchhablani :) Thanks for offering to help!

I'll be refactoring some of the filtering-related parts as part of https://github.com/huggingface/datasets/issues/1877, so I would wait for that refactor to be done before working on the filtering, in particular because I plan to make things simpler to manipulate.

Your feedback on this refactor would also be appreciated, since it also aims to make the core code more accessible (basically, my goal is that no one is ever "having trouble getting started" ^^).

This will be available in a few days. I will be able to give you more details then, if you don't mind waiting a bit!

gchhablani commented 3 years ago

Sure! I don't mind waiting. I'll check the refactor and try to understand what you're trying to do :)