lhoestq opened 1 year ago
Hi, I'm definitely curious how Arrow datasets would perform at large scale (say more than 100M images). Definitely interested in a PR!
Should be possible to test it by downloading laion400m and reading it; it takes only one day.
Btw, do you think webdataset support in hf datasets would also be an option? torchdata natively supports it now
> Should be possible to test it by downloading laion400m and reading it; it takes only one day.
Let's give it a try :)
> Btw, do you think webdataset support in hf datasets would also be an option? torchdata natively supports it now
Yup definitely, opened an issue at https://github.com/huggingface/datasets/issues/5337
Hi! Hugging Face Datasets uses Arrow for image datasets, which can be loaded as map-style datasets or iterable datasets. I ran some benchmarks on ImageNet-1k against webdataset and got the same throughput on my laptop and 10x faster on GCP. It's also pretty easy to preprocess, shuffle, and transform on the fly during training, even multi-node. Let me know if you think it would make sense to have an Arrow export :)