rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

MIT License

3.75k stars, 341 forks

Export as Arrow #234

Open lhoestq opened 1 year ago

lhoestq commented 1 year ago

Hi! Hugging Face Datasets uses Arrow for image datasets, which can be loaded as map-style datasets or iterable datasets. I ran some benchmarks on ImageNet-1k comparing it with webdataset and got the same throughput on my laptop and 10x faster on GCP. It's also pretty easy to preprocess, shuffle, and transform on the fly during training, even on multiple nodes. Let me know if you think it would make sense to have an Arrow export :)
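
For readers unfamiliar with the two loading modes mentioned above, here is a minimal pure-Python sketch of the difference. It does not use Arrow or the Hugging Face Datasets API; `MapStyleDataset` and `shuffled_stream` are hypothetical names introduced only for illustration. The map-style view offers random access with a lazily applied transform, while the iterable view shuffles on the fly with a fixed-size buffer, which is one common way to approximate a shuffle over a stream that cannot be indexed.

```python
import random
from typing import Any, Callable, Iterable, Iterator, List, Sequence


class MapStyleDataset:
    """Random-access view: supports len() and integer indexing."""

    def __init__(self, rows: Sequence[Any],
                 transform: Callable[[Any], Any] = lambda row: row):
        self.rows = rows
        self.transform = transform

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, index: int) -> Any:
        # The transform runs lazily, only when a row is accessed.
        return self.transform(self.rows[index])


def shuffled_stream(rows: Iterable[Any], buffer_size: int,
                    seed: int) -> Iterator[Any]:
    """Iterable view: approximate shuffle using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[Any] = []
    for row in rows:
        if len(buffer) < buffer_size:
            # Fill the buffer before yielding anything.
            buffer.append(row)
            continue
        # Buffer is full: emit a random element and put the new row in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = row
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer
```

A larger `buffer_size` gives a shuffle closer to uniform at the cost of memory; every input row is still yielded exactly once.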

rom1504 commented 1 year ago

Hi, I'm definitely curious how Arrow datasets would perform at large scale (say, more than 100M images). Definitely interested in a PR!

It should be possible to test this by downloading laion400m and reading it; that takes only one day.
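
A read benchmark of this kind boils down to iterating over the dataset and counting samples per second. The helper below is a minimal sketch under that assumption; `measure_throughput` is a hypothetical name, not part of img2dataset, webdataset, or Hugging Face Datasets, and it accepts any iterable so the same function could time either format's reader.

```python
import time
from itertools import islice
from typing import Any, Iterable, Optional, Tuple


def measure_throughput(samples: Iterable[Any],
                       max_samples: Optional[int] = None) -> Tuple[int, float]:
    """Iterate over `samples` and return (count, samples per second).

    If `max_samples` is given, stop after that many samples.
    """
    count = 0
    start = time.perf_counter()
    # islice with None as the stop value iterates over the whole input.
    for _ in islice(samples, max_samples):
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate
```

For a fair comparison, both readers should be timed on the same data, with the same decoding work and the same number of samples.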

Btw, do you think webdataset support in HF Datasets would also be an option? torchdata natively supports it now.

lhoestq commented 1 year ago

> Hi, I'm definitely curious how Arrow datasets would perform at large scale (say, more than 100M images). Definitely interested in a PR!
>
> It should be possible to test this by downloading laion400m and reading it; that takes only one day.

Let's give it a try :)

> Btw, do you think webdataset support in HF Datasets would also be an option? torchdata natively supports it now.

Yup, definitely. I opened an issue at https://github.com/huggingface/datasets/issues/5337