huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add Hateful Memes Dataset #1810

Open gchhablani opened 3 years ago

gchhablani commented 3 years ago

Add Hateful Memes Dataset

I will be adding this dataset. It requires the user to sign an agreement on DrivenData. So, it will be used with a manual download.

The issue with this dataset is that the images are of different sizes. The image datasets added so far (CIFAR-10 and MNIST) have a uniform shape throughout. So something like

 datasets.Array2D(shape=(28, 28), dtype="uint8")

won't work for the images. How would I add image features then? I checked datasets/features.py but couldn't figure out the appropriate class for this. I'm assuming I would want to avoid resizing altogether, since we want the user to be able to access the original images.

Also, since the full data is around 8.8 GB, how would I load only a subset of it?

Thanks, Gunjan

gchhablani commented 3 years ago

I am not sure, but would datasets.Sequence(datasets.Sequence(datasets.Sequence(datasets.Value("uint8")))) work?
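
To make the nested-sequence idea concrete, here is a minimal standard-library sketch of what such a feature would have to hold: each image as a list of rows, each row a list of pixels, each pixel a list of channel values. The helpers `make_image` and `image_shape` are hypothetical names used only for illustration, and the sketch says nothing about how efficiently `datasets.Sequence` would store this layout.

```python
def make_image(height, width, channels=3, value=0):
    """Build a height x width x channels image as plain nested lists."""
    return [[[value] * channels for _ in range(width)] for _ in range(height)]


def image_shape(pixels):
    """Return (height, width, channels) of a nested-list image."""
    return (len(pixels), len(pixels[0]), len(pixels[0][0]))


# Two images of different sizes can sit side by side in one column,
# because nested lists carry no fixed shape (unlike Array2D).
images = [make_image(2, 3), make_image(4, 5)]
shapes = [image_shape(img) for img in images]
```

Note that this layout holds fully decoded pixel values, so it is far larger than the compressed JPEG or PNG files on disk.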

gchhablani commented 3 years ago

Also, I found the information for loading only subsets of the data here.

gchhablani commented 3 years ago

Hi @lhoestq,

Could you please take a look at this?

Thanks, Gunjan

lhoestq commented 3 years ago

Hi @gchhablani, since Array2D doesn't support images of different sizes, I would suggest storing the paths to the image files in the dataset instead of the image data. This has the advantage of not decompressing the data (images are often compressed using JPEG, PNG, etc.). Users can still apply .map to load the images if they want to, though the loaded images would end up as Sequence features.
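
The path-based approach can be sketched with the standard library alone. This is only an illustration under stated assumptions: the record key `img_path`, the helpers `build_records` and `load_image_bytes`, and the fake file names are all hypothetical. In `datasets`, the path column would presumably be a `datasets.Value("string")` feature and the loading helper would be passed to `.map`.

```python
import os
import tempfile


def build_records(image_dir):
    """Index a directory: one record per file, storing only its path."""
    return [
        {"img_path": os.path.join(image_dir, name)}
        for name in sorted(os.listdir(image_dir))
    ]


def load_image_bytes(record):
    """Read the (still compressed) file contents on demand."""
    with open(record["img_path"], "rb") as f:
        return {**record, "img_bytes": f.read()}


# Demo: two fake "images" with different payload sizes (hypothetical data).
tmp_dir = tempfile.mkdtemp()
fake_files = {"a.png": b"\x89PNG" + b"\x00" * 10, "b.png": b"\x89PNG" + b"\x00" * 50}
for name, payload in fake_files.items():
    with open(os.path.join(tmp_dir, name), "wb") as f:
        f.write(payload)

records = build_records(tmp_dir)                  # cheap: paths only
loaded = [load_image_bytes(r) for r in records]   # opt-in: read file contents
sizes = [len(r["img_bytes"]) for r in loaded]
```

The index stays small regardless of image dimensions, and the cost of reading (and later decoding) an image is paid only by users who ask for it.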

In the future we'll add support for ragged tensors for this case and update the relevant dataset with this feature.