huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.72k stars, 2.58k forks

[Dataset requests] New datasets for Text Classification #353

Open thomwolf opened 3 years ago

thomwolf commented 3 years ago

We are missing a few datasets for Text Classification, which is an important field.

Namely, it would be really nice to add:

All these datasets are cited in https://arxiv.org/abs/2004.03705

thomwolf commented 3 years ago

Pinging @mariamabarham as well

jxmorris12 commented 3 years ago

I'd also like to see:

mariamabarham commented 3 years ago

Thanks @jxmorris12 for pointing this out.

In GLUE we only have SST-2; maybe we can add SST-1 separately.

jxmorris12 commented 3 years ago

This is the homepage for the Amazon dataset: https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products

Is there an easy way to download Kaggle datasets programmatically? If so, I can add this one!

mariamabarham commented 3 years ago

Hi @jxmorris12, for now I think our dl_manager does not download from Kaggle. @thomwolf, @lhoestq

ghomasHudson commented 3 years ago

Pretty sure the Quora dataset is the same one I implemented here: https://github.com/huggingface/nlp/pull/366

moscow25 commented 3 years ago

Great list. Any idea if Amazon Reviews has been added?

Apologies if it's already been included (it would be great to see where). If not, it's one of the better medium/large NLP datasets for semi-supervised learning, albeit a bit out of date.

Thanks!!

cc @sshleifer

moscow25 commented 3 years ago

On the Amazon Reviews dataset, the original UCSD website notes that the data has been updated to include product reviews through 2018, which is actually quite recent compared to many other datasets. Almost certainly the largest NLP dataset out there with labels! https://jmcauley.ucsd.edu/data/amazon/

Any chance someone has time to onboard this dataset in a HF way?

cc @sshleifer

johneckberg commented 4 months ago

@albertvillanova How up to date is this issue? I see that some of these datasets are now on Hugging Face but have not been checked off the list.