huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

The rotten_tomatoes dataset of movie reviews contains some reviews in Spanish #3475

Open puzzler10 opened 2 years ago

puzzler10 commented 2 years ago

Describe the bug

See title. I don't think this is intentional, and they should probably be removed. If they stay, the dataset description should at least be updated to make this clear to users.

Steps to reproduce the bug

Go to the dataset viewer for the dataset, set the offset to 4160 for the train split, and scroll through the results. I found non-English reviews at indices 4166 and 4173. There are others too (e.g. index 2888), but those two are easy to find this way.

Expected results

English movie reviews only.

Actual results

Example of a Spanish movie review (4173):

"É uma pena que , mais tarde , o próprio filme abandone o tom de paródia e passe a utilizar os mesmos clichês que havia satirizado "
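Rather than scrolling the viewer, other such rows could be surfaced with a rough stopword heuristic. The following is only a minimal sketch: the word list, the `min_hits` threshold, and the helper names are hypothetical, and it assumes the dataset loads as `load_dataset("rotten_tomatoes", split="train")` with a `text` column. It will miss some rows and may flag a few English ones, so the hits still need a manual check.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical, hand-picked function words that are common in Portuguese or
# Spanish and rare in English. Short ambiguous words ("a", "o", "de") are
# deliberately left out to limit false positives.
ROMANCE_HINTS = {
    "que", "uma", "os", "mais", "não", "é", "em", "filme",
    "los", "las", "una", "pero", "muy", "película",
}


def romance_hint_count(text: str) -> int:
    """Count tokens in `text` that appear in ROMANCE_HINTS."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented words stay whole.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for token in tokens if token in ROMANCE_HINTS)


def flag_non_english(texts: Iterable[str], min_hits: int = 3) -> List[Tuple[int, str]]:
    """Return (index, text) pairs whose hint count reaches `min_hits`."""
    return [
        (index, text)
        for index, text in enumerate(texts)
        if romance_hint_count(text) >= min_hits
    ]


def scan_train_split(min_hits: int = 3) -> List[Tuple[int, str]]:
    """Apply the heuristic to the train split (needs `datasets` and network access)."""
    # Assumption: the dataset id and the "text" column match what the viewer shows.
    from datasets import load_dataset

    train = load_dataset("rotten_tomatoes", split="train")
    return flag_non_english(train["text"], min_hits=min_hits)
```

Calling `scan_train_split()` returns candidate rows with their indices, which can then be compared against the ones listed above (4166, 4173, 2888).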

albertvillanova commented 2 years ago

Hi @puzzler10, thanks for reporting.

Please note that this dataset is not hosted on the Hugging Face Hub. See: https://github.com/huggingface/datasets/blob/c8f914473b041833fd47178fa4373cdcb56ac522/datasets/rotten_tomatoes/rotten_tomatoes.py#L42

If there are issues with the source data of a dataset, you should contact the data owners/creators instead. On the homepage associated with this dataset (http://www.cs.cornell.edu/people/pabo/movie-review-data/), you can find the authors of the dataset and how to contact them:

If you have any questions or comments regarding this site, please send email to Bo Pang or Lillian Lee.

P.S.: Please also note that the example you gave of a non-English review is in Portuguese (not Spanish). ;)

puzzler10 commented 2 years ago

Maybe it would be best to just add a quick sentence to the dataset description that highlights this?