Open puzzler10 opened 2 years ago
Hi @puzzler10, thanks for reporting.
Please note that this dataset is not hosted on the Hugging Face Hub. See: https://github.com/huggingface/datasets/blob/c8f914473b041833fd47178fa4373cdcb56ac522/datasets/rotten_tomatoes/rotten_tomatoes.py#L42
If there are issues with the source data of a dataset, you should contact the data owners/creators instead. On the homepage associated with this dataset (http://www.cs.cornell.edu/people/pabo/movie-review-data/), you can find the authors of the dataset and how to contact them:
If you have any questions or comments regarding this site, please send email to Bo Pang or Lillian Lee.
P.S.: Please also note that the example you gave of a non-English review is in Portuguese (not Spanish). ;)
Maybe best to just put a quick sentence in the dataset description that highlights this?
Describe the bug
See title. I don't think this is intentional, and they should probably be removed. If they stay, the dataset description should at least be updated to make this clear to the user.
Steps to reproduce the bug
Go to the dataset viewer for the dataset, set the offset to 4160 for the train split, and scroll through the results. I found non-English reviews at index 4166 and 4173. There are others too (e.g. index 2888), but those two are easy to find this way.
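The same rows can probably also be pulled up without the viewer. Below is a minimal sketch, assuming the `datasets` library is installed, that the dataset loads under the name `rotten_tomatoes`, and that the review text lives in a `text` column. `show_rows` and `load_train_split` are hypothetical helper names for illustration, not part of the library.

```python
def show_rows(rows, indices, column="text"):
    """Return (index, text) pairs for the requested row indices.

    `rows` can be any indexable sequence of dict-like records, for example
    a loaded dataset split. The column name "text" is an assumption.
    """
    return [(i, rows[i][column]) for i in indices]


def load_train_split():
    """Load the train split. Needs the `datasets` package and network access."""
    from datasets import load_dataset

    return load_dataset("rotten_tomatoes", split="train")


# Usage (left commented out because it downloads the data):
# train = load_train_split()
# for idx, text in show_rows(train, [2888, 4166, 4173]):
#     print(idx, text)
```

The indices in the usage comment are the ones mentioned in the steps above. They may shift if the source data or the split is ever regenerated.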
Expected results
English movie reviews only.
Actual results
Example of a Spanish movie review (4173):