huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Cannot load ‘bookcorpusopen’ #3561

Closed: HUIYINXUE closed this issue 2 years ago

HUIYINXUE commented 2 years ago

Describe the bug

Cannot load 'bookcorpusopen'

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset('bookcorpusopen')

or

dataset = load_dataset('bookcorpusopen', script_version='master')

Actual results

ConnectionError: Couldn't reach https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz
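For anyone hitting the same error, here is a minimal sketch of catching it so a script can report the failure instead of crashing. It assumes the error raised is Python's built-in `ConnectionError`, as the traceback above suggests. The helper name `load_or_report` and its `loader` parameter are hypothetical and not part of the `datasets` API.

```python
def load_or_report(name, loader=None):
    """Try to load a dataset; return None if the download host is unreachable.

    `loader` is a hypothetical injection point for testing. By default it is
    `datasets.load_dataset`, imported lazily.
    """
    if loader is None:
        from datasets import load_dataset  # requires the `datasets` package
        loader = load_dataset
    try:
        return loader(name)
    except ConnectionError as err:
        # Assumption: the library surfaces unreachable hosts as the built-in
        # ConnectionError, matching the message shown in the traceback.
        print(f"Could not download '{name}': {err}")
        return None
```

Calling `load_or_report('bookcorpusopen')` would then return `None` and print the unreachable URL instead of raising.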

Environment info

samjgorman commented 2 years ago

The host of this copy of the dataset (https://the-eye.eu) is down and has been for some time, potentially months.
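A quick way to confirm that the host, rather than the library, is at fault is to request the URL directly. Below is a rough sketch using only the standard library. The function name `is_reachable` and its `opener` parameter are illustrative assumptions, and some servers reject HEAD requests, so a `False` result is a hint rather than proof.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if a HEAD request to `url` gets a non-error response.

    `opener` is a hypothetical hook so the check can be exercised offline.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False
```

Passing the URL from the traceback, `https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz`, shows whether the download host responds at all.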

Finding this dataset is not straightforward, as the original authors took down the official BookCorpus dataset some time ago.

There are community-created versions of BookCorpus, such as the files hosted at https://battle.shawwn.com/sdb/bookcorpus/

And more discussion here: https://github.com/soskek/bookcorpus

Do we want to remove this dataset entirely? There's a fair argument for this, given that the official BookCorpus dataset was taken down by the authors. If not, I can perhaps open a PR with a link to the community-created tar above and an updated dataset description.

mariosasko commented 2 years ago

Hi! The bookcorpusopen dataset is not working for the same reason as explained in this comment: https://github.com/huggingface/datasets/issues/3504#issuecomment-1004564980

albertvillanova commented 2 years ago

Hi @HUIYINXUE, it should work now that the data owners created a mirror server with all data, and we updated the URL in our library.