huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.25k stars 2.69k forks

"Checksums didn't match for dataset source files" error while loading openwebtext dataset #726

Closed SparkJiao closed 2 years ago

SparkJiao commented 4 years ago

Hi, I have encountered this problem during loading the openwebtext dataset:

>>> dataset = load_dataset('openwebtext')
Downloading and preparing dataset openwebtext/plain_text (download: 12.00 GiB, generated: 37.04 GiB, post-processed: Unknown size, total: 49.03 GiB) to /home/admin/.cache/huggingface/datasets/openwebtext/plain_text/1.0.0/5c636399c7155da97c982d0d70ecdce30fbca66a4eb4fc768ad91f8331edac02...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/load.py", line 611, in load_dataset
    ignore_verifications=ignore_verifications,
  File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/builder.py", line 476, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/builder.py", line 536, in _download_and_prepare
    self.info.download_checksums, dl_manager.get_recorded_sizes_checksums(), "dataset source files"
  File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/utils/info_utils.py", line 39, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://zenodo.org/record/3834942/files/openwebtext.tar.xz']

I think this problem is caused by a change in the released dataset. Or should I download the dataset manually?

Sorry for releasing the unfinished issue by mistake.
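For readers hitting this for the first time: the check that raises here compares the SHA-256 checksum recorded in the dataset's metadata against the one computed for the freshly downloaded archive, so if the file is re-uploaded upstream the recorded value goes stale. A minimal sketch of that comparison (simplified, not the library's actual code):

```python
import hashlib

class NonMatchingChecksumError(Exception):
    pass

def sha256_of(path):
    """Compute a file's SHA-256 hex digest with a chunked read (sketch of how
    downloads are hashed without loading the whole archive into memory)."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest()

def verify_checksums(expected, recorded):
    """Collect every URL whose recorded checksum differs from the expected one,
    then raise -- mirroring the error message seen in the traceback above."""
    bad_urls = [url for url, sha in expected.items() if recorded.get(url) != sha]
    if bad_urls:
        raise NonMatchingChecksumError(
            "Checksums didn't match for dataset source files:\n" + str(bad_urls))
```

So a mismatch does not necessarily mean a corrupted download; it can equally mean the hosted archive changed after the checksum was recorded.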

thomwolf commented 4 years ago

Hi, try to provide more information please.

Example code in a Colab to reproduce the error, details on what you are trying to do and what you expected, and details on your environment (OS, PyPI package versions).

SparkJiao commented 4 years ago

> Hi, try to provide more information please.
>
> Example code in a Colab to reproduce the error, details on what you are trying to do and what you expected, and details on your environment (OS, PyPI package versions).

I have updated the description; sorry for posting the incomplete issue by mistake.

SparkJiao commented 4 years ago

Hi, I have manually downloaded the compressed dataset `openwebtext.tar.xz` and used the following command to preprocess the examples:

>>> dataset = load_dataset('/home/admin/workspace/datasets/datasets-master/datasets-master/datasets/openwebtext', data_dir='/home/admin/workspace/datasets')
Using custom data configuration default
Downloading and preparing dataset openwebtext/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/admin/.cache/huggingface/datasets/openwebtext/default/0.0.0/5c636399c7155da97c982d0d70ecdce30fbca66a4eb4fc768ad91f8331edac02...
Dataset openwebtext downloaded and prepared to /home/admin/.cache/huggingface/datasets/openwebtext/default/0.0.0/5c636399c7155da97c982d0d70ecdce30fbca66a4eb4fc768ad91f8331edac02. Subsequent calls will reuse this data.
>>> len(dataset['train'])
74571
>>>

The size of the pre-processed example file is only 354 MB, while the processed bookcorpus dataset is 4.6 GB. Is something wrong here?

Muhammadharun786 commented 3 years ago

NonMatchingChecksumError: Checksums didn't match for dataset source files:

I got this issue when trying to work with my own dataset. Could you tell me where I can get the checksums of the train and dev files in my GitHub repo?
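For a custom dataset, the checksums the library verifies live in a `dataset_infos.json` generated from the loading script (e.g. with `datasets-cli test <path-to-script> --save_infos`); each source URL maps to a file size and a SHA-256 digest. A hedged sketch of computing that per-file record yourself for local train/dev files (`file_info` is a hypothetical helper, and the file names below are placeholders):

```python
import hashlib
import os

def file_info(path):
    """Return a {num_bytes, checksum} record for one data file, where checksum
    is the SHA-256 hex digest of the file's contents (a sketch of the shape
    stored per URL in dataset_infos.json)."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return {"num_bytes": os.path.getsize(path), "checksum": sha.hexdigest()}

# Example usage with placeholder file names for your own splits:
# checksums = {name: file_info(name) for name in ["train.json", "dev.json"]}
```

If the files hosted in the repo change later, these recorded digests must be regenerated, otherwise loading fails with the same `NonMatchingChecksumError`.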

tanvidadu commented 3 years ago

Hi, I got a similar issue for the xnli dataset while working on Colab with Python 3.7.

nlp.load_dataset(path = 'xnli')

The above command resulted in the following error:

NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip']

Any idea how to fix this?

RylanSchaeffer commented 3 years ago

Did anyone figure out how to fix this error?

albertvillanova commented 2 years ago

Fixed by:

OtwellResearch commented 2 years ago

Says fixed but I'm still getting it.

command:

dataset = load_dataset("ted_talks_iwslt", language_pair=("en", "es"), year="2014", download_mode="force_redownload")

got:

Using custom data configuration en_es_2014-35a2d3350a0f9823
Downloading and preparing dataset ted_talks_iwslt/en_es_2014 (download: 2.15 KiB, generated: Unknown size, post-processed: Unknown size, total: 2.15 KiB) to /home/ken/.cache/huggingface/datasets/ted_talks_iwslt/en_es_2014-35a2d3350a0f9823/1.1.0/43935b3fe470c753a023642e1f54b068c590847f9928bd3f2ec99f15702ad6a6...
Downloading: 2.21k/? [00:00<00:00, 141kB/s]

NonMatchingChecksumError: Checksums didn't match for dataset source files: ['https://drive.google.com/u/0/uc?id=1Cz1Un9p8Xn9IpEMMrg2kXSDt0dnjxc4z&export=download']