huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Error when downloading a large dataset on slow connection. #1706

Open lucadiliello opened 3 years ago

lucadiliello commented 3 years ago

I receive the following error after about an hour trying to download the openwebtext dataset.

The code used is:

```python
import datasets

datasets.load_dataset("openwebtext")
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/site-packages/datasets/load.py", line 610, in load_dataset
    ignore_verifications=ignore_verifications,
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/site-packages/datasets/builder.py", line 515, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/site-packages/datasets/builder.py", line 570, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/lucadiliello/.cache/huggingface/modules/datasets_modules/datasets/openwebtext/5c636399c7155da97c982d0d70ecdce30fbca66a4eb4fc768ad91f8331edac02/openwebtext.py", line 62, in _split_generators
    dl_dir = dl_manager.download_and_extract(_URL)
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/site-packages/datasets/utils/download_manager.py", line 254, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/site-packages/datasets/utils/download_manager.py", line 235, in extract
    num_proc=num_proc,
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/site-packages/datasets/utils/py_utils.py", line 225, in map_nested
    return function(data_struct)
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/site-packages/datasets/utils/file_utils.py", line 343, in cached_path
    tar_file.extractall(output_path_extracted)
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/tarfile.py", line 2000, in extractall
    numeric_owner=numeric_owner)
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/tarfile.py", line 2042, in extract
    numeric_owner=numeric_owner)
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/tarfile.py", line 2112, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/tarfile.py", line 2161, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/tarfile.py", line 253, in copyfileobj
    buf = src.read(remainder)
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/lzma.py", line 200, in read
    return self._buffer.read(size)
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/lucadiliello/anaconda3/envs/nlp/lib/python3.7/_compression.py", line 99, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
```
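The final `EOFError` means the cached `.tar.xz` archive is truncated: the LZMA stream ends before its end-of-stream marker, which is what happens when a download is cut off partway. A minimal sketch for confirming this yourself (the cache path below is a hypothetical placeholder; the real file name in `~/.cache/huggingface/datasets/downloads/` is a content hash):

```python
import lzma


def archive_is_complete(path: str) -> bool:
    """Read an .xz archive's LZMA stream to the end.

    A truncated download raises EOFError (as in the traceback above);
    a corrupted one may raise LZMAError. Either way the file is unusable.
    """
    try:
        with lzma.open(path, "rb") as f:
            # Stream through the whole archive without keeping it in memory.
            while f.read(1024 * 1024):
                pass
        return True
    except (EOFError, lzma.LZMAError):
        return False


# Hypothetical cache path -- substitute the actual hashed file name:
# archive_is_complete("/home/lucadiliello/.cache/huggingface/datasets/downloads/<hash>")
```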

lhoestq commented 3 years ago

Hi! Is this an issue you have with openwebtext specifically, or also with other datasets?

It looks like the downloaded file is corrupted and can't be extracted using tarfile. Could you try loading it again with

```python
import datasets

datasets.load_dataset("openwebtext", download_mode="force_redownload")
```
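Note that `force_redownload` discards the partial file and starts over, which is painful on a slow or flaky connection. A hedged sketch of an alternative, not part of the `datasets` API: fetch the archive yourself with an HTTP `Range` request so an interrupted download can resume where it left off (assumes the server honors `Range`; the URL and destination below are placeholders):

```python
import os
import urllib.request


def download_with_resume(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Download url to dest, resuming from a partial file if one exists.

    If the server ignores the Range header (responds 200 instead of 206),
    we fall back to restarting the download from scratch.
    """
    done = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url)
    if done:
        req.add_header("Range", f"bytes={done}-")
    with urllib.request.urlopen(req, timeout=60) as resp:
        if resp.status == 200:  # Range not honored: start over
            done = 0
        with open(dest, "ab" if done else "wb") as f:
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    break
                f.write(chunk)


# Placeholder URL -- rerun this after each interruption until it completes:
# download_with_resume("https://example.org/openwebtext.tar.xz", "openwebtext.tar.xz")
```

Once the archive is complete, it can be extracted manually; the per-chunk loop keeps memory use bounded regardless of archive size.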