huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

NonMatchingChecksumError for downloading conll2012_ontonotesv5 dataset #4080

Closed · richarddwang closed this issue 2 years ago

richarddwang commented 2 years ago

Steps to reproduce the bug

import datasets

datasets.load_dataset("conll2012_ontonotesv5", "english_v12")

Actual results

Downloading builder script: 32.2kB [00:00, 9.72MB/s]                                                                                          
Downloading metadata: 20.0kB [00:00, 10.4MB/s]                                                                                                
Downloading and preparing dataset conll2012_ontonotesv5/english_v12 (download: 174.83 MiB, generated: 204.29 MiB, post-processed: Unknown size, total: 379.12 MiB) to ...
Traceback (most recent call last):
  File "/home/yisiang/lgtn/conll2012/run.py", line 86, in <module>
    train()
  File "/home/yisiang/lgtn/conll2012/run.py", line 65, in train
    trainer.fit(model, datamodule=dm)
  File "/home/yisiang/miniconda3/envs/ai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/yisiang/miniconda3/envs/ai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/yisiang/miniconda3/envs/ai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/yisiang/miniconda3/envs/ai/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1131, in _run
    self._data_connector.prepare_data()
  File "/home/yisiang/miniconda3/envs/ai/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 154, in prepare_data
    self.trainer.datamodule.prepare_data()
  File "/home/yisiang/miniconda3/envs/ai/lib/python3.9/site-packages/pytorch_lightning/core/datamodule.py", line 474, in wrapped_fn
    fn(*args, **kwargs)
  File "/home/yisiang/lgtn/_abstract_task/data.py", line 43, in prepare_data
    raw_dsets = datasets.load_dataset(**load_dataset_kwargs)
  File "/home/yisiang/miniconda3/envs/ai/lib/python3.9/site-packages/datasets/load.py", line 1687, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/yisiang/miniconda3/envs/ai/lib/python3.9/site-packages/datasets/builder.py", line 605, in download_and_prepare
    self._download_and_prepare(
  File "/home/yisiang/miniconda3/envs/ai/lib/python3.9/site-packages/datasets/builder.py", line 1104, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/home/yisiang/miniconda3/envs/ai/lib/python3.9/site-packages/datasets/builder.py", line 676, in _download_and_prepare
    verify_checksums(
  File "/home/yisiang/miniconda3/envs/ai/lib/python3.9/site-packages/datasets/utils/info_utils.py", line 40, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://md-datasets-cache-zipfiles-prod.s3.eu-west-1.amazonaws.com/zmycy7t9h9-1.zip']
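Until a fixed loading script lands, the verification step itself can be bypassed; a minimal workaround sketch, assuming the datasets 2.x API in which `ignore_verifications=True` disables the `verify_checksums()` call that raises the error above (this only silences the check, it does not guarantee the downloaded data matches what the script expects):

import datasets

# Workaround sketch, not a fix: re-download the archive and skip the
# checksum comparison that raises NonMatchingChecksumError.
dataset = datasets.load_dataset(
    "conll2012_ontonotesv5",
    "english_v12",
    download_mode="force_redownload",  # discard the mismatching cached archive
    ignore_verifications=True,         # skip verify_checksums()
)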

Environment info

albertvillanova commented 2 years ago

Hi @richarddwang,

Indeed, we have recently updated the loading script of that dataset and fixed that bug as well.

That fix will be available in our next datasets library release. In the meantime, you can incorporate the fix by loading the dataset from the updated loading script, for example as sketched below.
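A minimal sketch of that, assuming the fixed script lives on the master branch (the `revision` argument of `load_dataset` selects which version of the loading script is downloaded, and `download_mode="force_redownload"` discards the stale cached archive):

import datasets

dataset = datasets.load_dataset(
    "conll2012_ontonotesv5",
    "english_v12",
    revision="master",                 # assumed: the branch containing the fixed script
    download_mode="force_redownload",  # discard the previously cached archive
)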

Feel free to re-open this issue if the problem persists.

Duplicate of: