pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Getting error while downloading unsupervised learning dataset: EnWik9 #894

Closed vaibhav-malpani closed 11 months ago

vaibhav-malpani commented 4 years ago

🐛 Bug

**Describe the bug**

Whenever I try to download the unsupervised learning dataset EnWik9, I get the error shown below. I tried three times and it failed with the same error each time.


AssertionError                            Traceback (most recent call last)

in
      6 if not os.path.isdir('./.data'):
      7     os.mkdir('./.data')
----> 8 train_dataset, test_dataset = EnWik9(num_lines=20000)(
      9     root='./.data', ngrams=NGRAMS, vocab=None)
     10 BATCH_SIZE = 16

c:\users\vaibhav\appdata\local\programs\python\python37\lib\site-packages\torchtext\datasets\unsupervised_learning.py in __init__(self, begin_line, num_lines, root)
    106             path=os.path.join(root, 'enwik9.zip'),
    107             root=root)
--> 108         extracted_file = extract_archive(dataset_zip)
    109         raw_file = os.path.join(root, extracted_file[0])
    110         preprocess_raw_enwik9(raw_file, processed_file)

c:\users\vaibhav\appdata\local\programs\python\python37\lib\site-packages\torchtext\utils.py in extract_archive(from_path, to_path, overwrite)
    188
    189     elif from_path.endswith('.zip'):
--> 190         assert zipfile.is_zipfile(from_path), from_path
    191         logging.info('Opening zip file {}.'.format(from_path))
    192         with zipfile.ZipFile(from_path, 'r') as zfile:

AssertionError: .data\enwik9.zip

**To Reproduce**

Steps to reproduce the behavior:

    from torchtext.datasets import EnWik9
    enwik9 = EnWik9(num_lines=20000)
    vocab = enwik9.get_vocab()

**Expected behavior**

The dataset should have downloaded successfully.

**Environment**

- PyTorch Version (e.g., 1.4):
- OS (e.g., windows):
- How you installed PyTorch (`pip`):
- Build command you used (if compiling from source):
- Python version: 3.7.0
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:

zhangguanheng66 commented 4 years ago

I cannot reproduce the error on my side. It seems to be related to the `extract_archive` function. Could you try this code and see whether it unzips the file successfully?

from torchtext.utils import extract_archive
extracted_file = extract_archive('.data/enwik9.zip')
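
If that call raises the same assertion, it may help to look at the downloaded file itself. The assertion in the traceback fires when `zipfile.is_zipfile` returns `False`, so one possible cause is an incomplete download or a saved file that is not a zip at all. The sketch below uses only the standard library; `inspect_archive` is a hypothetical helper written for illustration (it is not part of torchtext), and the `.data/enwik9.zip` path is taken from the traceback above.

```python
import os
import zipfile


def inspect_archive(path):
    """Hypothetical helper: report basic facts about a downloaded archive.

    Returns a dict with whether the file exists, its size in bytes,
    whether zipfile considers it a valid zip, and its first 16 bytes.
    """
    if not os.path.isfile(path):
        return {"exists": False, "size": 0, "is_zip": False, "head": b""}
    with open(path, "rb") as f:
        head = f.read(16)
    return {
        "exists": True,
        "size": os.path.getsize(path),
        # Same check that extract_archive asserts on in the traceback.
        "is_zip": zipfile.is_zipfile(path),
        "head": head,
    }


if __name__ == "__main__":
    # Path assumed from the traceback; adjust if a different root was used.
    info = inspect_archive(os.path.join(".data", "enwik9.zip"))
    print(info)
    if info["exists"] and not info["is_zip"]:
        print("File is not a valid zip; consider deleting it and retrying the download.")
```

A very small size, or leading bytes that look like HTML instead of the usual `PK` zip signature, would suggest the download did not complete correctly. In that case, deleting the file and downloading again is a reasonable next step.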