pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

torchtext.datasets - requests.exceptions.ConnectionError #2196

Open afurkank opened 1 year ago

afurkank commented 1 year ago

🐛 Bug

Description of the bug

When I try to use the Multi30k dataset, I get this error:

requests.exceptions.ConnectionError:
This exception is thrown by __iter__ of HTTPReaderIterDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)

To Reproduce

from torchtext.datasets import Multi30k

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))

next(iter(train_iter))

Expected behavior

Return a proper iterable where I can iterate over the dataset.

Environment

PyTorch version: 1.13.1+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.9 | packaged by Anaconda, Inc. | (main, Mar 1 2023, 18:18:15) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22621-SP0
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: GeForce GTX 1650
Nvidia driver version: 442.23
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=2592
DeviceID=CPU0
Family=198
L2CacheSize=1536
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2592
Name=Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] numpydoc==1.5.0
[pip3] torch==1.13.1
[pip3] torchdata==0.5.1
[pip3] torchtext==0.14.1
[conda] Could not collect

Additional context

I've been running into issues with the Multi30k dataset for some time now. The earlier issue was resolved by installing the specific combination of torch library versions listed above. However, even that workaround no longer works. Can you please fix what's broken with this cursed dataset?

Thank you.

afurkank commented 1 year ago

I also tried this:

from torchtext.datasets import Multi30k
from torch.utils.data import DataLoader

datapipe = Multi30k(split='train', language_pair=('de', 'en'))

loader = DataLoader(datapipe, drop_last=True, shuffle=False)

next(iter(loader))

Now I get a different error:

Exception: Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.
This exception is thrown by __iter__ of HTTPReaderIterDataPipe(source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)

The environment is the same. The same error occurs with DataLoader2 as well.

Yancy456 commented 7 months ago

Network error. Check your Internet settings.
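One way to tell a local network problem from an unreachable download host is to probe the URL from the error message directly, outside of torchtext. The snippet below is only a rough sketch using the standard library: `check_url`, its `opener` parameter, and the `MULTI30K_TRAIN_URL` constant are hypothetical names made up for this example and are not part of torchtext or torchdata. It does not reproduce what `HTTPReaderIterDataPipe` does internally.

```python
import urllib.error
import urllib.request

# URL copied from the error message earlier in this thread.
MULTI30K_TRAIN_URL = "http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz"


def check_url(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return a short status string describing whether `url` is reachable.

    `opener` is a hypothetical hook so the function can be exercised offline;
    by default it is the standard library's urlopen.
    """
    # HEAD avoids downloading the archive itself.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return f"reachable (HTTP {response.status})"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return f"server responded with HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout, and similar.
        return f"unreachable ({err})"


if __name__ == "__main__":
    print(check_url(MULTI30K_TRAIN_URL))
```

If this reports the host as unreachable while other sites load fine, the problem is more likely on the server side than in the local Internet settings.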