pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

404 Client Error in IWSLT2017 and IWSLT2016 #2189

Closed: nonconvexopt closed this issue 6 months ago

nonconvexopt commented 1 year ago

🐛 Bug

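from typing import Iterable, Iterator, List

from torchtext.datasets import IWSLT2017
from torchtext.vocab import build_vocab_from_iterator

special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

# token_transform is the tokenizer callable; it is defined elsewhere and not shown in this report.
def yield_tokens(data_iter: Iterable) -> Iterator[List[str]]:
    for data_sample in data_iter:
        yield token_transform(data_sample)

# The dataset is downloaded lazily, so the 404 surfaces once the iterator is consumed.
train_iter = IWSLT2017(split="train")
build_vocab_from_iterator(yield_tokens(train_iter), min_freq=1, specials=special_symbols, special_first=True)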

Running the code above raises the following error:

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://drive.google.com/uc?id=12ycYSzLIG253AFN35Y6qoyf9wtkOjakp

It looks like the download link for this dataset is broken.
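One way to check this independently of torchtext is to request the URL from the traceback directly and look at the status code. The following is a minimal sketch using only the standard library. The helper name `probe_status` and its `opener` parameter are hypothetical, introduced here for illustration. It also assumes that a HEAD request is answered the same way as the GET request torchtext issues, which may not hold for Google Drive.

```python
import urllib.error
import urllib.request

# URL copied verbatim from the traceback in this report.
IWSLT_URL = "https://drive.google.com/uc?id=12ycYSzLIG253AFN35Y6qoyf9wtkOjakp"


def probe_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived.

    `opener` is a hypothetical injection point so the function can be
    exercised without network access; it defaults to urllib's urlopen.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code of the failed response (e.g. 404).
        return err.code
    except urllib.error.URLError:
        # DNS failure, refused connection, timeout and similar.
        return None


# Example call (needs network access):
# probe_status(IWSLT_URL)
```

If this returns 404 as well, the failure is on the hosting side rather than in the local torchtext setup or cache.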

rsonthal commented 8 months ago

I just encountered this error on Google Colab.

!pip install 'portalocker'

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

import torchtext

dataset = torchtext.datasets.IWSLT2017(root='.data', split='train', language_pair=('de', 'en'))

next(iter(dataset))

HTTPError: 404 Client Error: Not Found for url: https://drive.google.com/uc?id=12ycYSzLIG253AFN35Y6qoyf9wtkOjakp
This exception is thrown by __iter__ of GDriveReaderDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)