pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License
3.52k stars, 811 forks

wikitext-2 is not available anymore #2247

Open huangjia2019 opened 8 months ago

huangjia2019 commented 8 months ago

🐛 Bug

Describe the bug

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
This exception is thrown by __iter__ of HTTPReaderIterDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)

To Reproduce

Steps to reproduce the behavior:

from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader, Dataset

tokenizer = get_tokenizer("basic_english")

train_iter = WikiText2(split='train')
valid_iter = WikiText2(split='valid')

def yield_tokens(data_iter):
    for item in data_iter:
        yield tokenizer(item)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["", "", ""])
vocab.set_default_index(vocab[""])


Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip


leedrake5 commented 5 months ago

Is there an alternate link we can get? The documentation here says:

import os
from functools import partial
from typing import Union, Tuple

from torchtext._internal.module_utils import is_module_available
from torchtext.data.datasets_utils import (
    _wrap_split_argument,
    _create_dataset_directory,
)

URL = "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip"

MD5 = "542ccefacc6c27f945fb54453812b3cd"

... can we just find an alternate URL and change the function?
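Any replacement URL would also need an MD5 that matches the file it actually serves, since the MD5 constant sits right next to URL in the module. A minimal sketch for computing that digest from a locally downloaded copy, using only the standard library; the name `file_md5` and the chunk size are illustrative choices, not part of torchtext:

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so a large archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running `file_md5("wikitext-2-v1.zip")` on a candidate archive gives the string that would have to replace the MD5 value if that archive is not byte-identical to the original.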

WenqiangZhang003 commented 4 months ago

Hi team, is there a solution for this error yet? We have run into the same problem.

WangX0111 commented 2 months ago

How can we change the URL?

ihainan commented 1 day ago

I uploaded the wikitext-2-v1.zip file to my server and changed the source code lines in the lib/python3.10/site-packages/torchtext/datasets/wikitext2.py file from:

URL = "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip"
MD5 = "542ccefacc6c27f945fb54453812b3cd"

to

URL = "http://la.ihainan.me/wikitext-2-v1.zip"
MD5 = "f6e734fc17885b364243f67b30385a3d"

to temporarily solve this issue.
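The same change might be possible without editing files under site-packages, by overriding the two module-level constants at runtime before the dataset is created. A rough sketch, assuming torchtext looks up `URL` and `MD5` in the `wikitext2` module each time `WikiText2(...)` is called (not verified here). `override_source` is a hypothetical helper, and the stand-in module exists only so the snippet runs without torchtext installed:

```python
import types


def override_source(module, url, md5):
    """Point a dataset module at a different archive; return the previous (URL, MD5)."""
    previous = (module.URL, module.MD5)
    module.URL = url
    module.MD5 = md5
    return previous


# Stand-in for torchtext.datasets.wikitext2, holding the constants quoted above.
fake_wikitext2 = types.ModuleType("fake_wikitext2")
fake_wikitext2.URL = "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip"
fake_wikitext2.MD5 = "542ccefacc6c27f945fb54453812b3cd"

# Swap in the mirror URL and checksum from the workaround above.
previous = override_source(
    fake_wikitext2,
    "http://la.ihainan.me/wikitext-2-v1.zip",
    "f6e734fc17885b364243f67b30385a3d",
)
```

With torchtext installed, the real module would presumably be passed in place of the stand-in, and the call would have to come before `WikiText2(split='train')`.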