pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Wikitext-103 URL is down #2255

Open albertz opened 3 months ago

albertz commented 3 months ago

https://github.com/pytorch/text/blob/4bf6b30314649801ecc28888aa54acea8d0f4d99/torchtext/datasets/wikitext103.py#L11

None of the links to https://s3.amazonaws.com/research.metamind.io work anymore. I get "Access Denied".

albertz commented 3 months ago

For reference, one copy I found is via pardata: https://github.com/CODAIT/pardata/blob/1d1600ad3eed6894da7dbddc451cd38aa03c770c/tests/schemata/datasets.yaml#L42C21-L42C99

It is not exactly the same file (tar.gz instead of zip), but it looks like it has the same content (the files: LICENSE.txt, README.txt, wiki.test.tokens, wiki.train.tokens, wiki.valid.tokens).

Another copy of the data is on HuggingFace in various forms, for example: https://huggingface.co/datasets/wikitext
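
To check that a downloaded tar.gz copy really carries the files listed above, something like the following sketch could be used. The function name `missing_wikitext_files` is hypothetical, and the directory layout inside the archive is an assumption (it is not stated in this thread), so only base names are compared.

```python
import tarfile
from pathlib import PurePosixPath

# File names as listed in the comment above. The folder structure inside the
# archive is an assumption, so members are matched by base name only.
EXPECTED_FILES = {
    "LICENSE.txt",
    "README.txt",
    "wiki.test.tokens",
    "wiki.train.tokens",
    "wiki.valid.tokens",
}


def missing_wikitext_files(archive_path):
    """Return the expected file names that are absent from the tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        present = {
            PurePosixPath(member.name).name
            for member in archive.getmembers()
            if member.isfile()
        }
    return EXPECTED_FILES - present
```

An empty set from `missing_wikitext_files("path/to/archive.tar.gz")` would mean all five files are present. It says nothing about whether their contents match the original zip byte for byte.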

codes1gn commented 2 months ago

Hi Albertz, I faced exactly the same issue on torchtext 0.17.2. Have you found a neat solution to this issue? I found that datasets from other sources may need to be adapted one by one.

albertz commented 2 months ago

I did not find the zip files anywhere. Instead, I have been using the tar.gz files linked above, which seem to contain the same content.
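
For anyone taking the same route, here is a minimal sketch of reading one split straight from such a tar.gz archive without going through torchtext. The helper name `iter_split_lines` is hypothetical. UTF-8 encoding, the skipping of blank lines, and matching members by base name are assumptions, not something confirmed in this thread.

```python
import io
import tarfile
from pathlib import PurePosixPath


def iter_split_lines(archive_path, split):
    """Yield the non-empty lines of wiki.<split>.tokens from a tar.gz archive.

    `split` is expected to be "train", "valid" or "test", following the file
    names mentioned earlier in the thread.
    """
    wanted = f"wiki.{split}.tokens"
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Match by base name, since the folder layout is an assumption.
            if member.isfile() and PurePosixPath(member.name).name == wanted:
                raw = archive.extractfile(member)
                # Assumes the token files are UTF-8 encoded text.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    if line.strip():
                        yield line.rstrip("\n")
                return
    raise FileNotFoundError(f"{wanted} not found in {archive_path}")
```

Because this is a generator, the archive is only opened once iteration starts, and a missing split raises `FileNotFoundError` at that point rather than at call time.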