tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Wikipedia Dataset Is Broken #2535

Open mahrahimi1 opened 4 years ago

mahrahimi1 commented 4 years ago

When I try to download the English Wikipedia dataset using:

python -m tensorflow_datasets.scripts.download_and_prepare --datasets=wikipedia/20200301.en

I get the error:

tensorflow_datasets.core.download.downloader.DownloadError: 
Failed to get url https://dumps.wikimedia.your.org/enwiki/20200301/dumpstatus.json. HTTP code: 404.

I would like to download it locally (not on GCS).

Environment information:
Operating System: Windows 10
Python version: 3.8.3
tensorflow-datasets: 3.2.1
tensorflow version: 2.3.1

Conchylicultor commented 4 years ago

This is because Wikipedia dump files are only hosted for a few months, so the 20200301 files have been deleted.
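To check whether a given dump is still hosted before starting a long download_and_prepare run, one option is to probe the same dumpstatus.json file that the error message mentions. This is only a sketch: the mirror host comes from the error above, the URL pattern is inferred from that single URL, and the helper names dump_status_url and dump_is_available are hypothetical, not part of tfds.

```python
import urllib.error
import urllib.request

# Mirror host taken from the error message in this issue.
MIRROR = "https://dumps.wikimedia.your.org"


def dump_status_url(language: str, date: str) -> str:
    """Build the dumpstatus.json URL for a language code and a YYYYMMDD date."""
    # Assumed pattern "<language>wiki/<date>/dumpstatus.json", inferred from
    # the one URL shown in the error message.
    return f"{MIRROR}/{language}wiki/{date}/dumpstatus.json"


def dump_is_available(language: str, date: str, timeout: float = 10.0) -> bool:
    """Return True if the mirror still serves the status file for this dump."""
    request = urllib.request.Request(dump_status_url(language, date), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # The mirror answered with an error status, e.g. 404 for a deleted dump.
        # Other network failures (urllib.error.URLError) are left to propagate.
        return False
```

Calling dump_is_available("en", "20200301") should return False for as long as the mirror keeps answering 404 for that dump.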

Note: Rather than regenerating the data, you could manually download the pre-generated dataset directly from https://console.cloud.google.com/storage/browser/tfds-data/datasets/wikipedia using the gsutil CLI.
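A minimal sketch of that manual copy, driven from Python. It assumes the CLI meant here is Google's gsutil tool and that it is installed and on PATH. The gs:// path is derived from the console URL above, while the local data directory and the helper names gsutil_copy_command and copy_pregenerated are hypothetical choices for illustration.

```python
import os
import subprocess

# Bucket path derived from the console URL quoted in this thread.
GCS_ROOT = "gs://tfds-data/datasets/wikipedia"


def gsutil_copy_command(config: str, data_dir: str) -> list:
    """Return the gsutil argument list that copies one config directory."""
    source = f"{GCS_ROOT}/{config}"
    destination = os.path.join(data_dir, "wikipedia")
    # -m runs the copy in parallel, -r copies the directory recursively.
    return ["gsutil", "-m", "cp", "-r", source, destination]


def copy_pregenerated(config: str, data_dir: str) -> None:
    """Create the destination directory, then run gsutil to copy the config."""
    command = gsutil_copy_command(config, data_dir)
    # Create the target first so gsutil copies the config directory into it.
    os.makedirs(command[-1], exist_ok=True)
    subprocess.run(command, check=True)
```

For example, copy_pregenerated("20200301.en", os.path.expanduser("~/tensorflow_datasets")) would try to place the files under the wikipedia subdirectory of that data directory. On Windows the executable may be installed as gsutil.cmd, in which case the first element of the command needs adjusting.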

I believe you could also try tfds-nightly: it should copy the dataset locally from GCS automatically, as we re-enabled GCS on Windows when supported.

ameet-1997 commented 4 years ago

Is there some way to download the zip files from Wikipedia dumps and use the TensorFlow code only to preprocess the files? Would it be possible to add that functionality?

Conchylicultor commented 4 years ago

Why download the original files when the dataset is already processed at https://console.cloud.google.com/storage/browser/tfds-data/datasets/wikipedia ?

If you want a newer dump version, you could also load wikipedia with a custom config: https://github.com/tensorflow/datasets/blob/0c308e4b7be9d0c834d0021c6cf6566f6cd57c00/tensorflow_datasets/text/wikipedia.py#L126

ameet-1997 commented 4 years ago

Thanks for the cloud console solution!

As for changing the dumps version, that solution does not work because tfds restricts you to use one of the versions which is present on Google Cloud, in this case 20200301 or 20190301.

That's why my question was asking if it would be possible to add that functionality!

Conchylicultor commented 4 years ago

> As for changing the dumps version, that solution does not work because tfds restricts you to use one of the versions which is present on Google Cloud, in this case 20200301 or 20190301.

I was talking about generating wikipedia at a new version:

tfds.load('wikipedia', builder_kwargs={'builder_config': tfds.text.WikipediaConfig(...)})

But using the current GCS version is definitely easier.