Open mahrahimi1 opened 4 years ago
This is because Wikipedia dump files are only hosted for a few months, so the 20200301 files have been deleted.
Note: rather than regenerating the data, you could manually download the pre-generated dataset directly from https://console.cloud.google.com/storage/browser/tfds-data/datasets/wikipedia using the gcs_util CLI.
I believe you could also try with tfds-nightly; it should copy the dataset locally from GCS automatically, since we re-enabled GCS support on Windows.
Is there some way to download the zip files from the Wikipedia dumps myself and use the TensorFlow Datasets code only to preprocess them? Would it be possible to add that functionality?
Why download the original files when the dataset is already processed at https://console.cloud.google.com/storage/browser/tfds-data/datasets/wikipedia ?
If you want the new dumps version, you could also load wikipedia with a custom config: https://github.com/tensorflow/datasets/blob/0c308e4b7be9d0c834d0021c6cf6566f6cd57c00/tensorflow_datasets/text/wikipedia.py#L126
Thanks for the cloud console solution!
As for changing the dumps version, that solution does not work because tfds restricts you to one of the versions present on Google Cloud, in this case 20200301 or 20190301. That's why I asked whether it would be possible to add that functionality!
> As for changing the dumps version, that solution does not work because tfds restricts you to use one of the versions which is present on Google Cloud, in this case 20200301 or 20190301.
I was talking about generating wikipedia at a new version: `tfds.load('wikipedia', builder_kwargs={'builder_config': tfds.text.WikipediaConfig(...)})`
But using the current GCS version is definitely easier.
When I try to download the English Wikipedia dataset using:
I get the error:
I would like to download it locally (not on GCS).
Environment information
- Operating System: Windows 10
- Python version: 3.8.3
- tensorflow-datasets: 3.2.1
- tensorflow version: 2.3.1