tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0
Dataset documentation should clearly mark which ones require manual download #936

Status: Closed (cyfra closed 4 years ago)

cyfra commented 5 years ago

Some of the datasets require manual downloading of files. This should be clearly marked in the dataset documentation.

Preferably it should be detected 'automatically' - by seeing which datasets depend on manual_dir.

ShambhaviCodes commented 5 years ago

I would like to work on this issue. Could you please guide me on how to proceed?

Conchylicultor commented 5 years ago

@ShambhaviCodes Thanks for looking into this.

manual_dir is accessed through: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/download/download_manager.py#L372

After download_and_prepare, the dataset builder should check whether this field has been accessed, which indicates that the dataset uses manual data: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_builder.py#L285

This information should then be saved inside dataset_info (in some new field, self.info.use_manual_data), so the DatasetInfo class should be updated: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_info.py

Note that you also want to update the associated proto: https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/core/proto

Finally, the dataset template should be updated to use this new builder.info.use_manual_data field: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/scripts/templates/dataset.mako.md
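
To illustrate the idea only, here is a minimal sketch of the detection pattern: a property records that it was read, and the builder copies that flag into its info object after preparation. The classes SketchDownloadManager, SketchInfo, SketchBuilder, and ManualBuilder are hypothetical stand-ins written for this example, not the actual TFDS classes, and the real change would also need the proto and template updates mentioned above.

```python
class SketchDownloadManager:
    """Hypothetical stand-in for a download manager (not the TFDS class)."""

    def __init__(self, manual_dir):
        self._manual_dir = manual_dir
        # Flipped to True the first time the manual_dir property is read.
        self.manual_dir_accessed = False

    @property
    def manual_dir(self):
        self.manual_dir_accessed = True
        return self._manual_dir


class SketchInfo:
    """Hypothetical stand-in for dataset info, with the proposed new field."""

    def __init__(self):
        self.use_manual_data = False


class SketchBuilder:
    """Hypothetical builder that never touches manual_dir."""

    def __init__(self):
        self.info = SketchInfo()

    def _split_generators(self, dl_manager):
        return []

    def download_and_prepare(self, dl_manager):
        self._split_generators(dl_manager)
        # After preparation, record whether manual data was requested.
        self.info.use_manual_data = dl_manager.manual_dir_accessed


class ManualBuilder(SketchBuilder):
    """Hypothetical builder that reads files from manual_dir."""

    def _split_generators(self, dl_manager):
        return [dl_manager.manual_dir]


auto_builder = SketchBuilder()
auto_builder.download_and_prepare(SketchDownloadManager("/tmp/manual"))

manual_builder = ManualBuilder()
manual_builder.download_and_prepare(SketchDownloadManager("/tmp/manual"))

print(auto_builder.info.use_manual_data, manual_builder.info.use_manual_data)
```

A documentation template could then branch on the use_manual_data flag to print a "manual download required" notice.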

Let me know if you encounter any issues.

ShambhaviCodes commented 5 years ago

I was trying to reproduce this issue. I have managed to clone, extract, and run the code to build the mnist dataset. Can you suggest a dataset that requires manual downloading of files?

Conchylicultor commented 5 years ago

You can find which datasets use manual_dir by searching the code: https://github.com/tensorflow/datasets/search?p=2&q=dl_manager.manual_dir&unscoped_q=dl_manager.manual_dir

For instance: abstract_reasoning, chexpert, xsum,...
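
The same search can be approximated on a local checkout. The sketch below assumes the repository has been cloned locally and simply looks for the literal string dl_manager.manual_dir in Python files. The function name find_manual_dir_users is hypothetical, and a plain text match may miss datasets that reach manual_dir indirectly.

```python
import pathlib


def find_manual_dir_users(root, needle="dl_manager.manual_dir"):
    """Return relative paths of .py files under root that contain needle."""
    root = pathlib.Path(root)
    hits = []
    for path in sorted(root.rglob("*.py")):
        # Ignore undecodable bytes so one odd file does not stop the scan.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text:
            hits.append(path.relative_to(root).as_posix())
    return hits
```

For example, calling find_manual_dir_users("tensorflow_datasets") from the root of a local clone would list the matching dataset modules.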

cyfra commented 4 years ago

Fixed in #1227