tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

[GSoC] Update fake data for caltech_birds2011 & other datasets #1792

Closed Conchylicultor closed 4 years ago

Conchylicultor commented 4 years ago

The fake data for caltech_birds2011 is way too big (> 100MB). We should investigate where this huge size comes from and try to reduce it.

Fake data is at: https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/testing/test_data/fake_examples/caltech_birds2011

The test is at: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/caltech_birds_test.py

Other datasets with a huge fake data size include:

Those datasets account for more than 70% of the total fake data size, and caltech_birds2011 alone is almost half of it. Reducing the size of their fake data would have a huge impact on the size of our GitHub repository.
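One way to see where the size comes from is to total the bytes under each dataset's fake data directory and rank them. The snippet below is only a rough sketch: the function names are made up for illustration, and the path in the `__main__` block assumes it is run from the root of a local checkout.

```python
from pathlib import Path


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under `path`."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def fake_data_report(fake_examples_dir):
    """Return (dataset_name, size_bytes, share_of_total) rows, largest first."""
    root = Path(fake_examples_dir)
    # One entry per immediate subdirectory, i.e. per dataset.
    sizes = {d.name: dir_size_bytes(d) for d in root.iterdir() if d.is_dir()}
    total = sum(sizes.values())
    report = [
        (name, size, size / total if total else 0.0)
        for name, size in sizes.items()
    ]
    return sorted(report, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Assumed location, relative to the repository root.
    root = Path("tensorflow_datasets/testing/test_data/fake_examples")
    if root.is_dir():
        for name, size, share in fake_data_report(root)[:10]:
            print(f"{name:40s} {size / 1e6:8.1f} MB {share:6.1%}")
```

Calling `dir_size_bytes` on each subfolder of a single dataset's directory narrows the search down to the individual files responsible.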

Eshan-Agarwal commented 4 years ago

@Conchylicultor I would like to take this issue. I'm working on it.

vijayphoenix commented 4 years ago

There are some unnecessary files in the caltech_birds2011 dataset.

Eshan-Agarwal commented 4 years ago

@vijayphoenix I don't understand why they are unnecessary; they are present in CUB_200_2011.tar.gz.

vijayphoenix commented 4 years ago

@Eshan-Agarwal The files were downloaded but never used for dataset generation.
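A quick way to find such files is to compare what is on disk against the paths the builder actually reads. This is a minimal sketch, not part of TFDS: `unused_files` is a hypothetical helper, and the list of used paths has to be collected by hand from the dataset's generation code.

```python
from pathlib import Path


def unused_files(fake_dir, used_relative_paths):
    """List files under `fake_dir` whose relative path is not in `used_relative_paths`."""
    root = Path(fake_dir)
    # Normalize to POSIX-style strings so comparisons work on any platform.
    used = {Path(p).as_posix() for p in used_relative_paths}
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.relative_to(root).as_posix() not in used
    )
```

Anything it returns is a candidate for removal from the fake data, provided the test still passes afterwards.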

acharles7 commented 4 years ago

What's left in this issue?