tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

Prevent replicated functionality? #372

Closed damienpontifex closed 5 years ago

damienpontifex commented 5 years ago

How does this repo decide what's in and what's better elsewhere in the TensorFlow ecosystem?

I ask because I noticed the readme has a guide using:

import tensorflow_io.mnist as mnist_io
# Read MNIST into tf.data.Dataset
d_train = mnist_io.MNISTDataset(
    'train-images-idx3-ubyte.gz',
    'train-labels-idx1-ubyte.gz',
    batch=1)

but tensorflow/datasets also has an MNIST dataset.

Shouldn't the dataset data live outside this repo, with the functionality in tensorflow/io focused on a "collection of file systems and file formats"?

yongtang commented 5 years ago

@damienpontifex There is some overlap, but the two projects mean different things by "dataset". The dataset we work on here is a subclass of tf.data.Dataset, together with its C++ implementation, for reading the MNIST data format. A dataset in tensorflow-datasets is a data package that can be downloaded and consumed directly.

For example, in our case MNIST does not refer only to the gzip files that can be downloaded from http://yann.lecun.com/exdb/mnist/; MNIST is a legitimate file format in its own right.

As mentioned in PR #111, the MNIST format is also used by Fashion-MNIST, Kuzushiji-MNIST, and EMNIST. It is also used by people who generate their own data in MNIST format so that they can reuse a data pipeline they have already tested.
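To illustrate what "generating MNIST format" can look like, here is a minimal sketch using only the Python standard library. It assumes the IDX layout described on the MNIST page (big-endian 32-bit header fields, unsigned-byte payload). The helper names `write_idx_images`, `write_idx_labels`, and `read_idx_header` are hypothetical and are not part of tensorflow-io.

```python
import gzip
import struct

# IDX magic numbers: two zero bytes, a type code (0x08 = unsigned byte),
# and the number of dimensions (3 for images, 1 for labels).
IMAGE_MAGIC = 0x00000803
LABEL_MAGIC = 0x00000801


def write_idx_images(path, images):
    """Write equally sized 2-D images (rows of 0-255 ints) as a gzipped IDX3 file.

    Hypothetical helper for illustration only.
    """
    rows, cols = len(images[0]), len(images[0][0])
    with gzip.open(path, "wb") as f:
        # Header: magic, image count, rows, cols (big-endian 32-bit integers).
        f.write(struct.pack(">IIII", IMAGE_MAGIC, len(images), rows, cols))
        # Payload: pixels in row-major order, one unsigned byte each.
        for image in images:
            for row in image:
                f.write(bytes(row))


def write_idx_labels(path, labels):
    """Write labels (0-255 ints) as a gzipped IDX1 file. Hypothetical helper."""
    with gzip.open(path, "wb") as f:
        # Header: magic, label count.
        f.write(struct.pack(">II", LABEL_MAGIC, len(labels)))
        f.write(bytes(labels))


def read_idx_header(path):
    """Return (magic, dims) from a gzipped IDX file. Hypothetical helper."""
    with gzip.open(path, "rb") as f:
        (magic,) = struct.unpack(">I", f.read(4))
        ndim = magic & 0xFF  # the lowest byte stores the number of dimensions
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
    return magic, dims
```

Files written this way follow the same layout as the original MNIST archives, which is why a reader built for that layout can in principle be reused for other data.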

In our readme, we don't provide a way to automatically download the MNIST data (unlike tensorflow-datasets). It is up to the user to supply files in MNIST format; they can then use MNISTDataset, which is a subclass of tf.data.Dataset and can be saved as part of the graph.

damienpontifex commented 5 years ago

Thanks for the insights @yongtang