tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Loading the iris dataset via tfds.load does not return a tf.data.Dataset object #3821

Open dluo96 opened 2 years ago

dluo96 commented 2 years ago

Short description

When I load the iris dataset (https://www.tensorflow.org/datasets/catalog/iris) using the tfds.load function, the returned object is not a tf.data.Dataset object (which should be the case according to https://www.tensorflow.org/datasets/overview#tfdsload).

Environment information

Reproduction instructions

import tensorflow as tf
import tensorflow_datasets as tfds

ds_train = tfds.load(
    'iris', 
    shuffle_files=True,
    split=['train'],
    as_supervised=True,
)

assert isinstance(ds_train, tf.data.Dataset)

Link to logs

2022-03-05 13:27:35.574589: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "structured/iris.py", line 12, in <module>
    assert isinstance(ds_train, tf.data.Dataset)
AssertionError

Expected behavior

I expect assert isinstance(ds_train, tf.data.Dataset) to pass without raising an AssertionError.

Additional context

N/A.

tomvdw commented 2 years ago

Hi!

When you do split=['train'], then the return type is a list of tf.data.Dataset. If you do split='train' or use ds_train[0], then the assert should not fail.

tomvdw commented 2 years ago

Sorry, looking at the code, I think the return type is actually a dict {'train': tf.data.Dataset(...)}, so using ds_train['train'] or split='train' should work.

dluo96 commented 2 years ago

Hi @tomvdw, thanks for the reply! Your suggestions were very helpful.

It seems that tfds.load(...) returns an instance of tensorflow.python.data.ops.dataset_ops._OptionsDataset (a subclass of tf.data.Dataset I believe) when split='train'. Meanwhile, it seems to return a list when split=['train'] as you suggested in your first comment.

Based on a few experiments I ran (see below), I think your first comment is correct.

Would you agree with this conclusion?

Experiment 1: Setting split='train'

import tensorflow as tf
import tensorflow_datasets as tfds

ds_train = tfds.load(
    'iris', 
    shuffle_files=True,
    split='train',
    as_supervised=True,
)

assert isinstance(ds_train, tf.data.Dataset)

Experiment 2: Use ds_train[0]

import tensorflow as tf
import tensorflow_datasets as tfds

ds_train = tfds.load(
    'iris', 
    shuffle_files=True,
    split=['train'],
    as_supervised=True,
)

assert isinstance(ds_train[0], tf.data.Dataset)

Experiment 3: Use ds_train['train']

import tensorflow as tf
import tensorflow_datasets as tfds

ds_train = tfds.load(
    'iris', 
    shuffle_files=True,
    split=['train'],
    as_supervised=True,
)

assert isinstance(ds_train['train'], tf.data.Dataset)

This raises the following error:

TypeError: '_OptionsDataset' object is not subscriptable
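For anyone who wants code that works with either form of the split argument, here is a minimal sketch based on what Experiments 1 and 2 suggest: a string split gives back a single dataset, while a one-element list gives back a one-element list. The helper name unwrap_single_split is hypothetical and not part of TFDS, and FakeDataset is only a stand-in so the snippet runs without TensorFlow installed.

```python
from typing import Any


def unwrap_single_split(result: Any) -> Any:
    """Return the lone element if `result` is a one-item list or tuple.

    Hypothetical helper (not part of TFDS). It assumes the behaviour seen
    in the experiments above: split='train' yields a single dataset, while
    split=['train'] yields a list containing one dataset.
    """
    if isinstance(result, (list, tuple)):
        if len(result) != 1:
            raise ValueError(f"expected exactly one split, got {len(result)}")
        return result[0]
    # Anything that is not a list or tuple is passed through unchanged.
    return result


class FakeDataset:
    """Stand-in for tf.data.Dataset so the sketch has no dependencies."""


ds = FakeDataset()
assert unwrap_single_split(ds) is ds    # mimics split='train'
assert unwrap_single_split([ds]) is ds  # mimics split=['train']
```

With TFDS itself, the call would look like unwrap_single_split(tfds.load('iris', split=['train'], as_supervised=True)), after which the isinstance check against tf.data.Dataset should behave as in Experiment 2.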