tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0
4.31k stars 1.55k forks source link

tfds.as_numpy(...) handling of RaggedTensors #3906

Open rodrigob opened 2 years ago

rodrigob commented 2 years ago

Hello, I was recently bitten by tfds.as_numpy(...) handling of RaggedTensors,

Note that because TensorFlow has support for ragged tensors and NumPy has no equivalent representation,
 [tf.RaggedTensors](https://www.tensorflow.org/api_docs/python/tf/RaggedTensor)
 are left as-is for the user to deal with them (e.g. using to_list()).
In TF 1 (i.e. graph mode), tf.RaggedTensors are returned as tf.ragged.RaggedTensorValues.

however the ragged tensor documentation indicates Ragged tensors can be converted to nested Python lists and NumPy arrays: [...] digits.numpy().

It seems then that _elem_to_numpy_eager(...) would need updating.

Conchylicultor commented 2 years ago

Yes, however, the ragged_tensor.numpy() remove information on the existing ragged tensor. Depending on your use-case might, it cam make it more complicated to use. Because you would now have to manually parse the nested list.

For example, accessing flat_values or row_limits is not possible anymore on ragged_tensor.numpy(), so keeping tf allow things like:

ragged_tensor.flat_values.numpy()
ragged_tensor.row_limits().numpy()

The proper solution would be to have some custom numpy ragged tensor support, but would require more important engineering efforts.

rodrigob commented 2 years ago

Another option would be to have an argument in tfds.as_numpy() that indicates how to treat RaggedTensors (with default to "return as-is").

Without it I resorted to implementing my own versions of:

def _eager_dataset_iterator(ds: tf.data.Dataset) -> t.Iterator[NumpyElem]
def _elem_to_numpy_eager(tf_el: TensorflowElem) -> t.Union[NumpyElem, t.Iterable[NumpyElem]]
def tfds_as_numpy(dataset: tf.data.Dataset) -> tf.data.Dataset

where _elem_to_numpy_eager includes

elif isinstance(tf_el, tf.RaggedTensor):
    return tf_el.numpy() 

As long as tfds.core.dataset_utils does not change much, this will work fine for my use case; but feels like slightly abusing the library.

Conchylicultor commented 2 years ago

Also note you can also use ds.as_numpy_iterator() which should directly returns tf.RaggedTensor as numpy.

as_numpy_iterator has some difference with tfds.as_numpy though. Like not supporting None, len(ds),... but it's been a while I haven't tried, so those might have been fixed.

rodrigob commented 2 years ago

Indeed that seems to fit better my use case. Then I suggest that tfds.as_numpy(...) docstring should mention something along the lines of "for eager mode only use cases, ds.as_numpy_iterator(...) might be a better fit." (specially for use cases with RaggedTensor)