tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Slow 'encode_example' for Sequence #1284

Open 0x0badc0de opened 4 years ago

0x0badc0de commented 4 years ago

Short description

Probably just like #1093, but for encoding. We have a lot of data stored in tfds.features.Sequence(tf.int64) and tfds.features.Sequence(tf.float32), and the more data we add, the slower dataset creation becomes.

Some measurements:

- Default dataset: 14min 8s
- Dataset with sequence feature values set to empty lists: 16s
- All sequence feature values wrapped in np.array: 12min 46s
- With the lines at https://github.com/tensorflow/datasets/blob/v1.3.0/tensorflow_datasets/core/features/sequence_feature.py#L142-L145 commented out: 5min 32s (the dataset seems to be correct after this)

Environment information

Reproduction instructions

Create a dataset with sequence-of-int/float features, populate each feature with a list of 10k values, and measure dataset creation time. Then comment out the lines at https://github.com/tensorflow/datasets/blob/v1.3.0/tensorflow_datasets/core/features/sequence_feature.py#L142-L145, re-create the dataset, measure creation time again, and verify the dataset is unchanged. Some existing dataset can probably expose the behaviour as well.
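For reference, a minimal micro-benchmark sketch of the slow path, assuming FeatureConnector.encode_example behaves as in TFDS v1.x; the list size and iteration count are arbitrary:

```python
import time

import tensorflow as tf
import tensorflow_datasets as tfds

# A scalar-dtype Sequence, as used in our dataset.
feature = tfds.features.Sequence(tf.float32)

# One example: a plain Python list of 10k floats.
example = [0.1] * 10_000

start = time.perf_counter()
for _ in range(100):  # simulate encoding 100 examples
    feature.encode_example(example)
elapsed = time.perf_counter() - start
print(f"encoded 100 examples in {elapsed:.2f}s")
```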

Link to logs

The code in question converts a list like this:

[0.20000000298023224, 0.4000000059604645, 0.0, 0.0, 0.20000000298023224, 0.10000000149011612, 0.44999998807907104, 0.05000000074505806, 0.0, ...]

into

[array(0.2, dtype=float32), array(0.4, dtype=float32), array(0., dtype=float32), array(0., dtype=float32), array(0.2, dtype=float32), array(0.1, dtype=float32), array(0.45, dtype=float32), array(0.05, dtype=float32), array(0., dtype=float32), ...]

My guess is that the creation of many small arrays is the cause of the slowness.
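That guess is easy to check in isolation; a quick sketch comparing per-element conversion (what the encoder effectively does here) against a single vectorized conversion (the list size is arbitrary):

```python
import time

import numpy as np

values = [0.1] * 10_000

# Per-element conversion: one small np.asarray per value.
t0 = time.perf_counter()
per_element = [np.asarray(v, dtype=np.float32) for v in values]
t_per_element = time.perf_counter() - t0

# Vectorized conversion: one np.asarray for the whole list.
t0 = time.perf_counter()
vectorized = np.asarray(values, dtype=np.float32)
t_vectorized = time.perf_counter() - t0

print(f"per-element: {t_per_element:.4f}s, vectorized: {t_vectorized:.4f}s")
```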

Expected behavior

Encoding a sequence of numbers shouldn't take this much time.

Conchylicultor commented 4 years ago

Thanks for reporting. Contrary to https://github.com/tensorflow/datasets/issues/1093, Tensor.encode_example executes some type checking on the values, which we should try to keep.

1) First, we could try to optimize Tensor.encode_example to keep the same shape/dtype checking but skip the conversion to np.array. That should make the encoding faster.

2) If that is still not enough, we might have to add a new encode_list_example to perform batch encoding, though I would like to avoid it if possible. That would be trickier, as all features which inherit from Tensor would have to be modified.
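For illustration only, a sketch of what such a batch encoder could look like; encode_list_example does not exist in TFDS, and self._np_dtype is a hypothetical cached dtype:

```python
import numpy as np

def encode_list_example(self, list_of_values):
    # Hypothetical batch path: one vectorized conversion plus one
    # rank check for the whole sequence, instead of one np.asarray
    # call per element in Sequence.encode_example.
    np_data = np.asarray(list_of_values, dtype=self._np_dtype)
    expected_ndim = len(self._shape) + 1  # +1 for the sequence dimension
    if np_data.ndim != expected_ndim:
        raise ValueError(
            f"Wrong rank for sequence: got shape {np_data.shape}, "
            f"expected rank {expected_ndim}")
    return np_data
```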

I don't think I'll have time in the near future to look at this. If you have time, could you try (1) and see if this is enough for your use case? The bottleneck seems to be at https://github.com/tensorflow/datasets/blob/v1.3.0/tensorflow_datasets/core/features/feature.py#L476

0x0badc0de commented 4 years ago

The only thing I was able to improve in Tensor.encode_example is caching the numpy dtype as self._np_dtype = np.dtype(self.dtype.as_numpy_dtype) in Tensor.__init__. This change is only invalid if the np.dtype(self.dtype.as_numpy_dtype) expression can return different results after __init__ is done.

This gives a 7% speed-up, which is not a big improvement compared to the 3x gained when Tensor.encode_example is not invoked at all.

Did I understand you correctly that Tensor.encode_example can return the example_data parameter as-is if it already has the expected shape/dtype? E.g., would the following code break anything?

if self._shape == () and np.isscalar(example_data) and np.dtype(type(example_data)) == self._np_dtype:
  return example_data
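Put together, a simplified sketch of both changes; the class below is a stripped-down stand-in for tfds.features.Tensor, not the real implementation:

```python
import numpy as np
import tensorflow as tf

class Tensor:  # stripped-down stand-in for tfds.features.Tensor
  def __init__(self, shape, dtype):
    self._shape = shape
    self._dtype = dtype
    # Cache the numpy dtype once instead of recomputing it per example.
    self._np_dtype = np.dtype(dtype.as_numpy_dtype)

  def encode_example(self, example_data):
    # Fast path: a scalar that already has the expected dtype is
    # returned as-is, skipping the np.asarray round-trip.
    if (self._shape == () and np.isscalar(example_data)
        and np.dtype(type(example_data)) == self._np_dtype):
      return example_data
    # Slow path: convert, then downstream shape/dtype checks run as before.
    return np.asarray(example_data, dtype=self._np_dtype)
```

Note that a plain Python float still takes the slow path here, since np.dtype(float) is float64; only values already typed as, e.g., np.float32 hit the fast path.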

Anyway, this gives only a 20% boost. We regenerate datasets weekly, so the issue is not of great importance at the moment. I just wanted to share that there is room for improvement.

Conchylicultor commented 4 years ago

Thanks for investigating and for the benchmarks. I don't know yet when we'll have time to look into this, but this information will be useful.

0x0badc0de commented 3 years ago

For those facing the same issue: we were able to get a 20x speed-up for one of our datasets by replacing tfds.features.Text with tf.string and tfds.features.Sequence(tf.float32) with tfds.features.Tensor(shape=(None,), dtype=tf.float32).

Replacing Sequence with Tensor gave the biggest effect.
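For reference, a sketch of that feature-spec swap; the field names are made up, and whether Tensor accepts a None dimension depends on the TFDS version:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Before: slow per-element encoding through Sequence/Text.
slow_features = tfds.features.FeaturesDict({
    "label": tfds.features.Text(),
    "values": tfds.features.Sequence(tf.float32),
})

# After: ~20x faster generation for our data. A plain dtype becomes a
# scalar Tensor feature, and the variable-length sequence becomes a
# rank-1 Tensor with a dynamic first dimension.
fast_features = tfds.features.FeaturesDict({
    "label": tf.string,
    "values": tfds.features.Tensor(shape=(None,), dtype=tf.float32),
})
```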