Open 0x0badc0de opened 4 years ago
Thanks for reporting. Contrary to https://github.com/tensorflow/datasets/issues/1093, Tensor.encode_example performs some type checking on the values which we should try to keep.
1) At first, we could try to optimize Tensor.encode_example to keep the same shape/dtype checking, but without converting to np.array first. That should make the encoding faster.
2) If this is still not enough, we might have to add a new encode_list_example to perform batch encoding, though I would like to avoid it if possible. That would be more tricky, as all features which inherit from Tensor would have to be modified.
I don't think I'll have time in the near future to look at this. If you have time, could you try (1) and see if it is enough for your use case? The bottleneck seems to be at: https://github.com/tensorflow/datasets/blob/v1.3.0/tensorflow_datasets/core/features/feature.py#L476
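Suggestion (1) could be sketched roughly like this, as a standalone function rather than a patch to the real class. All names here are hypothetical (the actual method lives on tfds.features.Tensor); the idea is just to short-circuit the np.array conversion when the input already matches the declared shape/dtype:

```python
import numpy as np

def encode_example_no_copy(example_data, shape, np_dtype):
    """Hypothetical sketch of option (1): keep shape/dtype checks,
    but skip the up-front np.array conversion when the input is
    already an ndarray of the right dtype and rank."""
    if isinstance(example_data, np.ndarray):
        if example_data.dtype != np_dtype:
            raise ValueError(
                f"dtype mismatch: {example_data.dtype} vs {np_dtype}")
        if example_data.ndim != len(shape):
            raise ValueError(
                f"rank mismatch: {example_data.ndim} vs {len(shape)}")
        return example_data  # no copy, no re-conversion
    # Otherwise fall back to the existing (slower) conversion path.
    return np.asarray(example_data, dtype=np_dtype)
```

With this shape of change, data that is already stored as correctly-typed NumPy arrays would pass through untouched.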
The only thing I was able to improve in Tensor.encode_example is caching np_dtype as self._np_dtype = np.dtype(self.dtype.as_numpy_dtype) in Tensor.__init__. This change is only invalid if the np.dtype(self.dtype.as_numpy_dtype) expression can return different results after __init__ is done. It gives a 7% speed-up, which is not a big improvement compared to the 3x gain when Tensor.encode_example is not invoked at all.
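The caching change, sketched on a minimal stand-in class (this is a simplified illustration, not the real tfds.features.Tensor):

```python
import numpy as np

class TensorSketch:
    """Simplified stand-in for tfds.features.Tensor, illustrating
    only the np_dtype caching change described above."""

    def __init__(self, shape, dtype):
        self._shape = shape
        # Computed once here: np.dtype(...) is deterministic for a
        # fixed dtype, so recomputing it on every encode_example
        # call is wasted work.
        self._np_dtype = np.dtype(dtype)

    def encode_example(self, example_data):
        # Uses the cached dtype instead of rebuilding it per call.
        return np.asarray(example_data, dtype=self._np_dtype)
```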
Did I get you right that Tensor.encode_example can return the example_data parameter as-is if it has the same shape/dtype? E.g., will the following code break anything?

```python
if self._shape == () and np.isscalar(example_data) and np.dtype(type(example_data)) == self._np_dtype:
    return example_data
```
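As a standalone function, the proposed fast path would behave like this (a hypothetical helper, not the actual TFDS code; float64 is used in the check below because np.dtype(type(x)) for Python ints is platform-dependent):

```python
import numpy as np

def maybe_passthrough(example_data, shape, np_dtype):
    """Sketch of the proposed fast path: return scalars unchanged
    when the declared shape is () and the dtype already matches."""
    if (shape == () and np.isscalar(example_data)
            and np.dtype(type(example_data)) == np_dtype):
        return example_data  # skip the np.asarray round-trip entirely
    # Everything else goes through the normal conversion.
    return np.asarray(example_data, dtype=np_dtype)
```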
Anyway, this gives only a 20% boost. We regenerate datasets weekly, so the issue is not of great importance at the moment. Just wanted to share that there is a possibility for improvement.
Thanks for investigating and for the benchmarks. I don't know yet when we'll have time to look into this, but this information will be useful.
For those facing the same issue: we were able to get a 20x speed-up for one of our datasets by replacing tfds.features.Text with tf.string, and tfds.features.Sequence(tf.float32) with tfds.features.Tensor(shape=(None,), dtype=tf.float32).
Replacing Sequence with Tensor gave the biggest effect.
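The schema change described above would look roughly like this (a config sketch for the TFDS 1.x API used in this thread; field names are made up):

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Before: per-element encoding paths, slow for long sequences.
slow_features = tfds.features.FeaturesDict({
    "text": tfds.features.Text(),
    "values": tfds.features.Sequence(tf.float32),
})

# After: the replacement that gave the ~20x speed-up reported above.
fast_features = tfds.features.FeaturesDict({
    "text": tf.string,
    "values": tfds.features.Tensor(shape=(None,), dtype=tf.float32),
})
```

Note that Tensor(shape=(None,)) accepts variable-length sequences just like Sequence does, but encodes the whole value at once instead of element by element.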
Short description Probably just like #1093, but for encoding. We have a lot of data stored in tfds.features.Sequence(tf.int64) and tfds.features.Sequence(tf.float32), and the more data we add, the slower dataset creation becomes.
Some measurements:
- Default dataset: 14min 8s
- Dataset with sequence feature values set to an empty list: 16s
- All sequence feature values wrapped in np.array: 12min 46s
- With these lines commented out https://github.com/tensorflow/datasets/blob/v1.3.0/tensorflow_datasets/core/features/sequence_feature.py#L142-L145: 5min 32s (the dataset seems to be correct afterwards)
Environment information
- tensorflow-datasets/tfds-nightly version: tensorflow-datasets==1.3.0
- tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow==1.14.0
Reproduction instructions
Create a dataset with sequence-of-int/float features, populate each feature with a list of 10k values, and measure dataset creation time. Comment out the lines at https://github.com/tensorflow/datasets/blob/v1.3.0/tensorflow_datasets/core/features/sequence_feature.py#L142-L145, re-create the dataset, measure creation time, and verify the dataset is the same. Probably some existing dataset can expose the behaviour as well.
Link to logs The code in question converts a list like this:
into
My guess is that the multiple array creations are the cause of the slowness.
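The cost difference behind that guess can be seen with plain NumPy (a sketch of the pattern only, not the actual TFDS code path): converting each element separately creates thousands of tiny arrays, while a batched call makes a single allocation.

```python
import numpy as np

values = list(range(10_000))

# Per-element conversion: one tiny ndarray per item, which is the
# kind of work the linked sequence_feature.py lines end up doing.
per_element = [np.asarray(v, dtype=np.int64) for v in values]

# Batched conversion: a single allocation for the whole sequence.
batched = np.asarray(values, dtype=np.int64)

assert len(per_element) == batched.shape[0] == 10_000
```

Timing the two loops with timeit shows the per-element version paying a fixed np.asarray overhead on every item, which matches the slowdown growing with sequence length.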
Expected behavior A sequence of numbers shouldn't take too much time to encode.