Open sayakpaul opened 2 years ago
Adding to the above. I noticed that this error can be avoided by using infinite datasets for both train and validation (.repeat()). If either or both of the train and validation datasets are non-infinite, the same or a similar error pops up.
If training is done with a non-infinite dataset (irrespective of the nature of the validation set), the same error pops up, but during the training phase.
All of the above holds for tf.distribute.MirroredStrategy() at least. Hope this helps.
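A minimal sketch of the infinite-dataset workaround described above. The sizes and the use of tf.data.Dataset.range are assumptions standing in for the real TFRecord pipeline; only .repeat() comes from the comment itself.

```python
import math
import tensorflow as tf

NUM_TRAIN, NUM_VAL, GLOBAL_BATCH = 64, 30, 16  # hypothetical sizes

def make_infinite(num_examples):
    # Dataset.range is a stand-in for the real input pipeline (assumption).
    # .repeat() with no argument makes the dataset infinite.
    return tf.data.Dataset.range(num_examples).batch(GLOBAL_BATCH).repeat()

train_ds = make_infinite(NUM_TRAIN)
val_ds = make_infinite(NUM_VAL)

# An infinite dataset has no natural end, so the length of one pass
# has to be spelled out as a number of steps.
steps_per_epoch = math.ceil(NUM_TRAIN / GLOBAL_BATCH)
validation_steps = math.ceil(NUM_VAL / GLOBAL_BATCH)

# Sanity check that the dataset really is infinite.
is_infinite = int(train_ds.cardinality()) == tf.data.INFINITE_CARDINALITY
print(steps_per_epoch, validation_steps, is_infinite)
```

With datasets built this way, the two step counts would be passed to model.fit as steps_per_epoch and validation_steps, since Keras cannot infer where an epoch ends on an infinite dataset.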
I am able to reproduce the reported issue in Colab with TF 2.6 and TF 2.7 as well. Please find the gist here. Thanks!
Could you help make this repro a little more minimal to be able to triangulate what's going wrong? Do we know the minimum piece of code to trigger this error?
One thing that did look suspicious while browsing is that the ragged tf.keras.Input was not actually created with ragged=True, though that may not be related to the error being seen here.
Could you help make this repro a little more minimal to be able to triangulate what's going wrong? Do we know the minimum piece of code to trigger this error?
Actually, we already minimized our implementation to a great extent to produce the current snippet. In our understanding, this is the minimum piece of code that triggers the error.
@mattdangerw @sayakpaul
I created another notebook that has the bare minimum (or as close as I could get to bare minimum) code needed to reproduce the issue. Note that the notebook is fetching a different set of TFRecords from the previous notebook.
I hope this helps!
@Nilabhra, could you please check the links to the files you attached? Both links give a 404 error.
@sachinprasadhs Thanks for letting me know, I have updated the link in the comment.
@sachinprasadhs Could you check the updated link?
@mattdangerw
@Nilabhra has further simplified the notebook reducing the amount of code needed to reproduce the said issue. Let us know if you'd need anything else.
It looks like there's a bug in the create_dummy_tensor function in distribute/input_lib.py, where it does something fairly nonsensical if the rank of the feature is unknown. I'm not entirely clear on how these "dummy tensors" get used, but my best guess is that this could be fixed on TensorFlow's end by a change such as this (new lines marked with "NEW"):
[tensorflow/python/distribute/input_lib.py, in create_dummy_tensor]
if isinstance(spec, ragged_tensor.RaggedTensorSpec):
  if not dims:  ## NEW
    dummy_tensor = tf.zeros([0], feature_type)  ## NEW
  row_splits = array_ops.zeros(1, spec._row_splits_dtype)
  dummy_tensor = ragged_tensor.RaggedTensor.from_nested_row_splits(
      dummy_tensor, (row_splits,) * spec._ragged_rank, validate=False)
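To see what kind of value the proposed dummy would be, here is a rough sketch using public API counterparts (tf.zeros and tf.RaggedTensor.from_nested_row_splits) rather than TensorFlow's internal modules. The ragged_rank of 2 and the int32 dtype are assumptions for illustration; this is not the code in input_lib.py.

```python
import tensorflow as tf

ragged_rank = 2  # hypothetical; the real value comes from the spec

# Rank-1, zero-length flat values, as in the lines marked NEW.
flat_values = tf.zeros([0], tf.int32)
# A row_splits vector holding a single 0 describes zero rows.
row_splits = tf.zeros([1], tf.int64)

# Wrap the empty values in `ragged_rank` levels of empty row partitions.
empty = tf.RaggedTensor.from_nested_row_splits(
    flat_values, (row_splits,) * ragged_rank, validate=False)

print(empty.ragged_rank, int(empty.nrows()), empty.shape.rank)
```

The result is a ragged tensor with zero rows but a well-defined rank, which is the property an empty placeholder batch would need.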
Alternatively, you could modify your data-loading code to ensure that at least the rank of the input feature is known. E.g., if I change read_ragged_feature in the linked colab to the following definition, then the colab works:
def read_ragged_feature(feature_name, feature, ragged_rank):
    ragged_feature = {}
    ragged_feature[feature_name] = deserialize_composite(
        feature, tf.RaggedTensorSpec(dtype=tf.int32, ragged_rank=ragged_rank),
    )
    ragged_feature[feature_name].flat_values.set_shape([None])  # NEW
    return ragged_feature
(This assumes that you statically know the rank of your input tensors -- in this case, I was assuming that there are no "uniform inner" dimensions beyond the ragged dimensions, but you could adjust it if that's not the case for you. If you don't statically know the rank of your input tensors, then this won't help, but I think having unknown ranks for input tensors is fairly rare.)
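The difference between an unknown and a known rank can be checked directly on the spec. The sketch below compares the spec used in the thread with a hypothetical variant that states its shape up front; the shape [None, None] and ragged_rank of 1 are assumptions for illustration, not something taken from the colab.

```python
import tensorflow as tf

# Spec as written in the thread: no shape argument, so the rank is unknown.
unknown_rank_spec = tf.RaggedTensorSpec(dtype=tf.int32, ragged_rank=1)

# Hypothetical variant: a batch dimension plus one ragged dimension,
# with no uniform inner dimensions.
known_rank_spec = tf.RaggedTensorSpec(
    shape=[None, None], dtype=tf.int32, ragged_rank=1)

print(unknown_rank_spec.shape.rank)  # None
print(known_rank_spec.shape.rank)    # 2
```

A rank of None on the first spec is the situation the set_shape call above works around.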
@edloper Thank you so much for taking the time to work on this bug. I guess I can incorporate either of the solutions you provided, the second one being the easier to adopt. I hope the dev team takes notice of this and patches the bug soon.
Describe the problem.
We have a model that consumes multiple ragged tensors in a batch. Our model runs perfectly fine on a single GPU. But the moment we introduce distributed training, its evaluation fails.
Note that training in the distributed setting proceeds smoothly; it is during evaluation that it fails. Since we cannot provide the original data and model, we are providing a minimal snippet in the following notebook that reproduces the issue. You can reproduce the issue on Colab as well as on a multi-GPU machine. We have verified on both, and the issue persists.
Describe the current behavior.
Model consuming RaggedTensors fails during evaluation in a distributed setting.
Describe the expected behavior.
The model should run during evaluation without any errors when exposed to a distributed setting.
Contributing.
Standalone code to reproduce the issue.
Colab Notebook: https://colab.research.google.com/drive/1U9oeed5OMAH1KvN5T455kAsB2Nsh1-KF?usp=sharing.
Source code / logs.
Cc: @Nilabhra