Closed: p-s-p-s closed this issue 7 months ago.
@p-s-p-s I tried to replicate the issue on Colab and could not reproduce the reported error. Could you check this gist and let us know? Thank you!
@sushreebarsa tf 2.15 is not affected, but 2.16 and 2.17 are.
@sushreebarsa The reason you couldn't reproduce the error in Colab is that the warnings are suppressed by default. Could you please check this colab https://colab.research.google.com/drive/1JuQriKXe-aJBAbValQK-8BFGtktzw4IW?usp=sharing ?
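If the suppression happens via the `TF_CPP_MIN_LOG_LEVEL` environment variable (an assumption; other mechanisms are possible), forcing it to `0` before importing TensorFlow should surface the C++-side warning. A minimal sketch:

```python
import os
# Show all C++-side TensorFlow logs (INFO, WARNING, ERROR); this must be set
# before `import tensorflow` to take effect.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

import tensorflow as tf

for d in tf.data.Dataset.range(3):
    print(d)
# On affected builds the loop ends with:
# W ... local_rendezvous.cc:404] Local rendezvous is aborting with status:
#   OUT_OF_RANGE: End of sequence
```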
@p-s-p-s TF v2.15 is the latest stable version, so the error does not appear there. We recommend using the stable TF version. Thank you!
@sushreebarsa I reported this issue so it can be fixed before the 2.16 release. Moreover, TF 2.15 with https://github.com/tensorflow/tensorflow/commit/04fb826f98b92dd172ad665d8a5522a2f8201867 applied is also affected by this issue internally.
>>> import tensorflow as tf
2024-02-21 10:47:33.109381: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-21 10:47:33.109409: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-21 10:47:33.110044: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-21 10:47:33.113595: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> range_ds = tf.data.Dataset.range(10)
2024-02-21 10:47:54.344759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22462 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6
>>>
>>> for d in range_ds:
... print(d)
...
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
2024-02-21 10:47:56.048435: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
>>> tf.__version__
'2.15.0'
I am not sure what causes the problem, but as a symptomatic workaround it is possible to filter out this particular warning like this:
if (!absl::StrContains(status.message(), "End of sequence")) {
LOG(WARNING) << "Local rendezvous is aborting with status: " << status;
}
@sachinprasadhs I was able to replicate the issue reported here, please have a look. Thank you!
Can confirm this issue with tf-nightly '2.17.0-dev20240210'
2024-02-26 04:03:26.379054: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
[[{{node IteratorGetNext}}]]
[[IteratorGetNext/_4]]
2024-02-26 04:03:26.379063: I tensorflow/core/framework/local_rendezvous.cc:422] Local rendezvous recv item cancelled. Key hash: 381694510697024129
2024-02-26 04:03:26.379073: I tensorflow/core/framework/local_rendezvous.cc:422] Local rendezvous recv item cancelled. Key hash: 6451170228096927380
Hello, I'd like to look into this issue and try to fix it, if that is possible.
Issue running a default TensorFlow training job after a Docker rebuild, only on an RTX A4500:
from tensorflow/tensorflow:latest-gpu
[+] Building 137.8s (9/9) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 285B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/tensorflow/tensorflow:latest-gpu 1.2s
=> [auth] tensorflow/tensorflow:pull token for registry-1.docker.io 0.0s
=> [1/3] FROM docker.io/tensorflow/tensorflow:latest-gpu@sha256:4ab9ffddd6ffacc9251ac6439f431eb38d66200d3f52397b5d 135.7s
[[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 8s 316ms/step - accuracy: 0.1601 - loss: 8.1802
Epoch 3/100
24/25 ━━━━━━━━━━━━━━━━━━━━ 0s 310ms/step - accuracy: 0.2835 - loss: 7.41432024-03-10 04:24:44.885256: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:44.885302: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
2024-03-10 04:24:44.896965: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:44.897036: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 8s 307ms/step - accuracy: 0.2795 - loss: 7.2638
My understanding is that the error does not affect execution, but is the iterator still usable after the error? https://stackoverflow.com/questions/53930242/how-to-fix-a-outofrangeerror-end-of-sequence-error-when-training-a-cnn-with-t
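A minimal sketch (assuming eager execution on TF 2.x) of why the warning by itself does not make the dataset unusable: each `for` loop over a `tf.data.Dataset` builds a fresh iterator, so only an explicitly created iterator is exhausted for good.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(3)

# Each `for` loop creates a new iterator, so the dataset can be traversed
# again after the OUT_OF_RANGE "End of sequence" status ends the previous pass.
first = [int(d) for d in ds]
second = [int(d) for d in ds]
assert first == second == [0, 1, 2]

# An explicit iterator, by contrast, is consumed once it hits the end.
it = iter(ds)
list(it)                       # drains it
assert next(it, None) is None  # already exhausted
```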
@salaki
I didn't observe any negative impact, except that it is quite annoying to receive this warning every time you iterate over a dataset. As a temporary fix for 2.16.1 I just commented out this line in tensorflow/core/framework/local_rendezvous.cc
// LOG(WARNING) << "Local rendezvous is aborting with status: " << status;
and recompiled TF from source.
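A coarser alternative that avoids rebuilding TensorFlow, sketched under the assumption that hiding all C++-side warnings is acceptable: raising `TF_CPP_MIN_LOG_LEVEL` filters this message out, but it also hides every other INFO/WARNING log.

```python
import os
# "2" filters out C++-side INFO and WARNING logs (including this one);
# it must be set before `import tensorflow`.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import tensorflow as tf
```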
Another example that reproduces this issue with Python 3.12.2 and TensorFlow 2.16.1 is the fourth installment of the introductory video series "TensorFlow ML Zero to Hero". The fourth part uses this notebook. When training, every second epoch fails with the issue reported here.
When I switch to TensorFlow 2.15.1, I also have to downgrade to Python 3.11.8, which is something I'd like to avoid. Ideally, TensorFlow 2.15.1 should be made available for the most recent stable release of Python, at least until a newer stable version of TensorFlow becomes available. The combination of Python 3.11.8 and TensorFlow 2.15.1 works for the given notebook.
Here is the link to the notebook I mentioned. It runs fine online but not locally when using Python 3.12.2 and TensorFlow 2.16.1.
This link is also accessible from the description of the video at https://www.youtube.com/watch?v=u2TjZzNuly8
I hope having another example to reproduce the problem helps with resolving this issue. Keep up the good work!
@google-admin @google Please just fire all these "issue triagers". They are a waste of our time and a waste of your money. All they do is blindly copy-paste the code into Colab, botch it nine times out of ten, and then tell you that you're wrong. They are a disgrace to our intellect.
Same issue on a larger training project.
I don't know if it helps or whether it is related, but one of the recent additions to the code was the use of strategies and scopes: `strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0')` together with `with strategy.scope():`.
Hi all, this didn't make its way to us (the tf.data team) until just now, when an internal user flagged it. This should be fixed with https://github.com/tensorflow/tensorflow/commit/4924ec6c0b68ba3fb8f73a6383881cd4194ed802.
The error even shows up in the official TF tutorial, so hopefully it will be fixed soon. https://www.tensorflow.org/tutorials/quickstart/advanced
> Same issue on a larger training project. I don't know if it helps or whether it is related, but one of the recent additions to the code was the use of strategies and scopes: `strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0')` together with `with strategy.scope():`.
In my case, after double-checking, the distribution strategy is not related to this problem.
Similar error. Fixed it by removing the `steps_per_epoch` argument from `model.fit()` and `model.evaluate()` (a before/after sketch follows the code below):
```python
import sys
from matplotlib import pyplot
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Dense
from keras.layers import Flatten
from keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import numpy as np

# Allow GPU memory to grow instead of pre-allocating it all.
physical_devices = tf.config.list_physical_devices('GPU')
try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    pass

def define_model():
    # Small CNN for binary classification (cats vs. dogs).
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform',
                     padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='sigmoid'))
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model

datagen = ImageDataGenerator(rescale=1.0 / 255.0)
model = define_model()

train_it = datagen.flow_from_directory('/workspace/workspace/cats_and_dogs_data/dogs-vs-cats/train/',
                                       class_mode='binary', batch_size=64, target_size=(200, 200))
test_it = datagen.flow_from_directory('/workspace/workspace/cats_and_dogs_data/dogs-vs-cats/test1/',
                                      class_mode='binary', batch_size=64, target_size=(200, 200))

history = model.fit(train_it, validation_data=test_it, epochs=20, verbose=1)

_, acc = model.evaluate(test_it, verbose=1)
print('> %.3f' % (acc * 100.0))
```
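A hedged before/after sketch of the change described above; the commented-out "before" call and its step counts are hypothetical, since the failing variant was not posted.

```python
# Before (hypothetical): explicit step counts larger than what the finite,
# non-repeating generators can supply exhaust the iterators mid-epoch and
# trigger the OUT_OF_RANGE path.
# history = model.fit(train_it, validation_data=test_it, epochs=20,
#                     steps_per_epoch=400, validation_steps=200, verbose=1)
# _, acc = model.evaluate(test_it, steps=250, verbose=1)

# After: drop the step arguments and let Keras derive them from the generators.
history = model.fit(train_it, validation_data=test_it, epochs=20, verbose=1)
_, acc = model.evaluate(test_it, verbose=1)
```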
I can reproduce the warning on Python 3.12 and TF 2.16. In addition, when my (custom) dataset has this 'issue', I also get messages when calling `model.evaluate(ds)`. That just doesn't look like something that is safe to ignore. Example:
919/Unknown 1s 2ms/step - loss: 1.05462024-05-23 10:15:02.410200: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
[[{{node IteratorGetNext}}]]
/usr/lib/python3.12/contextlib.py:158: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
self.gen.throw(value)
927/927 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 1.0546
PS: I DO have enough data.
I made this disappear by simply using `.repeat()` and not using `.cache()` on my training and validation data batches. My script has a very generic input pipeline based on the TensorFlow semantic segmentation tutorial, with TF 2.16 and Python 3.10.
Why would `cache()` be related to the issue? Did you see a difference with cache `True` or `False`? See https://stackoverflow.com/a/78583999; it is likely a combination of `.repeat()` and setting the steps per epoch correctly.
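A minimal sketch of that combination, with a toy dataset and model standing in for the real pipeline (all names here are illustrative, not taken from the original reports):

```python
import tensorflow as tf

batch_size = 32

def to_example(i):
    # Toy feature/label pair so the pipeline has the usual (x, y) structure.
    x = tf.reshape(tf.cast(i, tf.float32), [1])
    y = tf.reshape(tf.cast(i % 2, tf.float32), [1])
    return x, y

train_ds = tf.data.Dataset.range(1000).map(to_example).batch(batch_size)
steps_per_epoch = int(train_ds.cardinality())  # batches in one full pass

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Option 1: finite dataset, no steps_per_epoch; Keras infers the epoch length.
model.fit(train_ds, epochs=2, verbose=0)

# Option 2: infinite dataset via .repeat(), with steps_per_epoch marking where
# each epoch ends, so the input never runs out mid-epoch.
model.fit(train_ds.repeat(), epochs=2, steps_per_epoch=steps_per_epoch, verbose=0)
```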
> I made this disappear by simply using `.repeat()` and not using `.cache()` on my training and validation data batches. [...]
>
> Why would `cache()` be related to the issue? [...] likely a combination of `.repeat()` and setting the steps per epoch correctly.
In my case I am using neither `.cache()` nor `.repeat()`, and I still see this error.
Same here; using `repeat()` I still see this error.
> I can reproduce the warning on Python 3.12 and TF 2.16. In addition, when my (custom) dataset has this 'issue', I also get messages when calling `model.evaluate(ds)`. That just doesn't look like something that is safe to ignore. [...] PS: I DO have enough data.
@rytis-paskauskas That warning is not related to TensorFlow, but to Keras. During training and evaluation your code is wrapped in a `with` statement (see here and here). From now on I will consider only training; the same happens for evaluation. At the first epoch, if your datasets (train, valid and test) are `tf.data.Dataset`s, Keras can't establish how many steps (batches) are required to complete it. But don't worry: Keras "counts" the batches during the first epoch and uses that count for the next epochs.
In particular, `enumerate_epoch` uses `num_batches` to know how many batches there are in an epoch. But at the first epoch this property returns `None` because `dataset.cardinality < 0`. Indeed, at the beginning your output is `919/Unknown`, then `927/927`. This is possible because the ProgBar callback is updated in `on_train_batch_end`.
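A small sketch of the cardinality behaviour described above (the datasets here are illustrative): a generator-backed dataset reports `tf.data.UNKNOWN_CARDINALITY`, which is why the first epoch shows `.../Unknown` until Keras has counted the batches.

```python
import tensorflow as tf

# A plain in-memory pipeline has a known cardinality, so Keras can show
# "10/10" from the very first epoch.
known = tf.data.Dataset.range(100).batch(10)
print(int(known.cardinality()))    # 10

# A generator-backed pipeline has unknown cardinality (-2, i.e.
# tf.data.UNKNOWN_CARDINALITY), so the first epoch shows "Unknown".
unknown = tf.data.Dataset.from_generator(
    lambda: (i for i in range(100)),
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int64),
).batch(10)
print(int(unknown.cardinality()))  # -2
```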
@arianmaghsoudnia In that comment I was referring only to this warning:
Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
and not to
W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
The first warning comes from Keras and, as I explained above, can be ignored, while the latter comes from TensorFlow.
@miticollo You're correct about distinguishing between the logs from TensorFlow and Keras. However, the Keras warning appears because there's an underlying issue on the TensorFlow side. The core problem is that the `OUT_OF_RANGE` exception should not have been triggered in the first place, as it wasn't in earlier versions of TensorFlow. The changes that closed this issue merely treat the symptoms by downgrading the warning to an info log in TensorFlow. Unfortunately, they don't resolve the root cause of the problem.
In the end, I understand this is a bit out of scope for this issue. I hope that this related open issue will get attention.
I am getting this issue on CPU with Python 3.11 and TensorFlow 2.16.1:
Epoch 1/5
1/9 ━━━━━━━━━━━━━━━━━━━━ 1:54 14s/step - loss: 3.76182024-10-27 17:32:14.388601: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INVALID_ARGUMENT: indices[0,49] = 43 is not in [0, 43)
[[{{function_node __inference_one_step_on_data_18465}}{{node functional_1/embedding_1/GatherV2}}]]
Traceback (most recent call last):
...
...
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Training crashes during epoch 1.
Edit: I had an index mismatch in my training labels. It is all good now.
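For anyone hitting the same `INVALID_ARGUMENT` rather than the `OUT_OF_RANGE` warning, a hypothetical sanity check along these lines (the file name and `vocab_size` are assumptions, not from the report) catches the mismatch before training:

```python
import numpy as np

vocab_size = 43                   # assumed Embedding(input_dim=vocab_size, ...)
x_train = np.load("x_train.npy")  # assumed integer id matrix fed to the model

# GatherV2 fails exactly as in the log above when any id falls outside
# [0, vocab_size), so flag those ids up front.
out_of_range = x_train[(x_train < 0) | (x_train >= vocab_size)]
if out_of_range.size:
    raise ValueError(
        f"{out_of_range.size} ids fall outside [0, {vocab_size}); "
        "increase input_dim or remap the offending ids"
    )
```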
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
tf 2.16
Custom code
Yes
OS platform and distribution
Linux Ubuntu 22.04
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
There is a warning that appears after the last iteration over a dataset: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
This warning was introduced by this commit: https://github.com/tensorflow/tensorflow/commit/04fb826f98b92dd172ad665d8a5522a2f8201867 I believe that simply iterating over a dataset shouldn't cause such behavior.
Standalone code to reproduce the issue
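A minimal reproducer, reconstructed from the interpreter session shown earlier in this thread:

```python
import tensorflow as tf  # observed on TF 2.16 and 2.17 nightlies

range_ds = tf.data.Dataset.range(10)
for d in range_ds:
    print(d)

# After the last element, affected builds log:
# W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is
#   aborting with status: OUT_OF_RANGE: End of sequence
```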
Relevant log output