tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence warning when iterating over a dataset #62963

Closed p-s-p-s closed 7 months ago

p-s-p-s commented 9 months ago

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

tf 2.16

Custom code

Yes

OS platform and distribution

Linux Ubuntu 22.04

Mobile device

No response

Python version

3.10

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

A warning appears after the last iteration over a dataset:

W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

This warning was introduced by this commit: https://github.com/tensorflow/tensorflow/commit/04fb826f98b92dd172ad665d8a5522a2f8201867. I believe that simply iterating over a dataset shouldn't cause such behavior.

Standalone code to reproduce the issue

import tensorflow as tf

range_ds = tf.data.Dataset.range(10)

for d in range_ds:
   print(d)

Relevant log output

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
2024-02-15 08:27:36.782604: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
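
For context, a minimal sketch of how the end of sequence surfaces in Python (my reading of the behavior, not an official explanation): the runtime reports OUT_OF_RANGE, and eager iteration translates it into StopIteration, so the loop terminates cleanly while the C++ side still logs the warning.

import tensorflow as tf

# End of iteration is reported internally as an OUT_OF_RANGE status; the eager
# iterator turns it into StopIteration, which is why the for-loop above ends
# normally even though the C++ runtime logs the warning.
it = iter(tf.data.Dataset.range(3))
while True:
    try:
        print(next(it))
    except StopIteration:
        break
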
sushreebarsa commented 8 months ago

@p-s-p-s I tried to replicate the issue on colab and didn't face the error reported. Could you check this gist and let us know? Thank you!

p-s-p-s commented 8 months ago

@sushreebarsa tf 2.15 is not affected, but 2.16 and 2.17 are.

p-s-p-s commented 8 months ago

@sushreebarsa The reason you couldn't reproduce the error in Colab is that warnings are suppressed there by default. Could you please check this Colab: https://colab.research.google.com/drive/1JuQriKXe-aJBAbValQK-8BFGtktzw4IW?usp=sharing ?

sushreebarsa commented 8 months ago

@p-s-p-s TF v2.15 is the latest stable version, so the error does not appear there. We recommend using the stable TF version. Thank you!

p-s-p-s commented 8 months ago

@sushreebarsa I reported this issue so that it can be fixed before the 2.16 release. Moreover, tf 2.15 with https://github.com/tensorflow/tensorflow/commit/04fb826f98b92dd172ad665d8a5522a2f8201867 applied is also affected internally by this issue.

>>> import tensorflow as tf
2024-02-21 10:47:33.109381: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-21 10:47:33.109409: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-21 10:47:33.110044: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-21 10:47:33.113595: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> range_ds = tf.data.Dataset.range(10)
2024-02-21 10:47:54.344759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22462 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6
>>> 
>>> for d in range_ds:
...    print(d)
... 
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
2024-02-21 10:47:56.048435: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
>>> tf.__version__
'2.15.0'

I am not sure what causes the problem, but as a symptomatic workaround it is possible to suppress this particular warning like this:

  if (!absl::StrContains(status.message(), "End of sequence")) {
    LOG(WARNING) << "Local rendezvous is aborting with status: " << status;
  }
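
Another blunt workaround (just a sketch on my side, not a proper fix) is to raise the C++ log threshold before importing TensorFlow; this hides WARNING-level runtime messages without recompiling, but it also hides unrelated warnings:

import os
# 2 = drop INFO and WARNING messages from the TensorFlow C++ runtime.
# This must be set before importing tensorflow, otherwise it has no effect.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf

for d in tf.data.Dataset.range(10):
    print(d)
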
sushreebarsa commented 8 months ago

@sachinprasadhs I was able to replicate the issue reported here, please have a look. Thank you!

SomeUserName1 commented 8 months ago

Can confirm this issue with tf-nightly '2.17.0-dev20240210'

2024-02-26 04:03:26.379054: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence  
     [[{{node IteratorGetNext}}]]   
     [[IteratorGetNext/_4]]  
2024-02-26 04:03:26.379063: I tensorflow/core/framework/local_rendezvous.cc:422] Local rendezvous recv item cancelled. Key hash: 381694510697024129
2024-02-26 04:03:26.379073: I tensorflow/core/framework/local_rendezvous.cc:422] Local rendezvous recv item cancelled. Key hash: 6451170228096927380
pedro-curto commented 8 months ago

Hello, I'd like to look into this issue and try to fix it, if that is possible.

obriensystems commented 8 months ago

Issue running the default TensorFlow training job after a docker rebuild, only on an RTX A4500:

from tensorflow/tensorflow:latest-gpu

[+] Building 137.8s (9/9) FINISHED                                                                           docker:default
 => [internal] load build definition from Dockerfile                                                                   0.0s
 => => transferring dockerfile: 285B                                                                                   0.0s
 => [internal] load .dockerignore                                                                                      0.0s
 => => transferring context: 2B                                                                                        0.0s
 => [internal] load metadata for docker.io/tensorflow/tensorflow:latest-gpu                                            1.2s
 => [auth] tensorflow/tensorflow:pull token for registry-1.docker.io                                                   0.0s
 => [1/3] FROM docker.io/tensorflow/tensorflow:latest-gpu@sha256:4ab9ffddd6ffacc9251ac6439f431eb38d66200d3f52397b5d  135.7s

         [[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 8s 316ms/step - accuracy: 0.1601 - loss: 8.1802
Epoch 3/100
24/25 ━━━━━━━━━━━━━━━━━━━━ 0s 310ms/step - accuracy: 0.2835 - loss: 7.41432024-03-10 04:24:44.885256: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:44.885302: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
2024-03-10 04:24:44.896965: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:44.897036: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 8s 307ms/step - accuracy: 0.2795 - loss: 7.2638

https://github.com/ObrienlabsDev/machine-learning/issues/16

salaki commented 8 months ago

My understanding is that the error will not affect execution, but the iterator is not usable after the error? https://stackoverflow.com/questions/53930242/how-to-fix-a-outofrangeerror-end-of-sequence-error-when-training-a-cnn-with-t

p-s-p-s commented 8 months ago

@salaki I didn't observe any negative impact, except that it is quite annoying to get this warning every time you iterate over a dataset. As a temporary fix for 2.16.1 I just commented out this line in tensorflow/core/framework/local_rendezvous.cc (// LOG(WARNING) << "Local rendezvous is aborting with status: " << status;) and recompiled TF from source.
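
To be concrete about the iterator question above (a quick sketch from my own check, not from the TF docs): iterating the Dataset object again creates a fresh iterator each time, so only an explicitly created iterator stays exhausted after the end of sequence.

import tensorflow as tf

ds = tf.data.Dataset.range(3)

# Re-iterating the Dataset builds a new iterator on each pass, so the
# OUT_OF_RANGE at the end of one pass does not make the dataset itself unusable.
for _ in range(2):
    print([int(x) for x in ds])   # [0, 1, 2] both times

# An explicit iterator, on the other hand, stays exhausted once consumed.
it = iter(ds)
list(it)
print(list(it))                   # []
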

ManfredLange commented 7 months ago

Another way to reproduce this issue with Python 3.12.2 and TensorFlow 2.16.1 is the fourth installment of the introductory video series "TensorFlow ML Zero to Hero". The fourth part uses the notebook linked below. When training, every second epoch falls over with the issue reported here.

When I switch to TensorFlow 2.15.1, I also have to downgrade to Python 3.11.8, which is something I'd like to avoid. Ideally, TensorFlow 2.15.1 should be made available for the most recent stable Python release, at least until a newer stable version of TensorFlow is released. The combination of Python 3.11.8 and TensorFlow 2.15.1 works for the given notebook.

Here is the link to the notebook I mentioned. It runs fine online but not locally with Python 3.12.2 and TensorFlow 2.16.1.

https://colab.research.google.com/github/lmoroney/dlaicourse/blob/master/Course%202%20-%20Part%208%20-%20Lesson%202%20-%20Notebook%20(RockPaperScissors).ipynb

This link is also accessible from the description of the video at https://www.youtube.com/watch?v=u2TjZzNuly8

I hope having another example to reproduce the problem helps with resolving this issue. Keep up the good work!

mcourteaux commented 7 months ago

@google-admin @google Just please fire all these "issue triagers". They are a waste of our time and a waste of your money. All they do is copy-paste the code into Colab with blindfolds on, fuck it up with a 90% chance, and tell you you're wrong. They are a disgrace to our intellect.

AmmarkoV commented 7 months ago

Same issue on a larger training project (screenshot: screen215).

I don't know if it helps or if it is related, but I think one of the recent additions to the code was the use of strategies and scopes: strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0') with strategy.scope():
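
Roughly, the setup looks like this (a sketch with a placeholder model, and it assumes a GPU is available); per a later comment in this thread, this strategy/scope pattern by itself does not appear to cause the warning:

import tensorflow as tf

strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0')
with strategy.scope():
    # Placeholder model; the real project builds something larger here.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer='adam', loss='mse')
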

mpcallanan commented 7 months ago

Hi all, this didn't make it our (tf.data team's) way until just now, when an internal user flagged it. This should be fixed with https://github.com/tensorflow/tensorflow/commit/4924ec6c0b68ba3fb8f73a6383881cd4194ed802.

bast0320 commented 7 months ago

The error even appears on the official TF website, so hopefully it will be fixed soon. https://www.tensorflow.org/tutorials/quickstart/advanced

latexalpha commented 6 months ago

Same issue on a larger training project (screenshot: screen215).

I don't know if it helps or if it is related, but I think one of the recent additions to the code was the use of strategies and scopes: strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0') with strategy.scope():

In my case, after double-checking, the distribution strategy is not related to this problem.

Bchi1994 commented 6 months ago

Similar error. I fixed it by removing the steps_per_epoch argument from model.fit() and model.evaluate():

import sys
from matplotlib import pyplot
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Dense
from keras.layers import Flatten
from keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import numpy as np

physical_devices = tf.config.list_physical_devices('GPU')
try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    # Invalid device or cannot modify virtual devices once initialized.
    pass

# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model

# create data generator
datagen = ImageDataGenerator(rescale=1.0/255.0)
model = define_model()

# prepare iterators
train_it = datagen.flow_from_directory('/workspace/workspace/cats_and_dogs_data/dogs-vs-cats/train/', class_mode='binary', batch_size=64, target_size=(200, 200))
test_it = datagen.flow_from_directory('/workspace/workspace/cats_and_dogs_data/dogs-vs-cats/test1/', class_mode='binary', batch_size=64, target_size=(200, 200))

# fit model
history = model.fit(train_it, validation_data=test_it, epochs=20, verbose=1)

# evaluate model
_, acc = model.evaluate(test_it, verbose=1)
print('> %.3f' % (acc * 100.0))

rytis-paskauskas commented 5 months ago

I can reproduce the warning on Python 3.12 and TF 2.16. In addition, when my (custom) dataset has this 'issue', I also get messages when calling model.evaluate(ds). That just doesn't look like it's safe to ignore. Example:

    919/Unknown 1s 2ms/step - loss: 1.05462024-05-23 10:15:02.410200: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
     [[{{node IteratorGetNext}}]]
/usr/lib/python3.12/contextlib.py:158: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
  self.gen.throw(value)
927/927 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 1.0546

PS: I DO have enough data.

flacle commented 4 months ago

I made this disappear by simply using .repeat() and not using .cache() on my training and validation data batches.

My script has a very generic input pipeline based on the tensorflow semantic segmentation tutorial with tf 2.16 and python 3.10

Why would cache() be related to the issue? Did you see a difference with cache True or False? See here https://stackoverflow.com/a/78583999, likely a combination of .repeat() and setting the steps per epoch correctly.
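
For reference, the combination I mean looks roughly like this (a sketch; the data, model and sizes are placeholders, not my actual pipeline):

import numpy as np
import tensorflow as tf

x = np.random.rand(100, 8).astype('float32')
y = np.random.rand(100, 1).astype('float32')
batch_size = 16

# repeat() keeps the iterator from hitting end-of-sequence mid-epoch...
train_ds = (tf.data.Dataset.from_tensor_slices((x, y))
            .shuffle(100)
            .batch(batch_size)
            .repeat())

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# ...as long as steps_per_epoch matches the number of batches per pass.
model.fit(train_ds, epochs=2, steps_per_epoch=len(x) // batch_size)
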

andremfreitas commented 4 months ago

I made this disappear by simply using .repeat() and not using .cache() on my training and validation data batches. My script has a very generic input pipeline based on the tensorflow semantic segmentation tutorial with tf 2.16 and python 3.10

Why would cache() be related to the issue? Did you see a difference with cache True or False? See here https://stackoverflow.com/a/78583999, likely a combination of .repeat() and setting the steps per epoch correctly.

In my case I am not using .cache() or .repeat(), and I still see this error.

luvwinnie commented 4 months ago

Same here. Using repeat(), I still see this error.

miticollo commented 3 months ago

I can reproduce the warning on Python 3.12 and TF 2.16. In addition, when my (custom) dataset has this 'issue', I also get messages when calling model.evaluate(ds). That just doesn't look like it's safe to ignore. Example:

    919/Unknown 1s 2ms/step - loss: 1.05462024-05-23 10:15:02.410200: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
   [[{{node IteratorGetNext}}]]
/usr/lib/python3.12/contextlib.py:158: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
  self.gen.throw(value)
927/927 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 1.0546

PS: I DO have enough data.

@rytis-paskauskas That warning is not related to TensorFlow but to Keras. During training and evaluation your code is wrapped in a with statement (see here and here). From now on I will consider only training; the same happens for evaluation. In the first epoch, if your datasets (train, valid and test) are tf.data.Dataset, Keras can't establish how many steps (batches) are required to complete it. But don't worry: Keras "counts" the batches during the first epoch and uses that count for the next epochs.

In particular, enumerate_epoch uses num_batches to know how many batches there are in an epoch. But in the first epoch this property returns None, because dataset.cardinality < 0. Indeed, at the beginning your output is 919/Unknown, then 927/927. This is possible because the ProgBar callback is updated on_train_batch_end.
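
A quick illustration of the cardinality point (my own example, not taken from the Keras internals):

import tensorflow as tf

known = tf.data.Dataset.range(100).batch(8)
print(int(known.cardinality()))   # 13, so the progress bar can show x/13 immediately

# filter() (like many generator-backed pipelines) makes the size unknown, so
# cardinality is negative and Keras shows "Unknown" until one full pass is done.
unknown = tf.data.Dataset.range(100).filter(lambda x: x % 2 == 0).batch(8)
print(int(unknown.cardinality()) == tf.data.UNKNOWN_CARDINALITY)  # True
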

arianmaghsoudnia commented 3 weeks ago

That warning is not related to TensorFlow, but Keras.

@miticollo I don't think that's the case. Please take a look at the same catch_stop_iteration function code in older Keras versions, even in Keras 2. It is almost the same, with only attribute name changes and no functional modifications.

miticollo commented 3 weeks ago

@arianmaghsoudnia In that comment, I was referring only to this warning:

Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.

and not to

W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

The first warning comes from Keras and, as I explained above, it can be ignored, while the latter warning comes from TensorFlow.

arianmaghsoudnia commented 3 weeks ago

@miticollo You're correct about distinguishing between the logs from TensorFlow and Keras. However, the Keras warning appears because there's an underlying issue on the TensorFlow side. The core problem is that the OUT_OF_RANGE exception should not have been triggered in the first place, as it wasn't in earlier versions of TensorFlow. The changes that closed this issue merely treat the symptom by downgrading the warning to an info log in TensorFlow; unfortunately, they don't resolve the root cause. I understand this is a bit outside the scope of this issue, but I hope that this related open issue will get attention.

tashrifbillah commented 2 weeks ago

I am getting this issue on CPU with Python 3.11 and TensorFlow 2.16.1:

Epoch 1/5
1/9 ━━━━━━━━━━━━━━━━━━━━ 1:54 14s/step - loss: 3.76182024-10-27 17:32:14.388601: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INVALID_ARGUMENT: indices[0,49] = 43 is not in [0, 43)
         [[{{function_node __inference_one_step_on_data_18465}}{{node functional_1/embedding_1/GatherV2}}]]
Traceback (most recent call last):
...
...
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Training crashes during epoch 1.


Edit: I had an index mismatch in my training labels. It is all good now.
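
For anyone who lands here with the same INVALID_ARGUMENT variant: it usually means an input id is >= the Embedding layer's input_dim, roughly like this (made-up sizes that mirror the log above):

import numpy as np
import tensorflow as tf

vocab_size = 43
emb = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8)

print(emb(np.array([[0, 42]])).shape)   # ids in [0, 43) are fine: (1, 2, 8)
# emb(np.array([[43]]))                 # id 43 is out of range; on CPU this raises
#                                       # the GatherV2 INVALID_ARGUMENT error above
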