Closed · Liqi1003 closed this 1 week ago
Hi @Liqi1003,
The difference appears to be precision related, caused by XLA fusion. Please see the developer comment on the same issue for more details.
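For illustration, here is a small, hypothetical snippet of the kind of precision drift fusion can introduce; the function and inputs below are arbitrary and are not taken from your reproduce script:

```python
# Sketch only: illustrates how XLA fusion can change float32 results in the
# last few bits; the computation and inputs are made up for illustration.
import tensorflow as tf

x = tf.random.normal([1024, 1024], seed=0)

def f(a):
    # A reduction whose intermediate rounding can differ once ops are fused.
    return tf.reduce_sum(a * a + a)

eager_result = f(x)
xla_result = tf.function(f, jit_compile=True)(x)
print(float(eager_result), float(xla_result))  # may differ slightly
```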
Hi @SuryanarayanaY,
Thanks for the pointer!
To test whether the problem is caused by XLA, I tried to disable XLA by adding TF_XLA_FLAGS=--tf_xla_auto_jit=-1 before the command, as mentioned here. However, the log still indicates that XLA compilation is enabled, and the outputs are the same as the ones I posted above. Am I disabling XLA the wrong way?
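For reference, a minimal sketch of how the flag can be set from Python instead of the shell (assuming the variable must be in place before TensorFlow is imported; the rest of the script is a placeholder):

```python
# Minimal sketch, not the actual reproduce script. Assumption: TF_XLA_FLAGS is
# read when the TensorFlow runtime initializes, so it is set before the import.
import os
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=-1"

import tensorflow as tf

# A second knob that should keep the graph optimizer from auto-clustering.
tf.config.optimizer.set_jit(False)

# ... rest of the training script goes here ...
```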
Also, I was only able to reproduce this problem in TensorFlow 2.13. On Google Colab with TensorFlow 2.16.1, running the same code does not result in this inconsistency. Here is the colab.
I wonder if this is a bug that was silently fixed in later versions? Thanks!
Hi @Liqi1003,
It seems 2.16 has better precision than earlier versions. There may have been some internal changes that I am not aware of. Since this is resolved in the latest version, can we mark this issue as closed? Thanks!
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
binary
TensorFlow version
tf 2.13.0
Custom code
Yes
OS platform and distribution
Linux Ubuntu 20.04.5
Mobile device
No response
Python version
3.8.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
11.8/8.7
GPU model and memory
No response
Current behavior?
I am reporting an issue encountered during distributed training of a model across different types of devices using TensorFlow. I initially encountered the bug with multiple GPUs involved, but was able to reproduce it in a single-GPU case as well. The version of TensorFlow used is 2.13.0.
It is very likely an edge case: the inconsistency only occurs with the specific initial weights and inputs we provide. To make the difference more apparent given the limited amount of training data, we deliberately chose a relatively high learning rate (lr=10.0).
Before executing the code, place the model in the same directory as the reproduction script so that the model weights can be loaded. Loading the saved weights is important, as random initial weights do not reproduce this bug.
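For orientation, a rough, hypothetical sketch of the kind of setup described above (this is not the actual reproduce script; the model path, input shapes, and data below are placeholders):

```python
# Hypothetical sketch only: model path, shapes, and data are placeholders.
import numpy as np
import tensorflow as tf

# Load the saved model so the specific initial weights are used; random
# initial weights do not trigger the inconsistency.
model = tf.keras.models.load_model("model.h5")

# Deliberately high learning rate to amplify the divergence on little data.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=10.0),
              loss="mse")

x = np.random.rand(32, 10).astype(np.float32)
y = np.random.rand(32, 1).astype(np.float32)

model.fit(x, y, epochs=1, batch_size=8)
print(model.predict(x[:4]))
```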
Standalone code to reproduce the issue
Relevant log output