Open tanzhenyu opened 2 years ago
Have you checked https://github.com/tensorflow/tensorflow/issues/48845?
This looks like a very likely root cause. @tanzhenyu can you confirm? If so, we should probably close this issue as it's a duplicate / downstream effect of the TF bug.
The original issue is #34062, referenced in the description.
It doesn't look like the root cause has been identified yet? The issue is very closely related, though, with a couple of subtle differences:
Closing it is fine with me; I'd hope we've provided more context to the issue for debugging.
@tanzhenyu Please check if you can reproduce with `drop_remainder=True` in `tf.data.Dataset` batch formation.
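For reference, a minimal sketch of what setting `drop_remainder=True` at batch formation looks like (the dataset and batch size below are illustrative, not the original pipeline):

```python
import tensorflow as tf

# Illustrative only: with drop_remainder=True every batch has exactly
# `batch_size` elements; the trailing partial batch at the end of an epoch
# is dropped instead of being emitted with a smaller size.
dataset = tf.data.Dataset.range(10).batch(4, drop_remainder=True)

for batch in dataset:
    print(batch.shape)  # always (4,); the last 2 elements are dropped
```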
@bhack I have confirmed with both models (DeepLabV3 & DeepLabV3Plus) that `drop_remainder=True` can resolve this issue. What does that point to?
It was two months ago, so I don't remember exactly what analysis I did.
Can you debug/print the specific batch size near or on the NaN step (of course without introducing `drop_remainder=True`)?
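A minimal sketch of what such a check could look like (the dataset, layer name, and callback below are hypothetical, purely to illustrate the debugging step):

```python
import numpy as np
import tensorflow as tf

# 1) Print the batch sizes the input pipeline actually produces; with the
#    default drop_remainder=False, the final batch of an epoch can be smaller.
def print_batch_sizes(dataset):
    for step, batch in enumerate(dataset):
        images = batch[0] if isinstance(batch, (tuple, list)) else batch
        print(f"step {step}: batch size {images.shape[0]}")

# 2) Report the first training step at which the suspect BatchNorm's
#    moving_variance stops being finite.
class MovingVarianceWatcher(tf.keras.callbacks.Callback):
    def __init__(self, bn_layer_name):
        super().__init__()
        self.bn_layer_name = bn_layer_name

    def on_train_batch_end(self, batch, logs=None):
        bn = self.model.get_layer(self.bn_layer_name)
        if not np.all(np.isfinite(bn.moving_variance.numpy())):
            print(f"moving_variance became non-finite after train batch {batch}")
```

The callback would be attached via `model.fit(..., callbacks=[MovingVarianceWatcher("name_of_the_suspect_bn")])`.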
System information.
Describe the problem.
While training works fine, (ONLY) one of the BatchNorm layers gives NaN output during inference. But everything works fine if we set `BatchNormalization(fused=False)`.
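For clarity, here is the workaround in code form, together with the SyncBatchNorm alternative described further down; a minimal sketch, and the exact API of the synchronized variant depends on the TF version:

```python
import tensorflow as tf

# Workaround 1 from this report: force the non-fused BatchNorm kernel.
bn = tf.keras.layers.BatchNormalization(fused=False)

# Workaround 2 from this report: the experimental synchronized variant
# (newer TF releases expose this as BatchNormalization(synchronized=True)).
sync_bn = tf.keras.layers.experimental.SyncBatchNormalization()
```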
Describe the current behavior.
While adding DeepLabV3 to KerasCV, one of the weirdest things came up: the model trains fine, but during the 1st evaluation the validation loss goes to NaN. However, I can still keep training the model for more epochs, given that the training loss is still legitimate. After some digging, I found out that the NaN (during evaluation or inference) is coming from the output of a BatchNorm layer here. Please be aware that there are 10+ BatchNorm layers in my model; this is the only one that gives NaN output (when `training=False`). This happens right at the end of the 1st epoch. I then tracked the variables of this layer and found that the `moving_variance` slowly increases from 1.0 to 3.0 during the 1st training epoch, but suddenly changes to NaN before switching to evaluation. I have gone through my input and checked that I am not feeding NaN into the model (a sketch of such an input check is at the end of this report). So I tried several other options: 1) using `fused=False`, given that I have seen others report a similar issue -- the NaN issue goes away; 2) using the experimental SyncBatchNorm -- the NaN issue goes away.

Describe the expected behavior.
The behavior for `fused=True` should be the same as with `fused=False`. No NaN issue should occur.

Contributing.
Standalone code to reproduce the issue.
Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.
Source code / logs.
https://github.com/keras-team/keras-cv/blob/aug/examples/training/semantic_segmentation/pascal_voc/deeplab_v3.py
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.
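For completeness, one way the input sanity check described above could be done (the function and dataset names are illustrative; this is a sketch, not the check that was actually used):

```python
import tensorflow as tf

def assert_pipeline_finite(dataset):
    """Illustrative input check: raise if any image or label tensor coming
    out of the pipeline contains NaN or Inf."""
    for step, (images, labels) in enumerate(dataset):
        tf.debugging.assert_all_finite(
            tf.cast(images, tf.float32), f"non-finite image values at step {step}")
        tf.debugging.assert_all_finite(
            tf.cast(labels, tf.float32), f"non-finite label values at step {step}")
```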