Open tanzhenyu opened 2 years ago
Have you checked https://github.com/tensorflow/tensorflow/issues/48845?
This looks like a very likely root cause. @tanzhenyu can you confirm? If so, we should probably close this issue as it's a duplicate / downstream effect of the TF bug.
The original issue is #34062, referenced in the description.
It doesn't look like the root cause has been identified yet? The issue is very closely related, though, with a couple of subtle differences:
Closing it is fine with me; I'd hope we've provided more context to the issue for debugging.
@tanzhenyu Please check if you can reproduce with `drop_remainder=True` in `tf.data.Dataset` batch formation.
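For reference, a minimal sketch of what setting `drop_remainder=True` at batch formation looks like (the dataset and batch size below are illustrative, not the original pipeline):

```python
import tensorflow as tf

# Illustrative only: with drop_remainder=True every batch has exactly
# `batch_size` elements; the trailing partial batch at the end of an epoch
# is dropped instead of being emitted with a smaller size.
dataset = tf.data.Dataset.range(10).batch(4, drop_remainder=True)

for batch in dataset:
    print(batch.shape)  # always (4,); the last 2 elements are dropped
```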
@bhack I have confirmed with both models (DeepLabV3 & DeepLabV3Plus) that `drop_remainder=True` can resolve this issue. What does that point to?
It was two months ago, so I don't remember exactly what analysis I did.
Can you debug/print the specific batch size near or on the NaN step (of course without introducing `drop_remainder=True`)?
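A minimal sketch of what such a check could look like (the dataset, layer name, and callback below are hypothetical, purely to illustrate the debugging step):

```python
import numpy as np
import tensorflow as tf

# 1) Print the batch sizes the input pipeline actually produces; with the
#    default drop_remainder=False, the final batch of an epoch can be smaller.
def print_batch_sizes(dataset):
    for step, batch in enumerate(dataset):
        images = batch[0] if isinstance(batch, (tuple, list)) else batch
        print(f"step {step}: batch size {images.shape[0]}")

# 2) Report the first training step at which the suspect BatchNorm's
#    moving_variance stops being finite.
class MovingVarianceWatcher(tf.keras.callbacks.Callback):
    def __init__(self, bn_layer_name):
        super().__init__()
        self.bn_layer_name = bn_layer_name

    def on_train_batch_end(self, batch, logs=None):
        bn = self.model.get_layer(self.bn_layer_name)
        if not np.all(np.isfinite(bn.moving_variance.numpy())):
            print(f"moving_variance became non-finite after train batch {batch}")
```

The callback would be attached via `model.fit(..., callbacks=[MovingVarianceWatcher("name_of_the_suspect_bn")])`.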
System information.
Describe the problem.
While training works fine, (ONLY) one of the BatchNorm layers gives NaN output during inference. But everything works fine if we set `BatchNormalization(fused=False)`.
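For clarity, here is the workaround in code form, together with the SyncBatchNorm alternative described further down; a minimal sketch, and the exact API of the synchronized variant depends on the TF version:

```python
import tensorflow as tf

# Workaround 1 from this report: force the non-fused BatchNorm kernel.
bn = tf.keras.layers.BatchNormalization(fused=False)

# Workaround 2 from this report: the experimental synchronized variant
# (newer TF releases expose this as BatchNormalization(synchronized=True)).
sync_bn = tf.keras.layers.experimental.SyncBatchNormalization()
```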
Describe the current behavior.
While adding DeepLabV3 to KerasCV, one of the weirdest things came up: the model trains fine, but during the 1st evaluation the validation loss goes to NaN. However, I can still keep training the model for more epochs, given that the training loss is still legitimate. After some digging, I found out that the NaN (during evaluation or inference) is coming from the output of a BatchNorm layer here. Please be aware that there are 10+ BatchNorm layers in my model; this is the only one that gives NaN output (when `training=False`). This happens right at the end of the 1st epoch. I then tracked the variables of this layer and found that the `moving_variance` slowly increases from 1.0 to 3.0 during the 1st training epoch, but suddenly changes to NaN before switching to evaluation. I have gone through my input and checked that I am not feeding NaN into the model (a sketch of such an input check is at the end of this report). So I tried several other options: 1) using `fused=False`, given that I have seen others report a similar issue -- the NaN issue goes away; 2) using the experimental SyncBatchNorm -- the NaN issue goes away.

Describe the expected behavior.
The behavior for `fused=True` should be the same as with `fused=False`. No NaN issue should occur.

Contributing.
Standalone code to reproduce the issue.
Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.
Source code / logs.
https://github.com/keras-team/keras-cv/blob/aug/examples/training/semantic_segmentation/pascal_voc/deeplab_v3.py
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.
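For completeness, one way the input sanity check described above could be done (the function and dataset names are illustrative; this is a sketch, not the check that was actually used):

```python
import tensorflow as tf

def assert_pipeline_finite(dataset):
    """Illustrative input check: raise if any image or label tensor coming
    out of the pipeline contains NaN or Inf."""
    for step, (images, labels) in enumerate(dataset):
        tf.debugging.assert_all_finite(
            tf.cast(images, tf.float32), f"non-finite image values at step {step}")
        tf.debugging.assert_all_finite(
            tf.cast(labels, tf.float32), f"non-finite label values at step {step}")
```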