keras-team / keras-io

Keras documentation, hosted live at keras.io
Apache License 2.0

batch out of range & loss value becomes 'nan' when running monocular depth estimation #1832

Open Gacha76 opened 7 months ago

Gacha76 commented 7 months ago

Issue Type

Bug

Source

source

Keras Version

Keras 2.10

Custom Code

No

OS Platform and Distribution

Windows 11

Python version

3.10.13

GPU model and memory

RTX 3050 6GB

Current Behavior?

When calling the .fit() function to train the model, the 1st epoch runs as expected and stops when all batches have been iterated.

The problem starts from the 2nd epoch onwards: the batch counter runs past the given range and the loss values become nan. Once the epoch completes, the UI returns to normal, but the same behavior recurs in the 3rd epoch and so on.

Screenshot 2024-04-13 111550

All the tutorials on YouTube that run the same Colab notebook provided by the Keras team seem to run without issues, and the model trains properly. This isn't the case when I run the notebook locally or on Colab, on either CPU or GPU.

Standalone code to reproduce the issue or tutorial link

https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/depth_estimation.ipynb

Relevant log output

No response

sachinprasadhs commented 7 months ago

You seem to be using an older Keras version on your local system. Keras 3, with multi-backend support, is available; you can upgrade the Keras package and try again.

pip install -U keras
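After upgrading, it can help to confirm which versions are actually installed in the environment that runs the notebook. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and checking `tensorflow` alongside `keras` is an assumption about what is relevant here.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names below are the PyPI distribution names.
    for name in ("keras", "tensorflow"):
        print(name, installed_version(name) or "not installed")
```

Running this in the same kernel as the notebook shows whether the upgrade took effect there, which may differ from the system-wide Python.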

Gacha76 commented 7 months ago

The screenshot below was taken with Keras 3. The batch counter no longer goes out of range, but the loss values still become nan.

Screenshot 2024-04-16 164217

sachinprasadhs commented 7 months ago

The tutorial you are referring to has not been migrated to Keras 3 yet, possibly due to some dependency on TensorFlow or Keras 2 APIs.

I was able to run the tutorial successfully for 1 epoch with TensorFlow 2.15, which uses Keras 2.15 in its backend. Attaching the working Gist here for reference: https://colab.sandbox.google.com/gist/sachinprasadhs/5aead85438db273c01e72ec257d6c09e/depth_estimation.ipynb

Gacha76 commented 7 months ago

It works for 1 epoch for me as well, in both Keras 2 and Keras 3. The issue arises when I train the model for more than 1 epoch, which results in the behavior described above. Also, since the loss values become undefined, the neural network outputs nothing but a black screen, as shown here.

Screenshot 2024-04-17 090156
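To narrow down when the loss first becomes undefined, one option is to scan the recorded loss values for the first non-finite entry. This is only a diagnostic sketch, not a fix: `first_non_finite` is a hypothetical helper, and the sample list is made up for illustration. In Keras, the per-epoch values would typically come from `history.history["loss"]`, where `history` is the object returned by `.fit()`.

```python
import math
from typing import Optional, Sequence


def first_non_finite(values: Sequence[float]) -> Optional[int]:
    """Return the 0-based index of the first nan or inf value, or None if all are finite."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index
    return None


# Made-up loss values for illustration only; not taken from the issue.
losses = [0.91, 0.74, float("nan"), float("nan")]
print(first_non_finite(losses))  # prints 2
```

Knowing the first affected epoch (or batch, if per-batch losses are logged) makes it easier to inspect the inputs and predictions at that point.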

sachinprasadhs commented 7 months ago

The published tutorial shows output for a larger number of epochs. Since the tutorial has not yet been migrated to Keras 3, we can look into this once it has been. The Keras team doesn't have enough bandwidth to migrate tutorials; community contributions are welcome.