keras-team / keras

Training discrepancy between Keras 2 and 3 #20507

Open matemijolovic opened 2 hours ago

matemijolovic commented 2 hours ago

Background

I was investigating why some of our relatively simple Keras models (mostly EfficientNet-like) fail to converge after being upgraded from Keras 2 to Keras 3. Minor tweaks such as lowering the learning rate made them converge, but I wanted to find the underlying cause, since I found no documentation or release notes that would explain the training discrepancy.

Minimal reproducible example

I set up two environments in Google Colab and trained exactly the same model in each, with all random generators seeded. What I observed: the training results diverge between the two environments once I add a BatchNormalization layer.

Keras 2 / TF 2.14.1 notebook: https://colab.research.google.com/drive/1f7q-VcW7ugRPNxbCLkuE-q0O1WUJ8q-R?usp=sharing

Keras 3 / TF 2.17.0 notebook: https://colab.research.google.com/drive/1ONPJ_WXM6WQoJ8ze9bJJ9tjS94aNJ4KB?usp=sharing
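
For anyone who wants to quantify the discrepancy rather than eyeball it, here is a minimal sketch of one way to compare the per-epoch losses from the two environments. It assumes the losses have been copied out of each notebook as plain lists (for example from the `history.history["loss"]` value returned by `model.fit`). The helper name `first_divergence`, the tolerances, and the sample numbers are all hypothetical and are not taken from the notebooks.

```python
import math


def first_divergence(losses_a, losses_b, rel_tol=1e-3, abs_tol=1e-6):
    """Return the index of the first epoch whose losses differ beyond
    the given tolerances, or None if the two histories agree throughout."""
    if len(losses_a) != len(losses_b):
        raise ValueError("loss histories must have the same length")
    for epoch, (a, b) in enumerate(zip(losses_a, losses_b)):
        # math.isclose treats the values as equal if either tolerance is met.
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return epoch
    return None


# Hypothetical numbers for illustration only, not results from the notebooks.
keras2_losses = [0.90, 0.61, 0.45, 0.33]
keras3_losses = [0.90, 0.61, 0.52, 0.48]

print(first_divergence(keras2_losses, keras3_losses))
```

Comparing from the first epoch onward helps separate a difference in initial weights (divergence at epoch 0) from a difference that only appears once training updates accumulate.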

Please let me know if I can do some additional experiments to track down the issue.

matemijolovic commented 2 hours ago

Sorry, I noticed a bug in my comparison - will reopen if I manage to fix it.

matemijolovic commented 1 hour ago

I managed to narrow down the initial question a bit, so I'm reopening.