mbarbetti / pidgan

:package: GAN-based models to flash-simulate the LHCb PID detectors
GNU General Public License v3.0

PIDGAN shows different timing performance for different Keras versions #10

Open mbarbetti opened 3 months ago

mbarbetti commented 3 months ago

Using the latest version of PIDGAN (v0.2.0), we have noticed an unexpected behavior when running GAN trainings with Keras 2 or Keras 3. In particular, taking scripts/train_GAN_Rich.py as a reference, trained for 10 epochs on a dataset of 300,000 instances, we observe a degradation in training time of about 20% when moving from Keras 2 to Keras 3.

Test machine details: Intel(R) Xeon(R) Gold 6140M CPU @ 2.30GHz (no GPU card equipped)

Launched command:

python train_GAN_Rich.py -p pion -E 10 -C 300_000 -D 2016MU --test

Running on Keras 2.14.0:

[...]
Epoch 10/10
102/102 [==============================] - 4s 41ms/step - g_loss: 1.5768 - d_loss: 0.5838 - accuracy: 0.2408 - bce: 2.3640 - g_lr: 3.9292e-04 - d_lr: 4.9445e-04 - val_g_loss: 1.1388 - val_d_loss: 0.5699 - val_accuracy: 0.2435 - val_bce: 2.3272
[INFO] Model training completed in 0h 00min 46s

while running on Keras 3.3.3:

[...]
Epoch 10/10
102/102 ━━━━━━━━━━━━━━━━━━━━ 5s 50ms/step - accuracy: 0.2978 - bce: 1.7565 - d_loss: 0.5944 - g_loss: 1.5055 - g_lr: 3.9292e-04 - d_lr: 4.9445e-04 - val_accuracy: 0.3049 - val_bce: 1.7069 - val_d_loss: 0.6068 - val_g_loss: 0.9470
[INFO] Model training completed in 0h 00min 55s

going from 46 seconds of training on Keras 2 to 55 seconds on Keras 3 (+20% training time).
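For context, the wall-clock numbers above come from the script's own logging; a minimal sketch of how the same measurement can be reproduced around any fit() call is shown below (data, layer sizes, and batch size are placeholders, not the actual PIDGAN networks):

```python
import time

import numpy as np
import keras  # standalone Keras 3, or the Keras 2 bundled with TensorFlow

# toy stand-ins for the RICH training data: shapes and sizes are placeholders
x = np.random.rand(300_000, 8).astype("float32")
y = np.random.rand(300_000, 1).astype("float32")

# toy network, not the actual PIDGAN generator/discriminator pair
model = keras.Sequential(
    [keras.layers.Dense(64, activation="relu"), keras.layers.Dense(1)]
)
model.compile(optimizer="adam", loss="mse")

start = time.time()
model.fit(x, y, batch_size=2048, epochs=10, validation_split=0.2, verbose=1)
print(f"[INFO] Model training completed in {time.time() - start:.0f} s")
```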

Repeating the exercise without passing any metrics (metrics=None in compile()), on Keras 2.14.0 we get:

[...]
Epoch 10/10
102/102 [==============================] - 3s 31ms/step - g_loss: 1.5659 - d_loss: 0.5860 - g_lr: 3.9292e-04 - d_lr: 4.9445e-04 - val_g_loss: 1.0507 - val_d_loss: 0.5937
[INFO] Model training completed in 0h 00min 34s

while running without any metrics on Keras 3.3.3:

[...]
Epoch 10/10
102/102 ━━━━━━━━━━━━━━━━━━━━ 4s 36ms/step - d_loss: 0.5683 - g_loss: 1.7626 - g_lr: 3.9292e-04 - d_lr: 4.9445e-04 - val_d_loss: 0.5440 - val_g_loss: 1.3810
[INFO] Model training completed in 0h 00min 40s

going from 34 seconds of training on Keras 2 to 40 seconds on Keras 3 (+18% training time).
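For clarity, the only change between the two sets of runs is whether metrics are passed at compile time; a minimal sketch of this toggle on a plain Keras model (the actual PIDGAN compile() signature may differ) is:

```python
import numpy as np
import keras

# placeholder data and network, just to show the compile-time toggle
x = np.random.rand(1_000, 8).astype("float32")
y = (np.random.rand(1_000, 1) > 0.5).astype("float32")

model = keras.Sequential(
    [
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)

# baseline runs: metrics updated at every training/validation step
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# "no metrics" runs: skip metric tracking entirely
# model.compile(optimizer="adam", loss="binary_crossentropy", metrics=None)

model.fit(x, y, batch_size=128, epochs=2, verbose=0)
```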

mbarbetti commented 3 months ago

Repeating the exercise with scripts/train_ANN_isMuon.py as a reference, again for 10 epochs and with a dataset of 300,000 instances, we observe a similar slowdown. However, this script does not rely on a custom training procedure: it uses a PIDGAN model that is a simple wrapper around the Keras Model class. This suggests that PIDGAN itself is not the source of the issue, and points to Keras 3 instead.
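As a rough illustration of what "a simple wrapper around the Keras Model class" means here (the snippet below is hypothetical and does not reproduce the actual PIDGAN classifier):

```python
import keras

class SimpleClassifier(keras.Model):
    """Hypothetical thin wrapper around keras.Model, in the spirit of the
    classifier trained by train_ANN_isMuon.py (not the actual PIDGAN code)."""

    def __init__(self, hidden_units=(128, 128), **kwargs):
        super().__init__(**kwargs)
        self._hidden = [keras.layers.Dense(u, activation="relu") for u in hidden_units]
        self._out = keras.layers.Dense(1, activation="sigmoid")

    def call(self, inputs):
        x = inputs
        for layer in self._hidden:
            x = layer(x)
        return self._out(x)

# training then goes through the standard compile()/fit() workflow, so any
# slowdown observed here should come from Keras itself rather than from PIDGAN
```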

Test machine details: Intel(R) Xeon(R) Gold 6140M CPU @ 2.30GHz (no GPU card equipped)

Launched command:

python train_ANN_isMuon.py -p pion -E 10 -C 300_000 -D 2016MU --test

Running on Keras 2.14.0:

[...]
Epoch 10/10
102/102 [==============================] - 1s 15ms/step - loss: 0.2095 - auc: 0.7671 - lr: 9.5631e-04 - val_loss: 0.2103 - val_auc: 0.7614
[INFO] Model training completed in 0h 00min 16s

while running on Keras 3.3.3:

[...]
Epoch 10/10
102/102 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - auc: 0.7592 - loss: 0.2191 - lr: 9.5631e-04 - val_auc: 0.7655 - val_loss: 0.2188
[INFO] Model training completed in 0h 00min 19s

going from 16 seconds of training on Keras 2 to 19 seconds on Keras 3 (+19% training time).

mbarbetti commented 3 months ago

Repeating both of the previous exercises (with scripts/train_ANN_isMuon.py and scripts/train_GAN_Rich.py) for 10 epochs and with a dataset of 300,000 instances, this time after also removing the learning-rate scheduling (callbacks=None in the fit() method), we observe the same drop in timing performance.
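For reference, the learning-rate scheduling is disabled simply by not passing any callbacks to fit(); a minimal sketch on a plain Keras model (callback choice and parameters are placeholders, not the actual PIDGAN scheduler):

```python
import numpy as np
import keras

# placeholder data and network
x = np.random.rand(1_000, 8).astype("float32")
y = np.random.rand(1_000, 1).astype("float32")

model = keras.Sequential(
    [keras.layers.Dense(32, activation="relu"), keras.layers.Dense(1)]
)
model.compile(optimizer="adam", loss="mse")

# previous runs: learning-rate scheduling handled by a callback, e.g.
# lr_sched = keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.5, patience=2)
# model.fit(x, y, epochs=10, callbacks=[lr_sched], verbose=0)

# these runs: no callbacks at all
model.fit(x, y, epochs=10, callbacks=None, verbose=0)
```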

Test machine details: Intel(R) Xeon(R) Gold 6140M CPU @ 2.30GHz (no GPU card equipped)


Launched command:

python train_ANN_isMuon.py -p pion -E 10 -C 300_000 -D 2016MU --test

Running on Keras 2.14.0:

[...]
Epoch 10/10
102/102 [==============================] - 1s 13ms/step - loss: 0.2196 - auc: 0.7632 - val_loss: 0.2164 - val_auc: 0.7646
[INFO] Model training completed in 0h 00min 15s

while running on Keras 3.3.3:

[...]
Epoch 10/10
102/102 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - auc: 0.7596 - loss: 0.2201 - val_auc: 0.7678 - val_loss: 0.2163
[INFO] Model training completed in 0h 00min 17s

going from 15 seconds of training on Keras 2 to 17 seconds on Keras 3 (+13% training time).


Launched command:

python train_GAN_Rich.py -p pion -E 10 -C 300_000 -D 2016MU --test

Running on Keras 2.14.0:

[...]
Epoch 10/10
102/102 [==============================] - 4s 40ms/step - g_loss: 1.6483 - d_loss: 0.5900 - accuracy: 0.2859 - bce: 2.0635 - val_g_loss: 0.9593 - val_d_loss: 0.6733 - val_accuracy: 0.2897 - val_bce: 2.0299
[INFO] Model training completed in 0h 00min 44s

while running on Keras 3.3.3:

[...]
Epoch 10/10
102/102 ━━━━━━━━━━━━━━━━━━━━ 5s 49ms/step - accuracy: 0.2851 - bce: 1.9503 - d_loss: 0.5958 - g_loss: 1.5373 - val_accuracy: 0.2896 - val_bce: 1.8913 - val_d_loss: 0.6007 - val_g_loss: 1.0399
[INFO] Model training completed in 0h 00min 54s

going from 44 seconds of training on Keras 2 to 54 seconds on Keras 3 (+23% training time).

mbarbetti commented 2 months ago

Following the suggestions of @fchollet, it seems that with jit_compile=True Keras 3 actually outperforms Keras 2 in timing performance.

source: https://github.com/keras-team/keras/issues/19953#issuecomment-2210586395
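For reference, a minimal sketch of how XLA compilation can be enabled on a standard Keras model via compile() (data and network below are placeholders; whether and where PIDGAN exposes this flag is not shown here):

```python
import numpy as np
import keras

# placeholder data and network
x = np.random.rand(10_000, 8).astype("float32")
y = np.random.rand(10_000, 1).astype("float32")

model = keras.Sequential(
    [keras.layers.Dense(64, activation="relu"), keras.layers.Dense(1)]
)

# jit_compile=True asks Keras to compile the train step with XLA;
# according to the linked thread, this is what makes Keras 3 faster than Keras 2
model.compile(optimizer="adam", loss="mse", jit_compile=True)

model.fit(x, y, batch_size=512, epochs=5, verbose=0)
```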