giuseppegrieco / keras-tuner-cv

Extension for keras tuner that adds a set of classes to implement cross validation techniques.
GNU General Public License v3.0
5 stars · 5 forks

GPU issues? #8

Open · Feheragyar opened this issue 1 year ago

Feheragyar commented 1 year ago

I have been using your extension on CPUs and it runs perfectly. I recently moved over to using a GPU, and the loss calculation now looks completely chaotic. Are there some issues in the implementation that prohibit the use of GPUs? Here is a snippet so you can see the loss calculation issues (the best loss is 'None'; the recovered weights result in previously unseen loss values after early stopping; and because the best loss is 'None', the best hyperparameters remain as they were set for the very first trial):


Inner Cross-Validation 5/5

Epoch 1/50
6/6 [==============================] - 5s 575ms/step - loss: 0.5369 - mean_squared_error: 0.5369 - mean_absolute_error: 0.6359 - mean_absolute_percentage_error: 263.9126 - root_mean_squared_error: 0.7327 - val_loss: 0.0721 - val_mean_squared_error: 0.0721 - val_mean_absolute_error: 0.2148 - val_mean_absolute_percentage_error: 22.1264 - val_root_mean_squared_error: 0.2685
Epoch 2/50
6/6 [==============================] - 3s 475ms/step - loss: 0.1652 - mean_squared_error: 0.1652 - mean_absolute_error: 0.3106 - mean_absolute_percentage_error: 323.5719 - root_mean_squared_error: 0.4065 - val_loss: 0.0850 - val_mean_squared_error: 0.0850 - val_mean_absolute_error: 0.2492 - val_mean_absolute_percentage_error: 25.4391 - val_root_mean_squared_error: 0.2915
Epoch 3/50
6/6 [==============================] - 3s 478ms/step - loss: 0.1079 - mean_squared_error: 0.1079 - mean_absolute_error: 0.2405 - mean_absolute_percentage_error: 256.0751 - root_mean_squared_error: 0.3284 - val_loss: 0.0103 - val_mean_squared_error: 0.0103 - val_mean_absolute_error: 0.0714 - val_mean_absolute_percentage_error: 7.3397 - val_root_mean_squared_error: 0.1013
Epoch 4/50
6/6 [==============================] - 3s 478ms/step - loss: 0.1035 - mean_squared_error: 0.1035 - mean_absolute_error: 0.1980 - mean_absolute_percentage_error: 354.6868 - root_mean_squared_error: 0.3217 - val_loss: 0.0538 - val_mean_squared_error: 0.0538 - val_mean_absolute_error: 0.2179 - val_mean_absolute_percentage_error: 22.2260 - val_root_mean_squared_error: 0.2319
Epoch 5/50
6/6 [==============================] - 3s 481ms/step - loss: 0.1149 - mean_squared_error: 0.1149 - mean_absolute_error: 0.2556 - mean_absolute_percentage_error: 254.6845 - root_mean_squared_error: 0.3389 - val_loss: 0.0229 - val_mean_squared_error: 0.0229 - val_mean_absolute_error: 0.1178 - val_mean_absolute_percentage_error: 12.0714 - val_root_mean_squared_error: 0.1513
Epoch 6/50
6/6 [==============================] - 2s 381ms/step - loss: 0.0978 - mean_squared_error: 0.0978 - mean_absolute_error: 0.2223 - mean_absolute_percentage_error: 208.5932 - root_mean_squared_error: 0.3127 - val_loss: 0.0734 - val_mean_squared_error: 0.0734 - val_mean_absolute_error: 0.2140 - val_mean_absolute_percentage_error: 22.2007 - val_root_mean_squared_error: 0.2710
Epoch 7/50
6/6 [==============================] - 1s 225ms/step - loss: 0.0789 - mean_squared_error: 0.0789 - mean_absolute_error: 0.2038 - mean_absolute_percentage_error: 213.5430 - root_mean_squared_error: 0.2808 - val_loss: 0.0186 - val_mean_squared_error: 0.0186 - val_mean_absolute_error: 0.0969 - val_mean_absolute_percentage_error: 10.0373 - val_root_mean_squared_error: 0.1364
Epoch 8/50
6/6 [==============================] - 1s 228ms/step - loss: 0.0708 - mean_squared_error: 0.0708 - mean_absolute_error: 0.1652 - mean_absolute_percentage_error: 276.1188 - root_mean_squared_error: 0.2662 - val_loss: 0.0087 - val_mean_squared_error: 0.0087 - val_mean_absolute_error: 0.0701 - val_mean_absolute_percentage_error: 7.1587 - val_root_mean_squared_error: 0.0935
Epoch 9/50
6/6 [==============================] - 1s 219ms/step - loss: 0.0676 - mean_squared_error: 0.0676 - mean_absolute_error: 0.1503 - mean_absolute_percentage_error: 282.9794 - root_mean_squared_error: 0.2600 - val_loss: 0.0090 - val_mean_squared_error: 0.0090 - val_mean_absolute_error: 0.0536 - val_mean_absolute_percentage_error: 5.5848 - val_root_mean_squared_error: 0.0950
Epoch 10/50
6/6 [==============================] - 2s 409ms/step - loss: 0.0663 - mean_squared_error: 0.0663 - mean_absolute_error: 0.1536 - mean_absolute_percentage_error: 242.2759 - root_mean_squared_error: 0.2574 - val_loss: 0.0151 - val_mean_squared_error: 0.0151 - val_mean_absolute_error: 0.0738 - val_mean_absolute_percentage_error: 7.7006 - val_root_mean_squared_error: 0.1227
Epoch 11/50
6/6 [==============================] - 3s 481ms/step - loss: 0.0696 - mean_squared_error: 0.0696 - mean_absolute_error: 0.1742 - mean_absolute_percentage_error: 183.5706 - root_mean_squared_error: 0.2638 - val_loss: 0.0395 - val_mean_squared_error: 0.0395 - val_mean_absolute_error: 0.1167 - val_mean_absolute_percentage_error: 12.3000 - val_root_mean_squared_error: 0.1986
Epoch 12/50
6/6 [==============================] - 2s 269ms/step - loss: 0.0635 - mean_squared_error: 0.0635 - mean_absolute_error: 0.1620 - mean_absolute_percentage_error: 193.5781 - root_mean_squared_error: 0.2520 - val_loss: 0.0258 - val_mean_squared_error: 0.0258 - val_mean_absolute_error: 0.0838 - val_mean_absolute_percentage_error: 8.8847 - val_root_mean_squared_error: 0.1606
Epoch 13/50
6/6 [==============================] - 2s 409ms/step - loss: 0.0594 - mean_squared_error: 0.0594 - mean_absolute_error: 0.1509 - mean_absolute_percentage_error: 208.7011 - root_mean_squared_error: 0.2438 - val_loss: 0.0404 - val_mean_squared_error: 0.0404 - val_mean_absolute_error: 0.1378 - val_mean_absolute_percentage_error: 14.4424 - val_root_mean_squared_error: 0.2011
Restoring model weights from the end of the best epoch.
Epoch 00013: early stopping
1/1 [==============================] - 1s 579ms/step
1/1 [==============================] - 0s 500ms/step
1/1 [==============================] - 1s 1s/step - loss: 0.0499 - mean_squared_error: 0.0499 - mean_absolute_error: 0.1130 - mean_absolute_percentage_error: 234.8392 - root_mean_squared_error: 0.2234
1/1 [==============================] - 1s 609ms/step - loss: 0.1864 - mean_squared_error: 0.1864 - mean_absolute_error: 0.2046 - mean_absolute_percentage_error: 106.4081 - root_mean_squared_error: 0.4317

Trial 1 Complete [00h 02m 55s]

Best val_loss So Far: None
Total elapsed time: 00h 02m 55s

VZoche-Golob commented 1 year ago

When using the version included in pull request #5, I did not run into any issues during hyperparameter optimization on a GPU. Did you use the exact same code on a CPU and on a GPU?

Feheragyar commented 1 year ago

Yes, I used the #5 version, and I ran the identical code on CPU and GPU.

VZoche-Golob commented 1 year ago

Unfortunately, I cannot reproduce your issues. Could you please provide a hypermodel (e.g. for MNIST) and the HPO and training procedure as a code snippet (e.g. in a gist) that reproduces the issue?

Feheragyar commented 1 year ago

Here is the Gist with the full code I am running (it's an LSTM tuned via BayesianOptimization), along with the data I use for training. The code grabs the data directly in NumPy format, so you can run it as-is.

Thanks a lot for the help! Hope you can see the issues this way.

VZoche-Golob commented 1 year ago

Using your data, I tried to reproduce your issues again, after minor modifications of your code (Gist) and in branch fixGPUissues. I used the same computer, with TensorFlow 2.11.0 and Keras Tuner 1.1.3.

Again, I could not reproduce your issues (see the out-files in the Gist).

However, I used batch_size='full-batch' to ensure that keras-tuner-cv used the same batch sizes during training and evaluation of a CV split in a trial. Please be aware that inner_cv() of keras-tuner-cv always uses the full length of the training and validation data, respectively, as the batch size when evaluating the trained model in a CV split.

Feheragyar commented 1 year ago

Thank you for the help! I will look around my environment; something in it must be disagreeing with your package.

Feheragyar commented 1 year ago

Which version of tensorflow-gpu is recommended for use with keras-tuner-cv?

VZoche-Golob commented 1 year ago

I did not test it with any version other than TensorFlow 2.11.

Feheragyar commented 1 year ago

So do you run the script on Linux or WSL? Perhaps that's my issue; I am running it on native Windows.

VZoche-Golob commented 1 year ago

Using WSL2.

VZoche-Golob commented 1 year ago

It seems that I get the same issue after updating from TensorFlow 2.11 and KerasTuner 1.1.3 to TensorFlow 2.12 and KerasTuner 1.3.5. @Feheragyar: How did you handle this issue?

Feheragyar commented 1 year ago

I simply migrated to Linux. I gave up on virtual environments, as I couldn't make the library run after a full day of fiddling. I used the TF and tuner versions you cited in a previous comment (TensorFlow 2.11.0 and Keras Tuner 1.1.3). I had no issues on native Linux (Ubuntu) using Anaconda.

VZoche-Golob commented 1 year ago

@Feheragyar : Thanks for answering so quickly. Most probably, that will be my solution as well...

Using TensorFlow 2.12 and Kerastuner 1.3.5, test_randomsearch, test_bayesianoptimization and test_hyperband in keras_tuner_cv/test_inner_cv.py (https://github.com/VZoche-Golob/keras-tuner-cv) fail.

Feheragyar commented 1 year ago

No worries. I believe that's all I did; let me know if you run into trouble and I'll try to retrace my steps for you. I believe I also tested it with the most up-to-date TF while keeping the old tuner version, and it still worked perfectly.

VZoche-Golob commented 1 year ago

I tried different versions of TensorFlow and KerasTuner. It seems that keras-tuner-cv currently only works with KerasTuner 1.1.3 and NumPy 1.20.

When using KerasTuner 1.1.3 with TensorFlow >2.11, you will get several deprecation warnings. However, even with TensorFlow 2.11, KerasTuner 1.1.3 and NumPy 1.20, you get:

lib/python3.9/site-packages/keras_tuner/tuners/bayesian.py:123: DeprecationWarning: np.float is a deprecated alias for the builtin float. To silence this warning, use float by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64 here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
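For reference, a pinned environment matching the combination reported to work in this thread might look like the sketch below. The TensorFlow and KerasTuner versions are taken from the comments above; the exact NumPy pin style is an assumption based on the "numpy 1.20" mentioned, not a tested constraint:

```
# requirements.txt (sketch of the working combination reported in this thread)
tensorflow==2.11.0
keras-tuner==1.1.3
numpy==1.20.*
```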

VZoche-Golob commented 1 year ago

I think I found the issue: in KerasTuner 1.1.3, the status of a trial was set to "completed" by Oracle.end_trial(), but in v1.3.5 the status is set earlier by BaseTuner._try_run_and_update_trial(), which did not exist in v1.1.3.
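The mechanics of this ordering bug can be illustrated with a toy sketch. This is not the actual KerasTuner code (the class and attribute names here are simplified stand-ins): it only shows why marking the trial "completed" before end_trial() runs can leave the score unrecorded, so the "best val_loss" stays None and the first trial's hyperparameters are never displaced:

```python
# Toy illustration (not actual KerasTuner code) of the ordering bug:
# if something marks the trial COMPLETED before end_trial() runs,
# end_trial() skips recording the score and the best value stays None.

class Trial:
    def __init__(self):
        self.status = "RUNNING"
        self.score = None

class Oracle:
    def __init__(self):
        self.best_score = None

    def end_trial(self, trial, score):
        # Only record the score for trials still RUNNING, mirroring a
        # v1.1.3-style flow where end_trial() owns the status transition.
        if trial.status == "RUNNING":
            trial.score = score
            trial.status = "COMPLETED"
            if self.best_score is None or score < self.best_score:
                self.best_score = score

oracle = Oracle()

# v1.1.3-style flow: end_trial() sets the status itself, score is kept.
t1 = Trial()
oracle.end_trial(t1, 0.05)
print(oracle.best_score)  # 0.05

# v1.3.5-style flow: the status was already set to COMPLETED earlier
# (as by _try_run_and_update_trial()), so end_trial() drops the score.
t2 = Trial()
t2.status = "COMPLETED"
oracle.end_trial(t2, 0.01)
print(oracle.best_score)  # still 0.05; the better 0.01 trial was lost
```

Under this reading, the fix is to make the CV wrapper compatible with whichever component owns the status transition in the installed KerasTuner version.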

VZoche-Golob commented 1 year ago

I fixed the issue for KerasTuner 1.3.5 in https://github.com/VZoche-Golob/keras-tuner-cv

VZoche-Golob commented 1 year ago

After merging https://github.com/giuseppegrieco/keras-tuner-cv/pull/5, this issue should be fixed.