keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.
Apache License 2.0

"WARNING:Callback method 'on_train_batch_end' is slow compared to the batch time" when no callback activated and training slowed down #268

Closed · DanielYang59 closed this 1 year ago

DanielYang59 commented 1 year ago

System information.

Describe the problem.

Got the warning "WARNING:tensorflow:Callback method 'on_train_batch_end' is slow compared to the batch time (batch time: 0.1608s vs 'on_train_batch_end' time: 0.2945s). Check your callbacks." even though no callbacks were set.

Training is significantly slowed down and training time varies randomly between trials.

Describe the current behavior. Training is significantly slowed down and training time varies significantly between trials.

Describe the expected behavior. Training speed should be stable, and no significant variance in training time is expected between epochs.
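
For reference, this warning is emitted when Keras times a callback hook and finds it long relative to the batch itself. A minimal sketch (toy model and data, not from this report) that deliberately triggers it:

import time
import tensorflow as tf

# Hypothetical callback: sleeping in on_train_batch_end makes the hook
# slower than the batch, which triggers the warning.
class SlowCallback(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        time.sleep(0.05)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

x = tf.random.normal((256, 4))
y = tf.random.normal((256, 1))
model.fit(x, y, batch_size=32, epochs=1, callbacks=[SlowCallback()])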

Contributing.

Standalone code to reproduce the issue.

# Generate dataset (assumes feature, label, total_sample, train_size,
# and batch_size are defined earlier in the script)
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((feature, label))
dataset = dataset.shuffle(buffer_size=total_sample, reshuffle_each_iteration=False)

# Split into training and validation sets
train_set = dataset.take(train_size)
val_set = dataset.skip(train_size)

# Batch and prefetch both sets
train_set = train_set.batch(batch_size=batch_size)
train_set = train_set.prefetch(tf.data.AUTOTUNE)
val_set = val_set.batch(batch_size)
val_set = val_set.prefetch(tf.data.AUTOTUNE)

# Hyperparameter tuning with KerasTuner
import keras_tuner
from hp_model import hp_model

tuner = keras_tuner.Hyperband(
    hypermodel=hp_model,
    max_epochs=150,
    factor=3,
    overwrite=False,
    objective="val_mean_absolute_error",
    directory="hp_search",
)

tuner.search(
    train_set,
    validation_data=val_set,
    epochs=1000,
    verbose=2,
)

In the "hp_model", a hypermodel with eight hyperparameters is defined (should I search so many parameters at the same time?) like this, the complete source code is enclosed as "hp_model.py":

# Master Layer
hp_master_1st_dense_units = hp.Choice("hp_master_1st_dense_units", [64, 128, 256, 512, 1024])
hp_master_2nd_dense_units = hp.Choice("hp_master_2nd_dense_units", [64, 128, 256, 512, 1024])
hp_master_3rd_dense_layer = hp.Boolean("hp_master_3rd_dense_layer", default=False)
hp_master_activation_function = hp.Choice("hp_master_act_func", ["tanh", "relu", "sigmoid"])

# Branch
hp_branch_dense_activation_func = hp.Choice("hp_branch_dense_activation_func", ["tanh", "relu", "sigmoid"]) 
hp_numFilters = hp.Int("hp_numFilters", min_value=2, max_value=128, sampling="log")
hp_branch_kernel_size = hp.Int("hp_branch_kernel_size", min_value=2, max_value=32, step=2) 
hp_branch_dense_units = hp.Choice("hp_branch_dense_units", [16, 32, 64, 128, 256, 512])
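
For orientation, here is a minimal sketch of how such choices typically sit inside a KerasTuner build function. This toy model is an assumption for illustration, not the attached hp_model.py:

import tensorflow as tf

def hp_model(hp):
    # Hypothetical build function; the real hp_model.py differs.
    units = hp.Choice("hp_master_1st_dense_units", [64, 128, 256, 512, 1024])
    act = hp.Choice("hp_master_act_func", ["tanh", "relu", "sigmoid"])
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation=act),
        tf.keras.layers.Dense(1),
    ])
    # The metric name must match the tuner objective "val_mean_absolute_error"
    model.compile(optimizer="adam", loss="mse",
                  metrics=["mean_absolute_error"])
    return model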

Source code / logs.

This is the log file for the training process: tunerlog.txt.

Here is the source code for the hypermodel and tuning process: src.zip

sushreebarsa commented 1 year ago

@HaoyuYang59 I could run the source code successfully on Colab using TF v2.9 and TF v2.11; please find the attached gists. Could you let us know if I am missing something to reproduce the reported issue? Thank you!

DanielYang59 commented 1 year ago

Hi @sushreebarsa , thanks for following up.

I realized yesterday that this might not be an issue with Keras Tuner. Instead, it seems to be expected behavior: I was adjusting the number of Conv layers during tuning, so variance in training time should be normal, if I understand correctly?
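
For illustration, a toy build function (an assumption, not the attached code) in which the tuned depth changes per-batch compute, and with it the batch-time baseline that the callback warning is measured against:

import tensorflow as tf

def build(hp):
    # Each trial samples a different depth, so batch times differ between trials.
    model = tf.keras.Sequential([tf.keras.Input(shape=(64, 1))])
    for _ in range(hp.Int("n_conv", min_value=1, max_value=4)):
        model.add(tf.keras.layers.Conv1D(16, kernel_size=3, activation="relu"))
    model.add(tf.keras.layers.GlobalAveragePooling1D())
    model.add(tf.keras.layers.Dense(1))
    model.compile(optimizer="adam", loss="mse")
    return model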

Thanks for your time and wishing you all the best.

Regards, Haoyu

google-ml-butler[bot] commented 1 year ago

Are you satisfied with the resolution of your issue?