keras-team / autokeras

AutoML library for deep learning
http://autokeras.com/
Apache License 2.0

Bug: If best_loss is nan in trial#1, then the best_loss_so_far will not update when best_loss is normal in the later trial. #1758

Open jason022085 opened 2 years ago

jason022085 commented 2 years ago

Bug Description

If best_loss is NaN in trial #1, best_loss_so_far is never updated afterwards, even when a later trial produces a valid (non-NaN) best_loss.

Bug Reproduction

Code for reproducing the bug:

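# Imports are assumed; the original report omits them.
import autokeras as ak
from tensorflow import keras

# max_trials, metrics, x_train, y_train, x_val and y_val are defined
# elsewhere; the report does not show their values.
automodel = ak.TimeseriesForecaster(
    lookback=1,
    predict_from=1,
    predict_until=None,
    max_trials=max_trials,
    max_model_size=10**8,
    tuner="bayesian",
    metrics=metrics,
    objective="val_loss",
    overwrite=True,
    directory="./",
    project_name="DL",
)

# Log each epoch to CSV and stop training once the loss becomes NaN or inf.
cb_list = [
    keras.callbacks.CSVLogger("./history.csv", separator=",", append=True),
    keras.callbacks.TerminateOnNaN(),
]

automodel.fit(
    x=x_train, y=y_train, validation_data=(x_val, y_val),
    batch_size=1, epochs=100, callbacks=cb_list,
)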

Data used by the code (only show 5 features):

[Image of the data; not reproduced here]

Expected Behavior

The val_loss from trial 2 should replace the NaN value as best_loss_so_far.
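One plausible explanation, which is an assumption and not confirmed anywhere in this thread, is that the running best value is updated with a plain `<` comparison. Every comparison against NaN evaluates to False, so a NaN stored after trial 1 would never be replaced. The sketch below uses two hypothetical helpers (`update_best_naive` and `update_best_nan_aware`, neither taken from KerasTuner or AutoKeras) and the three val_loss values from the log to show the difference:

```python
import math


def update_best_naive(best, new):
    """Hypothetical running-minimum update; NOT KerasTuner's actual code."""
    # `new < nan` is always False, so a NaN best is never replaced.
    if best is None or new < best:
        return new
    return best


def update_best_nan_aware(best, new):
    """Hypothetical variant that treats NaN as 'no valid value yet'."""
    if best is None or math.isnan(best):
        return new
    if math.isnan(new):
        return best
    return min(best, new)


# val_loss values of trials 1-3, copied from the log in this report.
trial_losses = [float("nan"), 0.011794866994023323, 0.058854274451732635]

naive_best = None
aware_best = None
for loss in trial_losses:
    naive_best = update_best_naive(naive_best, loss)
    aware_best = update_best_nan_aware(aware_best, loss)

print("naive:", naive_best)
print("NaN-aware:", aware_best)
```

With the naive update the best value stays NaN for all three trials, which matches the "Best val_loss So Far: nan" lines in the log; the NaN-aware variant ends on the trial 2 value.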

Setup Details

Include the details about the versions of:

Additional context

Search: Running Trial #1

Hyperparameter    |Value             |Best Value So Far
timeseries_bloc...|True              |?
timeseries_bloc...|gru               |?
timeseries_bloc...|3                 |?
regression_head...|0                 |?
optimizer         |sgd               |?
learning_rate     |0.1               |?
Epoch 8/100
Batch 0: Invalid loss, terminating training
2/2 [==============================] - 1s 632ms/step - loss: inf - mean_absolute_error: 1744884913465074909184.0000 - val_loss: nan - val_mean_absolute_error: nan
/home/b00175/.local/lib/python3.6/site-packages/keras_tuner/engine/metrics_tracking.py:85: RuntimeWarning: All-NaN axis encountered
  return np.nanmin(values)
Trial 1 Complete [00h 00m 31s]
val_loss: nan

Best val_loss So Far: nan
Total elapsed time: 00h 00m 31s

Search: Running Trial #2

Hyperparameter    |Value             |Best Value So Far
timeseries_bloc...|True              |True
timeseries_bloc...|gru               |gru
timeseries_bloc...|2                 |3
regression_head...|0.25              |0
optimizer         |adam              |sgd
learning_rate     |2e-05             |0.1
Trial 2 Complete [00h 00m 39s]
val_loss: 0.011794866994023323

Best val_loss So Far: nan
Total elapsed time: 00h 01m 11s

Search: Running Trial #3

Hyperparameter    |Value             |Best Value So Far
timeseries_bloc...|True              |True
timeseries_bloc...|gru               |gru
timeseries_bloc...|2                 |3
regression_head...|0.25              |0
optimizer         |adam_weight_decay |sgd
learning_rate     |0.001             |0.1
Trial 3 Complete [00h 00m 36s]
val_loss: 0.058854274451732635

Best val_loss So Far: nan
Total elapsed time: 00h 01m 47s

ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./code/upload/report_hi/Group_61/DL/trial_4f14555e0cfc07cd6667f708be62af02/checkpoints/epoch_None/checkpoint: Not found: ./code/upload/report_hi/Group_61/DL/trial_4f14555e0cfc07cd6667f708be62af02/checkpoints/epoch_None; No such file or directory

haifeng-jin commented 1 year ago

This might be a bug in KerasTuner. I will look into it.
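For context on the RuntimeWarning in the trial 1 log: it is raised by `np.nanmin` when every value it receives is NaN, and NumPy then returns NaN. The sketch below reproduces only that NumPy behavior; `per_trial_best` is a hypothetical helper written for illustration, and how KerasTuner uses the result internally is not shown in this thread.

```python
import math
import warnings

import numpy as np


def per_trial_best(values):
    """Return (min ignoring NaN, warning categories raised by np.nanmin).

    Hypothetical helper for illustration; not part of KerasTuner.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        best = np.nanmin(values)
    return float(best), [w.category for w in caught]


# All epochs produced NaN, as in trial 1 of the log.
all_nan_best, all_nan_warnings = per_trial_best([np.nan, np.nan])

# At least one finite value: NaN entries are skipped, no warning is raised.
mixed_best, mixed_warnings = per_trial_best([np.nan, 0.5, 0.25])

print(all_nan_best, all_nan_warnings)
print(mixed_best, mixed_warnings)
```

A trial whose recorded metric values are all NaN therefore yields NaN as its own best value, which is where the NaN in "Best val_loss So Far" can originate.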