bberlo opened this issue 3 years ago
Dear all,
A few days ago, instead of stopping the hyperparameter tuning programs prematurely, I ran the programs for every reg_combination (see Job Instantiation) from start to finish.
I have discovered that the option -m_n args.model_name in Job Instantiation causes every HyperbandSearchEdit object (see Setting up deep learning experiment) to save its Keras Tuner-specific information in the same directory and its logger information in the same directory (tuner-specific and logger information are still kept in separate directories from each other). This causes the following line to appear in the error stream: INFO:tensorflow:Reloading Oracle from existing project CONFIDENTIAL_DATA_PATH/widar_supervised_std/oracle.json.
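For reference, a minimal sketch of the kind of per-run directory separation that avoids this Oracle reload. It assumes the keras_tuner package and the plain kt.Hyperband tuner rather than the actual HyperbandSearchEdit subclass; the argument names (args.model_name, args.reg_combination, args.output_dir) and build_hypermodel are hypothetical stand-ins for the real program.

```python
import keras_tuner as kt

def build_tuner(build_hypermodel, args):
    # Unique directory/project_name per reg_combination, so separate runs do not
    # find and reload each other's oracle.json (argument names are hypothetical).
    run_id = f"{args.model_name}_reg{args.reg_combination}"
    return kt.Hyperband(
        build_hypermodel,
        objective="val_loss",
        max_epochs=40,
        directory=f"{args.output_dir}/{run_id}",
        project_name=run_id,
    )
```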
After solving this issue by giving every program its own directory, I discovered that the hyperband tuner continues indefinitely in the first bracket only for reg_combinations 0-3; it functions correctly for reg_combinations 4-7.
Therefore, I think the cause should be sought in either the number of hyperparameters defined in every HyperbandSearchEdit object, or the number of unique hyperparameter value combinations that can be formed from the defined hyperparameters.
Regards, Bram van Berlo
Dear all,
I am currently running hyperparameter tuning programs on a high-performance cluster as part of the deep learning experiments I am working on. Unfortunately, the hyperband tuner that I am using continues indefinitely in the first bracket.
This conclusion is based on the fact that the number of trials in the first bracket should be 20 (max_epochs is 40 and tuner/epochs is 2). However, the first bracket runs for 25 trials in cluster job 3920641, which approximately corresponds to the total number of unique hyperparameter combinations, i.e., batch sizes [12, 24, 57] and dropout rates [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4]. Afterwards, the program terminates with exit code 0 (likely because the random search process in the first bracket ran out of unique hyperparameter combinations).
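To make the numbers above concrete, here is a hedged sketch of the search space as described; the actual HyperbandSearchEdit hypermodel is more involved, and build_model below is a hypothetical stand-in. Three batch sizes times eight dropout rates gives 24 unique combinations, which roughly matches the 25 trials observed before the first bracket stopped producing new work.

```python
import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    # batch_size is only registered here so the tuner knows about it; in the real
    # program it would be consumed by a custom fit/run_trial override.
    hp.Choice("batch_size", [12, 24, 57])                              # 3 options
    dropout = hp.Choice("dropout_rate",
                        [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4])  # 8 options
    model = keras.Sequential([
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# 3 * 8 = 24 unique hyperparameter combinations in total, so a bracket that tries
# to draw more random configurations than this runs out of new combinations.
tuner = kt.Hyperband(build_model, objective="val_loss", max_epochs=40,
                     directory="debug_run", project_name="hyperband_bracket_check")
```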
I am not sure whether this issue is caused by my own programs or by a bug. Therefore, I would appreciate some assistance. Could one of you please figure out what causes the hyperband tuner to run indefinitely in the first bracket and point me in the direction of a potential solution?
Thank you in advance for your effort.
Regards, Bram van Berlo
Output and error streams for cluster job 3920641
slurm-3920641-err.txt slurm-3920641-out.txt
Error stream for cluster job 3921521 (DEBUG run)
slurm-3921521-err.txt Note: the output stream did not contain extra information compared to the output stream of cluster job 3920641.
Job instantiation
Setting up deep learning experiment
ExtractorCNN submodel
CustomL2