keras-team / keras-tuner

A Hyperparameter Tuning Library for Keras
https://keras.io/keras_tuner/
Apache License 2.0

Hyperband tuner continues indefinitely in the first bracket #511

Open bberlo opened 3 years ago

bberlo commented 3 years ago

Dear all,

I am currently running hyperparameter tuning programs on a high-performance cluster as part of my deep learning experiments. Unfortunately, the Hyperband tuner that I am using continues indefinitely in the first bracket.

This conclusion can be drawn from the fact that the first bracket should contain 20 trials (max_epochs is 40 and tuner/epochs is 2). However, in cluster job 3920641 the first bracket runs for 25 trials, which roughly corresponds to the total number of unique hyperparameter combinations: 3 batch sizes [12, 24, 57] × 8 dropout rates [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4] = 24. Afterwards, the program terminates with exit code 0, likely because the random search process in the first bracket ran out of unique hyperparameter combinations.

I am not sure whether this issue is caused by my own programs or by a bug in Keras Tuner, so I need some assistance. Could one of you please figure out what causes the Hyperband tuner to run indefinitely and point me in the direction of a potential solution?

Thank you in advance for your effort.

Regards, Bram van Berlo

Output and error streams for cluster job 3920641

slurm-3920641-err.txt slurm-3920641-out.txt

Error stream for cluster job 3921521 (DEBUG run)

slurm-3921521-err.txt Note: output stream did not contain extra information compared to output stream of cluster job 3920641.

Job instantiation

from Data_fetch_functions import fetch_widar_dfs_data
import sklearn.model_selection as sk
from simple_slurm import Slurm
import numpy as np
import argparse

# Command prompt settings for experiment automation
parser = argparse.ArgumentParser(description='Experiment automation setup script.')
parser.add_argument('-m_n', '--model_name', help='<Required> Set model name to be used in the experiment', required=True)
args = parser.parse_args()

# Fetch domain labels
domain_labels = fetch_widar_dfs_data()

# Label split object
k_fold_object = sk.KFold(n_splits=5, shuffle=True, random_state=42)

# Slurm cluster configuration
cluster_config_obj = Slurm(
    '--job_name', args.model_name,
    '--nodes', '1',
    '--ntasks', '8',
    '--partition', CONFIDENTIAL_PARTITION,
    '--error', 'slurm-%j.err',
    '--output', 'slurm-%j.out',
    '--time', '48:00:00',
    '--constraint', '2080ti'
    # '--dependency', 'singleton'  # To be used when running std, da_labeled, and da_unlabeled simultaneously
)

reg_combinations = [
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1]
]

# Random shuffle
# train_indices, test_indices = next(k_fold_object.split(X=domain_labels))

# Randomly distinguish between train/validate and test domains
domain_labels = np.argmax(domain_labels, axis=1) + 1
domain_types = np.arange(domain_labels.min(), domain_labels.max()+1)
train_type_indices, test_type_indices = next(k_fold_object.split(X=np.expand_dims(a=domain_types, axis=1)))
train_types, test_types = domain_types[train_type_indices], domain_types[test_type_indices]
train_indices, test_indices = \
    np.where(np.isin(domain_labels, test_elements=train_types))[0], \
    np.where(np.isin(domain_labels, test_elements=test_types))[0]

if args.model_name == "widar_supervised_std":
    for index, combination in enumerate(reg_combinations):
        cluster_config_obj.sbatch(run_cmd=' '.join([
            'python', args.model_name.capitalize() + '.py',
            '-e_s', '80',
            '-g', str(index),
            '-m_n', args.model_name,
            '-d_p', CONFIDENTIAL_DATA_PATH,
            '-t_i'] + list(map(str, train_indices.tolist()))
            + ['-r_i'] + list(map(str, combination))
        ), shell='/bin/bash')

elif args.model_name == "widar_supervised_da_labeled":
    for combination in reg_combinations:
        cluster_config_obj.sbatch(run_cmd=' '.join([
            'python', args.model_name.capitalize() + '.py',
            '-e_s', '80',
            '-m_n', args.model_name,
            '-d_p', CONFIDENTIAL_DATA_PATH,
            '-t_i'] + list(map(str, train_indices.tolist()))
            + ['-r_i'] + list(map(str, combination))
        ), shell='/bin/bash')

elif args.model_name == "widar_supervised_da_unlabeled":
    for combination in reg_combinations:
        cluster_config_obj.sbatch(run_cmd=' '.join([
            'python', args.model_name.capitalize() + '.py',
            '-e_s', '80',
            '-p_r', '100',
            '-q_r', '1000',
            '-m_n', args.model_name,
            '-d_p', CONFIDENTIAL_DATA_PATH,
            '-t_i'] + list(map(str, train_indices.tolist()))
            + ['-r_i'] + list(map(str, combination))
        ), shell='/bin/bash')

else:
    raise Exception("An unknown model experiment script was encountered")
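
For reference, the instantiation script above is launched once per experiment type, e.g. as follows (the script file name shown here is illustrative, not the actual file name used on the cluster):

python Job_instantiation.py -m_n widar_supervised_std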

Setting up deep learning experiment

import argparse
import tensorflow as tf
from distutils.util import strtobool
from kerastuner.tuners import hyperband
from models import ExtractorCNN
from Data_fetch_functions import fetch_widar_dfs_data
from kerastuner_tensorboard_logger import TensorBoardLogger, setup_tb

# Command prompt settings for experiment automation
parser = argparse.ArgumentParser(description='Experiment automation setup script.')
parser.add_argument('-e_s', '--epoch_size', type=int, help='<Required> Set epoch size to be used in the experiment', required=True)
parser.add_argument('-g', '--gpu', type=int, help='<Required> GPU to be used in the experiment', required=True)
parser.add_argument('-m_n', '--model_name', help='<Required> Set model name to be used in the experiment', required=True)
parser.add_argument('-d_p', '--data_path', help='<Required> Path to data directory', required=True)
parser.add_argument('-t_i', '--train_instances', type=int, nargs='+', help='<Required> Train instances to be used in the experiment', required=True)
parser.add_argument('-r_i', '--reg_instances', type=strtobool, nargs='+', help='<Required> Defines which regularization layers to activate', required=True)
args = parser.parse_args()

# GPU config. for allocating limited amount of memory on a given device
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    try:
        tf.config.experimental.set_memory_growth(gpus[args.gpu], True)
        tf.config.experimental.set_visible_devices(gpus[args.gpu], "GPU")
        """
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=9216)]
        )
        """
    except RuntimeError as e:
        print(e)

# Set logging level
tf.get_logger().setLevel("ERROR")

# Load training and validation data
x_train, x_eval, y_train, y_eval, _, _ = fetch_widar_dfs_data(args.train_instances)

callback_objects = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", min_delta=0, patience=5, restore_best_weights=True)
]

# ---------- STANDARD TRAINING WITH DFS DATA ---------------------------------------------------------------------

def build_model(hp):
    kernel_initializers = 'he_uniform'

    cnn_extractor = ExtractorCNN.ExtractorCNN(hp, list(map(bool, args.reg_instances)), kernel_initializers).get_model()
    inp = tf.keras.layers.Input(shape=(121, 2000, 6))
    enc_o = cnn_extractor(inp)

    x = tf.keras.layers.Dense(
        64,
        activation="relu", kernel_initializer=kernel_initializers)(enc_o)
    o = tf.keras.layers.Dense(y_train.shape[-1], activation="softmax", kernel_initializer=kernel_initializers, name="output")(x)

    complete_model = tf.keras.models.Model(inp, o, name=args.model_name)
    complete_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
                           loss="categorical_crossentropy")
    return complete_model

# ----------------------------------------------------------------------------------------------------------------

# Hyperparameter tune and extract result logs to directory/project_name
class HyperbandSearchEdit(hyperband.Hyperband):
    def run_trial(self, trial, *fit_args, **fit_kwargs):
        fit_kwargs['batch_size'] = trial.hyperparameters.Choice('batch_size', values=[12, 24, 57])
        super(HyperbandSearchEdit, self).run_trial(trial, *fit_args, **fit_kwargs)

tuner = HyperbandSearchEdit(
    hypermodel=build_model,
    objective='val_loss',
    max_epochs=40,
    factor=3,
    hyperband_iterations=1,
    seed=42,
    directory=args.data_path,
    project_name=args.model_name,
    logger=TensorBoardLogger(
        metrics=["val_loss"],
        logdir='results/' + args.model_name + '-hparams'
    )
)
setup_tb(tuner)
tuner.search(x=x_train, y=y_train,
             verbose=1, callbacks=callback_objects, validation_data=(x_eval, y_eval),
             validation_freq=1, shuffle=True)
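
Once the search finishes, the best configuration can be read back with the standard Keras Tuner calls, for example:

# Retrieve the best hyperparameter set and the best model found by the search
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
best_model = tuner.get_best_models(num_models=1)[0]
print(best_hps.values)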

ExtractorCNN submodel

from tensorflow import keras
from custom_items.utilities import CustomL2

class ExtractorCNN:
    def __init__(self, hp, reg_instances, kernel_initializers, input_shape=(121, 2000, 6)):
        self.input_shape = input_shape
        self.hp = hp
        self.kernel_initializers = kernel_initializers
        self.reg_instances = reg_instances

    def get_model(self):
        inp = keras.layers.Input(shape=self.input_shape, name="channel_wise_dfs_input")

        # --------- Regularization decisions --------- #
        inp_conv_1_bn = self.hp.Fixed('inp_conv_1_bn', value=self.reg_instances[0])
        inp_conv_1_l2 = self.hp.Fixed('inp_conv_1_l2', value=self.reg_instances[1])
        inp_conv_1_do = self.hp.Fixed('inp_conv_1_do', value=self.reg_instances[2])
        inp_conv_1_do_rate = self.hp.Choice('inp_conv_1_do_rate',
            values=[0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4],
            parent_name='inp_conv_1_do',
            parent_values=[True])

        inp_conv_2_bn = self.hp.Fixed('inp_conv_2_bn', value=self.reg_instances[3])
        inp_conv_2_l2 = self.hp.Fixed('inp_conv_2_l2', value=self.reg_instances[4])
        inp_conv_2_do = self.hp.Fixed('inp_conv_2_do', value=self.reg_instances[5])
        inp_conv_2_do_rate = self.hp.Choice('inp_conv_2_do_rate',
            values=[0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4],
            parent_name='inp_conv_2_do',
            parent_values=[True])

        inp_conv_3_bn = self.hp.Fixed('inp_conv_3_bn', value=self.reg_instances[6])
        inp_conv_3_l2 = self.hp.Fixed('inp_conv_3_l2', value=self.reg_instances[7])
        inp_conv_3_do = self.hp.Fixed('inp_conv_3_do', value=self.reg_instances[8])
        inp_conv_3_do_rate = self.hp.Choice('inp_conv_3_do_rate',
            values=[0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4],
            parent_name='inp_conv_3_do',
            parent_values=[True])

        inp_conv_4_bn = self.hp.Fixed('inp_conv_4_bn', value=self.reg_instances[9])
        inp_conv_4_l2 = self.hp.Fixed('inp_conv_4_l2', value=self.reg_instances[10])
        inp_conv_4_do = self.hp.Fixed('inp_conv_4_do', value=self.reg_instances[11])
        inp_conv_4_do_rate = self.hp.Choice('inp_conv_4_do_rate',
            values=[0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4],
            parent_name='inp_conv_4_do',
            parent_values=[True])

        # --------- Convolution layers --------- #
        inp_conv_1 = keras.layers.Conv2D(
            filters=120,
            kernel_size=(72, 64),
            strides=(2, 2),
            activation=None,
            padding="same",
            kernel_initializer=self.kernel_initializers,
            kernel_regularizer=CustomL2(
                l2=self.hp.Choice('inp_conv_1_l2_rate', values=[0.0001, 0.0005, 0.00001, 0.00005],
                                  parent_name='inp_conv_1_l2', parent_values=[True])),
            use_bias=False,
            name="channel_wise_dfs_conv1"
        )

        inp_conv_2 = keras.layers.Conv2D(
            filters=16,
            kernel_size=(88, 64),
            strides=(6, 1),
            activation=None,
            padding="same",
            kernel_initializer=self.kernel_initializers,
            kernel_regularizer=CustomL2(
                l2=self.hp.Choice('inp_conv_2_l2_rate', values=[0.0001, 0.0005, 0.00001, 0.00005],
                                  parent_name='inp_conv_2_l2', parent_values=[True])),
            use_bias=False,
            name="channel_wise_dfs_conv2"
        )

        inp_conv_3 = keras.layers.Conv2D(
            filters=248,
            kernel_size=(80, 24),
            strides=(2, 2),
            activation=None,
            padding="same",
            kernel_initializer=self.kernel_initializers,
            kernel_regularizer=CustomL2(
                l2=self.hp.Choice('inp_conv_3_l2_rate', values=[0.0001, 0.0005, 0.00001, 0.00005],
                                  parent_name='inp_conv_3_l2', parent_values=[True])),
            use_bias=False,
            name="channel_wise_dfs_conv3"
        )

        inp_conv_4 = keras.layers.Conv2D(
            filters=80,
            kernel_size=(52, 42),
            strides=(3, 5),
            activation=None,
            padding="same",
            kernel_initializer=self.kernel_initializers,
            kernel_regularizer=CustomL2(
                l2=self.hp.Choice('inp_conv_4_l2_rate', values=[0.0001, 0.0005, 0.00001, 0.00005],
                                  parent_name='inp_conv_4_l2', parent_values=[True])),
            use_bias=False,
            name="channel_wise_dfs_conv4"
        )

        # --------- Encoder --------- #
        enc = inp_conv_1(inp)
        enc = keras.layers.Activation("relu")(enc)
        enc = keras.layers.MaxPool2D(
            pool_size=(10, 20),
            strides=(12, 16),
            padding="same")(enc)
        if inp_conv_1_do:
            enc = keras.layers.Dropout(inp_conv_1_do_rate)(enc)
        if inp_conv_1_bn:
            enc = keras.layers.BatchNormalization()(enc)

        enc = inp_conv_2(enc)
        enc = keras.layers.Activation("relu")(enc)
        enc = keras.layers.MaxPool2D(
            pool_size=(50, 14),
            strides=(18, 20),
            padding="same")(enc)
        if inp_conv_2_do:
            enc = keras.layers.Dropout(inp_conv_2_do_rate)(enc)
        if inp_conv_2_bn:
            enc = keras.layers.BatchNormalization()(enc)

        enc = inp_conv_3(enc)
        enc = keras.layers.Activation("relu")(enc)
        if inp_conv_3_do:
            enc = keras.layers.Dropout(inp_conv_3_do_rate)(enc)
        if inp_conv_3_bn:
            enc = keras.layers.BatchNormalization()(enc)

        enc = inp_conv_4(enc)
        enc = keras.layers.Activation("relu")(enc)
        if inp_conv_4_do:
            enc = keras.layers.Dropout(inp_conv_4_do_rate)(enc)
        if inp_conv_4_bn:
            enc = keras.layers.BatchNormalization()(enc)

        gmp_enc = keras.layers.GlobalMaxPooling2D()(enc)

        return keras.models.Model(inp, gmp_enc)

CustomL2

# Custom L2 regularizer class that sets the L2 weight to 0. when None is passed
import tensorflow as tf
# Assumption: _check_penalty_number is taken from Keras' regularizers module in the TF version used here
from tensorflow.python.keras.regularizers import _check_penalty_number


class CustomL2(tf.keras.regularizers.l2):
    def __init__(self, l2=0.01, **kwargs):
        super(CustomL2, self).__init__(0., **kwargs)

        l2 = kwargs.pop('l', l2)  # Backwards compatibility
        if kwargs:
            raise TypeError('Argument(s) not recognized: %s' % (kwargs,))

        l2 = 0. if l2 is None else l2
        _check_penalty_number(l2)
        self.l2 = tf.keras.backend.cast_to_floatx(l2)
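
As a quick sanity check of that fallback behaviour (a None weight becomes a zero penalty, a numeric weight is kept), assuming the class is importable as in the ExtractorCNN file:

from custom_items.utilities import CustomL2

reg_off = CustomL2(l2=None)    # no conditional L2 rate selected -> zero penalty
reg_on = CustomL2(l2=0.0001)   # conditional L2 rate selected -> standard L2 penalty
print(reg_off.l2, reg_on.l2)   # expect 0.0 and (approximately) 0.0001
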
bberlo commented 3 years ago

Dear all,

A few days ago, instead of stopping the hyperparameter tuning programs prematurely, I ran the programs for every reg_combination (see Job instantiation) from start to finish.

I have discovered that the option -m_n args.model_name in Job instantiation causes every HyperbandSearchEdit object (see Setting up deep learning experiment) to save its Keras Tuner state and its logger output to the same respective directories across jobs (the tuner state and the logger output still go to two different directories, but all jobs for a given model name share them). This causes the following line to appear in the error stream: INFO:tensorflow:Reloading Oracle from existing project CONFIDENTIAL_DATA_PATH/widar_supervised_std/oracle.json.

After solving this issue by giving every program its own directory, I discovered that the Hyperband tuner only continues indefinitely in the first bracket for reg_combinations 0-3. The tuner functions correctly for reg_combinations 4-7.
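
For reference, a minimal sketch of that workaround (assuming the regularization combination index is forwarded to the training script as a hypothetical extra argument, here called -c_i/--comb_index), so that every job writes to its own tuner directory and no longer reloads another job's oracle.json:

import argparse

# Illustration only: derive a per-combination project_name for the Hyperband tuner
parser = argparse.ArgumentParser(description='Per-combination tuner directories (sketch).')
parser.add_argument('-m_n', '--model_name', required=True)
parser.add_argument('-c_i', '--comb_index', type=int, required=True)  # hypothetical extra argument
args = parser.parse_args()

# Each job now writes to e.g. CONFIDENTIAL_DATA_PATH/widar_supervised_std_comb0/oracle.json
# instead of all jobs sharing CONFIDENTIAL_DATA_PATH/widar_supervised_std/oracle.json
project_name = args.model_name + '_comb' + str(args.comb_index)
print(project_name)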

Therefore, I think the cause should be sought in either the number of hyperparameters defined in every HyperbandSearchEdit object, or the number of unique hyperparameter value combinations that can be formed from the defined hyperparameters. A rough count is given below.
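
For what it is worth, a back-of-the-envelope count of the unique hyperparameter value combinations per reg_combination, assuming only the batch size, the conditional dropout rates, and the conditional L2 rates contribute distinct values (the Fixed flags do not):

# Flags per conv block are ordered [bn, l2, do], four blocks = 12 entries (see Job instantiation)
reg_combinations = [
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1]
]

n_batch_sizes = 3    # [12, 24, 57]
n_dropout_rates = 8  # [0.05 ... 0.4]
n_l2_rates = 4       # [0.0001, 0.0005, 0.00001, 0.00005]

for index, combination in enumerate(reg_combinations):
    n_do_active = sum(combination[2::3])  # dropout flags sit at indices 2, 5, 8, 11
    n_l2_active = sum(combination[1::3])  # L2 flags sit at indices 1, 4, 7, 10
    total = n_batch_sizes * n_dropout_rates ** n_do_active * n_l2_rates ** n_l2_active
    print(index, total)

# reg_combinations 0-3 only allow 3 * 8 = 24 unique combinations, which matches the
# ~25 trials after which the first bracket stalls; reg_combinations 4-7 allow
# 3 * 8 * 4**4 = 6144, which would explain why those runs behave correctly.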

Regards, Bram van Berlo