keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

`val_loss` is not available in training after some epochs #16215

Closed cb0s closed 2 years ago

cb0s commented 2 years ago

Please go to TF Forum for help and support:

https://discuss.tensorflow.org/tag/keras

If you open a GitHub issue, here is our policy:

It must be a bug, a feature request, or a significant problem with the documentation (for small docs fixes please send a PR instead). The form below must be filled out.

Here's why we have that policy:

Keras developers respond to issues. We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed. We want them to see you communicating an interesting problem, rather than being redirected to Stack Overflow.

System information.

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the problem.

This issue first appeared here as issue 55060.

I am training a multi-scale deep convolutional autoencoder. After some epochs, val_loss can no longer be found by the TensorFlow callbacks:

    WARNING:tensorflow:Learning rate reduction is conditioned on metric val_loss which is not available. Available metrics are: loss,accuracy,lr
    WARNING:tensorflow:Early stopping conditioned on metric val_loss which is not available. Available metrics are: loss,accuracy,lr
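For context, Keras prints exactly these warnings whenever a callback monitors a metric that is missing from the logs dict, e.g. when fit runs without validation data. A minimal sketch that reproduces the warning (toy model and random data, not my actual network):

    import numpy as np
    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse')
    x = np.random.rand(32, 4).astype('float32')
    # No validation_data is passed, so val_loss never appears in the logs
    # and ReduceLROnPlateau warns at the end of every epoch:
    model.fit(x, x, epochs=2,
              callbacks=[tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss')])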

My callbacks are:

    callbacks = [tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                      factor=0.5, patience=8, min_lr=0.00001),
                 tf.keras.callbacks.ModelCheckpoint("models/ms-model1-1", monitor="val_loss"),
                 tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=25),
                 tf.keras.callbacks.TensorBoard(log_dir="train/ms-log1-1/", histogram_freq=1),
                 tf.keras.callbacks.ModelCheckpoint('models/ms-models-epochs/model{epoch:08d}', period=5)]

I added the last ModelCheckpoint callback to work around this bug. Unfortunately, that means extra work: I now have to reduce the learning rate manually. A sketch of a less manual workaround follows below.
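One could also compute the validation metrics manually and inject them into the logs, so that the monitoring callbacks always find val_loss. A minimal sketch (assuming TF >= 2.2 for `return_dict`; `ManualValidation` is a hypothetical helper, and it must be placed first in the callbacks list so that later callbacks see the injected keys):

    import tensorflow as tf

    class ManualValidation(tf.keras.callbacks.Callback):
        """Evaluate on a held-out dataset after every epoch and write the
        results into `logs`, so callbacks monitoring val_loss always find it."""

        def __init__(self, val_ds):
            super().__init__()
            self._val_ds = val_ds

        def on_epoch_end(self, epoch, logs=None):
            if logs is None:
                return
            results = self.model.evaluate(self._val_ds, verbose=0, return_dict=True)
            for name, value in results.items():
                # Only fill in metrics the built-in validation pass did not provide.
                logs.setdefault('val_' + name, value)

Usage would be, e.g., callbacks = [ManualValidation(val_ds_final), *other_callbacks].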

In TensorBoard it looks like this (screenshot: the val_loss curve simply stops being logged after epoch 4):

You can see that val_loss is not calculated at all any more. This happens after epoch 4, i.e. right after the periodic ModelCheckpoint was first triggered.

Describe the expected behavior.

The expected behaviour is that val_loss keeps being calculated and can therefore still be used by the ModelCheckpoint, EarlyStopping and other callbacks I use.

I am not entirely sure whether I am using the library wrong or whether this really is a bug (I am by no means a TF expert). The code for the network is included below; the datasets are too big, though, and are not available online. The data are Copernicus 4-channel 256x256 images. This is why I did not include the data-loading code, as it seems irrelevant to me (I can include it if you need it).

Contributing.

Standalone code to reproduce the issue.

Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Source code / logs.

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

import tensorflow as tf
from tensorflow.keras.layers import (Activation, BatchNormalization, Concatenate,
                                     Conv2D, Conv2DTranspose, Dense, GlobalMaxPool2D,
                                     Input, LeakyReLU, MaxPooling2D, Reshape,
                                     UpSampling2D, concatenate)
from tensorflow.keras.losses import MeanAbsoluteError

def create_normalized_convolution(filters: int,
                                  activation=None,
                                  strides=1,
                                  kernel_size=3,
                                  padding='SAME',
                                  seq_name=None,
                                  **kwargs) -> tf.keras.Sequential:
    if activation is None:
        activation = LeakyReLU()
    return tf.keras.Sequential([
        Conv2D(filters=filters, strides=strides, kernel_size=kernel_size, padding=padding, **kwargs),
        BatchNormalization(),
        activation
    ], name=seq_name)

def create_normalized_convolution_transpose(filters: int,
                                            activation=None,
                                            strides=1,
                                            kernel_size=3,
                                            padding='SAME',
                                            seq_name=None,
                                            **kwargs) -> tf.keras.Sequential:
    if activation is None:
        activation = LeakyReLU()
    return tf.keras.Sequential([
        Conv2DTranspose(filters=filters, strides=strides, kernel_size=kernel_size, padding=padding, **kwargs),
        BatchNormalization(),
        activation
    ], name=seq_name)

class BaseAutoencoder(tf.keras.Model):
    """
    Base class for all used AEs.
    The advantage of a base class is that it only requires you to specify the network structure in the constructor and
    removes almost all boilerplate code.
    """

    def __init__(self, latent_dim: int, image_size=256, channels=4):
        super(BaseAutoencoder, self).__init__()

        self._latent_dim = latent_dim
        self._image_size = image_size
        self._channels = channels
        self._encoder: tf.keras.Sequential = None
        self._decoder: tf.keras.Sequential = None

    def get_latent_dim(self) -> int:
        return self._latent_dim

    def get_image_size(self) -> int:
        return self._image_size

    def get_channels(self) -> int:
        return self._channels

    def call(self, inputs):
        bottleneck = self._encoder(inputs)
        return self._decoder(bottleneck)

    def encoder(self) -> tf.keras.Sequential:
        return self._encoder

    def decoder(self) -> tf.keras.Sequential:
        return self._decoder

class MultiscaleAutoencoder(BaseAutoencoder):
    """
    A custom multiscale implementation of an AutoEncoder.
    """

    def __init__(self, latent_dim: int, image_size=256, channels=4, base_filter_count=32):
        super(MultiscaleAutoencoder, self).__init__(latent_dim, image_size, channels)
        self._base_filter_count = base_filter_count

        # Input() already returns the symbolic input tensor, so no separate
        # InputLayer call is needed here.
        raw_inp = Input(shape=(image_size, image_size, channels), dtype=tf.float32,
                        name='Input-ENCODER_0')
        enc_inp = raw_inp

        high_res_enc = tf.keras.Sequential([
            create_normalized_convolution(filters=base_filter_count,
                                          strides=2,
                                          seq_name="NormConv-LeakyReLU-HR_ENCODER_1"),
            create_normalized_convolution(filters=base_filter_count * 2,
                                          strides=2,
                                          seq_name="NormConv-LeakyReLU-HR_ENCODER_2"),
            create_normalized_convolution(filters=base_filter_count * 4,
                                          strides=2,
                                          seq_name="NormConv-LeakyReLU-HR_ENCODER_3"),
            create_normalized_convolution(filters=base_filter_count * 8,
                                          strides=2,
                                          seq_name="NormConv-LeakyReLU-HR_ENCODER_4"),
        ], name="HighResolution_ENCODER")(enc_inp)

        med_res = MaxPooling2D(padding='same', name="MaxPoolingMR_ENCODER")(enc_inp)

        medium_res_enc = tf.keras.Sequential([
            create_normalized_convolution(filters=base_filter_count * 2,
                                          strides=2,
                                          seq_name="NormConv-LeakyReLU-MR_ENCODER_1",
                                          input_shape=(int(image_size / 2), int(image_size / 2), channels)),
            create_normalized_convolution(filters=base_filter_count * 4,
                                          strides=2,
                                          seq_name="NormConv-LeakyReLU-MR_ENCODER_2"),
            create_normalized_convolution(filters=base_filter_count * 8,
                                          strides=2,
                                          seq_name="NormConv-LeakyReLU-MR_ENCODER_3"),
        ], name="MediumResolution_ENCODER")(med_res)

        low_res_enc = tf.keras.Sequential([
            MaxPooling2D(padding='same', name="MaxPoolingLR_ENCODER",
                         input_shape=(int(image_size / 2), int(image_size / 2), channels)),
            create_normalized_convolution(filters=base_filter_count * 4,
                                          strides=2,
                                          seq_name="NormConv-LeakyReLU-LR_ENCODER_1"),
            create_normalized_convolution(filters=base_filter_count * 8,
                                          strides=2,
                                          seq_name="NormConv-LeakyReLU-LR_ENCODER_2"),
        ], name="LowResolution_ENCODER")(med_res)

        lower_res_concat = concatenate([
            medium_res_enc,
            low_res_enc
        ], name="LowerResolutionsMerger_ENCODER")

        res_concat = Concatenate(name="Multiscale-ENCODER_2")([
            high_res_enc,
            lower_res_concat,
        ])

        conv_3_enc = create_normalized_convolution(filters=self._latent_dim ** 2 * 3,
                                                   seq_name="NormConv-LeakyReLU_ENCODER_3")(res_concat)
        glob_max_pool_enc = GlobalMaxPool2D(name="GlobalMaxPooling_ENCODER_4")(conv_3_enc)
        out_enc = Dense(self._latent_dim ** 2 * 3, name="Dense-ENCODER_5")(glob_max_pool_enc)

        self._encoder = tf.keras.Model(inputs=raw_inp, outputs=out_enc, name="Encoder")

        raw_dec_inp = Input(shape=(self._latent_dim ** 2 * 3,), dtype=tf.float32,
                            name='Input-DECODER_0')
        dec_inp = raw_dec_inp

        reshape_dec = Reshape(target_shape=[self._latent_dim, self._latent_dim, 3], name='Reshape-DECODER_1')(dec_inp)
        conv_1_dec = create_normalized_convolution_transpose(filters=self._latent_dim ** 2 * 3,
                                                             seq_name="NormConv-LeakyReLU_DECODER_2")(reshape_dec)

        hr_dec = tf.keras.Sequential([
            create_normalized_convolution_transpose(filters=base_filter_count * 8,
                                                    strides=2,
                                                    seq_name="NormConv-LeakyReLU-HR_DECODER_1"),
            create_normalized_convolution_transpose(filters=base_filter_count * 4,
                                                    strides=2,
                                                    seq_name="NormConv-LeakyReLU-HR_DECODER_2"),
            create_normalized_convolution_transpose(filters=base_filter_count * 2,
                                                    strides=2,
                                                    seq_name="NormConv-LeakyReLU-HR_DECODER_3"),
            create_normalized_convolution_transpose(filters=base_filter_count,
                                                    strides=2,
                                                    seq_name="NormConv-LeakyReLU-HR_DECODER_4")
        ], name="HighResolution_DECODER")(conv_1_dec)

        mr_dec = tf.keras.Sequential([
            create_normalized_convolution_transpose(filters=base_filter_count * 8,
                                                    strides=2,
                                                    seq_name="NormConv-LeakyReLU-MR_DECODER_1"),
            create_normalized_convolution_transpose(filters=base_filter_count * 4,
                                                    strides=2,
                                                    seq_name="NormConv-LeakyReLU-MR_DECODER_2"),
            create_normalized_convolution_transpose(filters=base_filter_count * 2,
                                                    strides=2,
                                                    seq_name="NormConv-LeakyReLU-MR_DECODER_3")
        ], name="MediumResolution_DECODER")(conv_1_dec)

        lr_dec = tf.keras.Sequential([
            create_normalized_convolution_transpose(filters=base_filter_count * 8,
                                                    strides=2,
                                                    seq_name="NormConv-LeakyReLU-LR_DECODER_1"),
            create_normalized_convolution_transpose(filters=base_filter_count * 4,
                                                    strides=2,
                                                    seq_name="NormConv-LeakyReLU-LR_DECODER_2"),
            UpSampling2D(name="UpSampling2x-LR_DECODER")
        ], name="LowResolution_DECODER")(conv_1_dec)

        concat_lower_res = Concatenate(name="LowerResolutionsMerger_DECODER")([
            mr_dec,
            lr_dec
        ])

        up_sample_mr = UpSampling2D(name="UpSampling2x-MR_DECODER")(concat_lower_res)

        concat_high_res = Concatenate(name="HighResolutionMerger_DECODER")([
            hr_dec,
            up_sample_mr
        ])

        norm_conv_4_dec = create_normalized_convolution_transpose(
            filters=4,
            activation=Activation('sigmoid'),
            seq_name='NormConv-Sigmoid-DECODER_4'
        )(concat_high_res)

        self._decoder = tf.keras.Model(inputs=raw_dec_inp, outputs=norm_conv_4_dec, name='Decoder')

BATCH_SIZE = 24
IMG_SIZE = 256
model = MultiscaleAutoencoder(16)
callbacks = [tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                  factor=0.5, patience=8, min_lr=0.00001),
             tf.keras.callbacks.ModelCheckpoint("models/ms-model1-1", monitor="val_loss"),
             tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=25),
             tf.keras.callbacks.TensorBoard(log_dir="train/ms-log1-1/", histogram_freq=1),
             tf.keras.callbacks.ModelCheckpoint('models/ms-models-epochs/model{epoch:08d}', period=5)]
model.build(tf.TensorShape((None, IMG_SIZE, IMG_SIZE, 4)))
loss = MeanAbsoluteError()
model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=0.032), loss=loss, metrics=['accuracy'])
model.fit(train_ds_final,
          validation_data=val_ds_final,
          epochs=150,
          callbacks=callbacks,
          shuffle=True)
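A side note on the last checkpoint callback: newer TF releases deprecate the `period` argument in favour of `save_freq`. A sketch of an equivalent every-5-epochs checkpoint (when given an integer, `save_freq` counts batches, so the steps-per-epoch value below is an assumption to replace with your real number of batches):

    steps_per_epoch = 100  # assumption: batches per epoch in train_ds_final
    epoch_checkpoint = tf.keras.callbacks.ModelCheckpoint(
        'models/ms-models-epochs/model{epoch:08d}',
        save_freq=5 * steps_per_epoch)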
tilakrayal commented 2 years ago

@cb0s, I was facing a different error while executing the given code. Please find the gist of it here and provide complete code to reproduce the issue. Thanks!

cb0s commented 2 years ago

@tilakrayal This is due to the module structure of my original package: I had a trainer module and a model module. The module containing the autoencoder code was called autoencoder. This is why it didn't work for you.
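For context, my original project was laid out roughly like this (a sketch; everything except the autoencoder module name is an assumption):

    # autoencoder.py: defines MultiscaleAutoencoder (the code from this issue)
    # trainer.py: the training script
    import autoencoder

    model = autoencoder.MultiscaleAutoencoder(16)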

The problem is that I did in fact use a custom dataset. I can provide the loader code, but it is specific to my project, and I cannot share all of my data due to its size.

It is a simple dataset loaded with `dataset = tf.keras.utils.image_dataset_from_directory()` and then mapped through `dataset.map(lambda x, y: (x, x))`.
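A rough sketch of that pipeline (the directory path and parameters are assumptions; color_mode='rgba' is one way to get 4 channels, provided the images are stored as 4-channel files):

    import tensorflow as tf

    dataset = tf.keras.utils.image_dataset_from_directory(
        'data/train',             # hypothetical directory
        image_size=(256, 256),
        batch_size=24,
        color_mode='rgba')        # 4-channel input
    # For an autoencoder, the input image is also the target:
    dataset = dataset.map(lambda x, y: (x, x))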

Do you want me to add the loader code anyway, or do you want to use MNIST for this purpose?

cb0s commented 2 years ago

I should mention, though, that I did not get any errors, only a warning. That is also why I didn't include a stack trace: there was none.

tilakrayal commented 2 years ago

@cb0s, in order to expedite the troubleshooting process, could you please provide the complete code? Thanks!

google-ml-butler[bot] commented 2 years ago

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] commented 2 years ago

Closing as stale. Please reopen if you'd like to work on this further.

google-ml-butler[bot] commented 2 years ago

Are you satisfied with the resolution of your issue?

ucaokylong commented 1 year ago

Aww, I am facing the same problem.

Jayashree2003 commented 1 year ago

I am facing the same problem.