tensorflow / neural-structured-learning

Training neural models with structured signals.
https://www.tensorflow.org/neural_structured_learning
Apache License 2.0

Input normalisation incompatible with keras.applications.EfficientNet? #96

Closed chrisrapson closed 3 years ago

chrisrapson commented 3 years ago

From the tensorflow tutorial for nsl: "To make the model numerically stable, we normalize the pixel values to [0, 1] by mapping the dataset over the normalize function."

And from the keras documentation for EfficientNet: "EfficientNet models expect their inputs to be float tensors of pixels with values in the [0-255] range."

These look like incompatible requirements. Am I understanding that right?

Adding Adversarial Regularisation to my EfficientNet model with inputs in the [0-255] range gave NaN losses:

    512/4760 [==>...........................] - ETA: 35:48 - loss: nan - categorical_crossentropy: 1.7834 - categorical_accuracy: 0.2988 - scaled_adversarial_loss: nan

Switching to Xception and normalising to the [0, 1] range looks to be training as expected:

    1152/4760 [======>.......................] - ETA: 28:15 - loss: 1.7010 - categorical_crossentropy: 1.4170 - categorical_accuracy: 0.1875 - scaled_adversarial_loss: 0.2840
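
For reference, my [0, 1] normalisation follows the tutorial's pattern; a minimal sketch, with placeholder names rather than my actual pipeline:

    import tensorflow as tf

    # Map pixel values from [0, 255] to [0, 1], as in the NSL tutorial.
    def normalize(image, label):
        image = tf.cast(image, tf.float32) / 255.0
        return image, label

    # dataset = dataset.map(normalize).batch(32)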

I'm not sure if this is really an issue with the NSL functions you have developed. I think it's strange that Keras treats EfficientNet differently from all the previous applications. I just want to check: do my observations match your expectations? And do you have any suggestions for how to work around it?

csferng commented 3 years ago

Hi @chrisrapson.

Different models may handle input preprocessing differently. In general, scaling input values to [0, 1] or [-1, 1] helps improve a model's numerical stability during training. The EfficientNet model has this built in, as noted here: "For EfficientNet, input preprocessing is included as part of the model (as a Rescaling layer), ..." The example model in the NSL tutorial, on the other hand, is a simple feed-forward network without such a rescaling layer, so we rescale the input during dataset preparation instead. If you are adding adversarial regularization to EfficientNet, it's fine to let the input stay in [0, 255] and rely on the built-in Rescaling layer.
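
For concreteness, here is a minimal sketch of wrapping EfficientNetB0 with NSL's adversarial regularization while keeping raw [0, 255] pixels; the input name, image size, class count, and regularization hyperparameters are illustrative assumptions, not values from your setup:

    import neural_structured_learning as nsl
    import tensorflow as tf

    # Name the input explicitly so the feature key in the input dict can
    # match it (EfficientNet's auto-generated input name may differ).
    inputs = tf.keras.Input(shape=(224, 224, 3), name='image')
    outputs = tf.keras.applications.EfficientNetB0(weights=None, classes=10)(inputs)
    base_model = tf.keras.Model(inputs, outputs)

    adv_config = nsl.configs.make_adv_reg_config(multiplier=0.2, adv_step_size=0.05)
    adv_model = nsl.keras.AdversarialRegularization(
        base_model, label_keys=['label'], adv_config=adv_config)

    adv_model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])

    # NSL consumes batches as dictionaries of features and labels. Pixels
    # stay in [0, 255]; EfficientNet's Rescaling layer handles the scaling.
    # train_ds = raw_ds.map(lambda x, y: {'image': x, 'label': y}).batch(32)
    # adv_model.fit(train_ds, epochs=5)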

Regarding the NaN issue, could you share your model and training setup? In this colab I trained an EfficientNetB0 model with adversarial regularization on the ImageNet dataset without running into NaNs.

chrisrapson commented 3 years ago

Thanks for the example. I will work through the differences between that and my code and try to figure out what is causing the problem. Unfortunately, my code has too much proprietary information embedded at the moment. If I figure out a minimal working example I will post that.

The first thing I noticed is that I am using tf version 2.3.0. Trying to run your code with tf2.3 gives an error:

    OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

...

  File "C:\Users\username\project\test_nsl_efficientnet.py", line 34, in <module>
    dataset = dataset.map(preprocess).batch(32)

When I tried upgrading to tf 2.5 your code ran fine... on CPU. I'm in the process of upgrading my CUDA and cuDNN environment at the moment, I'll keep you posted on the results. Interestingly, my code also ran fine (no NaNs) on CPU.

chrisrapson commented 3 years ago

It looks like updating to tf2.5 fixed the issue for me. I haven't dug deeper to figure out what the difference is. You may want to run some tests, and if tf2.3 fails for you as well, update the requirements to include tf2.5.

I have now hit the NotImplementedError for "Saving AdversarialRegularization models is currently not supported." I'm trying the suggested workaround with saving the base_model and the weights separately. I haven't found any examples, but should be able to figure it out. If I hit any complications, I will raise a new issue.
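
In case it helps others, a minimal sketch of the workaround as I understand it; it assumes the adversarial wrapper trains the wrapped base_model's weights in place, and the paths are placeholders:

    # The AdversarialRegularization wrapper can't be saved directly, but it
    # updates the wrapped base_model's weights during training, so saving
    # the base model preserves everything needed for inference.
    base_model.save('base_model.h5')

    # To continue adversarial training later, reload and re-wrap:
    # restored = tf.keras.models.load_model('base_model.h5')
    # adv_model = nsl.keras.AdversarialRegularization(
    #     restored, label_keys=['label'], adv_config=adv_config)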

From my point of view, this issue can be closed. I will leave it open for now in case you want to investigate the tf version compatibility.

chrisrapson commented 3 years ago

Sorry... I spoke too soon.

I'd deactivated mixed_precision to simplify my testing, and when I added it back in, I hit the error again. It turns out mixed_precision is a more likely candidate for the source of the error than the type of Keras model. (Except that I can't reproduce the error with your demo code, so I intend to keep digging tomorrow.)

There is a bug in tf2.5 that makes mixed_precision and EfficientNet incompatible: https://github.com/tensorflow/tensorflow/issues/50171. That bug is fixed in tensorflow==2.6.0rc1, so I have used that for the following tests:

    Code         TF version  Keras model     mixed_precision  AdversarialRegularization  NaN?
    chrisrapson  2.6.0rc1    EfficientNetB0  no               yes                        not before end of epoch 2 (476 batches per epoch)
    chrisrapson  2.6.0rc1    EfficientNetB0  yes              no                         not before end of epoch 2 (476 batches per epoch)
    chrisrapson  2.6.0rc1    EfficientNetB0  yes              yes                        almost immediately
    chrisrapson  2.6.0rc1    Xception        yes              yes                        ~batch 373
    csferng      2.6.0rc1    EfficientNetB0  yes              yes                        not before end of epoch 3

At the same time, I noticed that when I was training with my data and AdversarialRegularization, the validation accuracy did not improve. In the past, that's been a sign that my inputs have been incorrectly scaled. I will double-check that tomorrow. (I don't think I have made any changes to how I preprocess data, and your first reply said that I shouldn't need to scale inputs any differently with or without AdversarialRegularization.)
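
For reference, mixed precision in the tests above was enabled through the Keras global policy API; a minimal sketch, since my actual training script can't be shared:

    from tensorflow.keras import mixed_precision

    # Enable mixed precision globally, before any layers are built.
    mixed_precision.set_global_policy('mixed_float16')

    # The TF mixed precision guide recommends keeping the final
    # classification activation in float32 for numerical stability, e.g.:
    # outputs = tf.keras.layers.Activation('softmax', dtype='float32')(logits)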

csferng commented 3 years ago

@chrisrapson, besides the input scaling issue, another possible source of numerical instability is the gradient computation for a cross-entropy loss applied after a softmax activation.

A common workaround is to let the model output logits (i.e. no activation in the last layer), and compute the loss from logits directly (e.g. tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) instead of tf.keras.losses.sparse_categorical_crossentropy).
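
A minimal sketch of this pattern (layer sizes and the optimizer are illustrative):

    import tensorflow as tf

    # No activation on the last layer, so the model outputs raw logits.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10),  # linear output: logits
    ])

    # from_logits=True fuses softmax and cross-entropy into one numerically
    # stable op, avoiding log(0) when a softmax probability underflows.
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'])

    # If probabilities are needed at inference time, apply softmax explicitly:
    # probs = tf.nn.softmax(model(x))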

csferng commented 3 years ago

Closing this issue for now. Feel free to reopen if you have any follow-up questions.