rakutentech / stAdv

Spatially Transformed Adversarial Examples with TensorFlow
MIT License

Gradients return NaN values for flow loss #3

Closed. anianruoss closed this issue 6 years ago

anianruoss commented 6 years ago

When running this simple example, gradient_val in lbfgs starts to contain only NaN values after a certain number of iterations. This causes the L-BFGS solver to terminate with the message "ABNORMAL_TERMINATION_IN_LNSRCH" and to report a loss of NaN.

import random

import numpy as np
import stadv
import tensorflow as tf

random.seed(0)
np.random.seed(0)

num_classes = 10
batch_size = 7
C = 1
H = 5
W = 5
tau_val = 0.05

def sample_net(x):
    # toy "network": each logit ends up equal to the sum over all pixel
    # values of x, so every class receives the same score
    left_ones = tf.ones((batch_size, H, 1, W))
    right_ones = tf.ones((batch_size, H, C, num_classes))

    bilinear_sum = tf.squeeze(
        tf.reduce_sum(
            tf.matmul(tf.matmul(left_ones, x), right_ones),
            1
        )
    )

    return bilinear_sum

# random test images, random target labels, and random (non-zero) initial
# flows for the solver
test_images = np.random.random_sample((batch_size, H, W, C)).astype(np.float32)
target_labels = np.random.randint(0, num_classes, batch_size)

flows_x0 = np.random.random_sample((batch_size, 2, H, W))

images = tf.placeholder(tf.float32, shape=[None, H, W, C], name='images')
targets = tf.placeholder(tf.int64, shape=[None], name='targets')
flows = tf.placeholder(tf.float32, shape=[None, 2, H, W], name='flows')
tau = tf.placeholder_with_default(
    tf.constant(tau_val, dtype=tf.float32), shape=[], name='tau'
)

perturbed_images = stadv.layers.flow_st(images, flows, data_format='NHWC')
logits = sample_net(perturbed_images)

loss_adv = stadv.losses.adv_loss(logits, targets)
loss_flow = stadv.losses.flow_loss(flows)
loss = loss_adv + tau * loss_flow  # overall objective: adversarial + tau * flow loss

with tf.Session() as sess:
    tf.global_variables_initializer().run()

    tf_results = stadv.optimization.lbfgs(
        loss,
        flows,
        flows_x0=flows_x0,
        feed_dict={images: test_images, targets: target_labels},
        sess=sess
    )

print(tf_results['loss'])
print(tf_results['info'])

Using the TensorFlow Debugger, I was able to pinpoint the problem to the tf.sqrt inside flow_loss. This can be verified by setting tau_val = 0 (essentially disabling the flow loss), which leads to convergence and a loss of 0.
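For completeness, this is roughly the debugger setup (a sketch assuming the stock TF 1.x tfdbg API; the has_inf_or_nan filter is the standard one from the TensorFlow documentation):

from tensorflow.python import debug as tf_debug

with tf.Session() as sess:
    # wrap the session so every run can be inspected interactively
    sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    # flag every tensor that contains inf or NaN values
    sess.add_tensor_filter('has_inf_or_nan', tf_debug.has_inf_or_nan)
    # ... then run stadv.optimization.lbfgs as above with this session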

Do you know how to fix this problem?

berangerd commented 6 years ago

The problem is coming from the adversarial loss, which is 0 in your case, as you point out (given the output of your sample_net, every class gets the same logit). So there is no trade-off between the adversarial and flow losses, and the solution found to enforce a smooth flow is a constant flow. A constant flow produces NaN gradients because flow_loss takes the square root of the (squared) difference between the flow and its shifted version, and the gradient of the square root diverges at 0. If you used flow_loss with the argument padding_mode='CONSTANT' you would obtain a different behavior (a vanishing flow instead).
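To make the mechanism concrete, here is a minimal standalone sketch (plain TF 1.x, nothing stadv-specific): the derivative of sqrt(s) is 1 / (2 sqrt(s)), which is infinite at s = 0, and the chain rule through s = sum of squared differences then yields inf * 0 = NaN.

import tensorflow as tf

diff = tf.zeros(2)  # a constant flow minus its shifted version: all zeros
norm = tf.sqrt(tf.reduce_sum(diff ** 2))  # an L2 norm, as in flow_loss
grad = tf.gradients(norm, diff)[0]

with tf.Session() as sess:
    print(sess.run(grad))  # prints [nan nan]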

In any case, I think it's just a toy example; you should solve your problem by changing the implementation of your sample_net (to have its output genuinely depend on the perturbed image).

anianruoss commented 6 years ago

Thank you for your detailed answer!

berangerd commented 6 years ago

Glad it solved your issue. Just a couple of quick extra comments for completeness:

If you can think of a more user-friendly treatment of this case let me know.

anianruoss commented 6 years ago

Yes, it would be nice to be able to initialize the solver with zero flows. I was able to fix the problem by adding a small epsilon inside the norm in losses.py:

import sys

import tensorflow as tf

def _l2_diff_norm_squared(t1, t2, axis):
    """Shortcut for getting the squared L2 norm of the difference
    between two tensors when slicing on the second axis.
    """
    # the added epsilon keeps the argument of the norm away from exactly
    # 0, where the gradient of the square root inside tf.norm blows up
    return tf.norm(
        t1[:, axis] - t2[:, axis] + sys.float_info.epsilon,
        ord='euclidean',
        axis=(1, 2)
    ) ** 2

I think this is a more elegant solution than clipping the argument. Do you think you could include it in your pip package?

berangerd commented 6 years ago

Actually, looking back at _l2_diff_norm_squared made me realize that flow_loss differs from Eq. (4) of arXiv:1801.02612: the summation over p (looping over all pixels) is currently done inside the square root, not outside, so the results will be numerically different. Although the idea of enforcing local smoothness is present in the current implementation, it does not implement Eq. (4). Let me fix that ASAP and unit test it against a simple (non-vectorized) calculation.
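For reference, my transcription of Eq. (4) (with $\mathcal{N}(p)$ the neighborhood of pixel $p$, and $\Delta u$, $\Delta v$ the two flow components):

$$\mathcal{L}_{\text{flow}}(f) = \sum_{p}^{\text{all pixels}} \sum_{q \in \mathcal{N}(p)} \sqrt{\left\lVert \Delta u^{(p)} - \Delta u^{(q)} \right\rVert_2^2 + \left\lVert \Delta v^{(p)} - \Delta v^{(q)} \right\rVert_2^2}$$

i.e. the square root is taken per pixel pair, and the sums stay outside of it.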

berangerd commented 6 years ago

I have pushed the modifications (see https://github.com/rakutentech/stAdv/commit/c7ebb7d39c3ae730b72d4e4c08a8c57d5666c1a3) and released version 0.2; you can upgrade with pip install -U stadv. With the correct implementation of the flow loss, the results (as found in the demo notebook) do not look very different. However, it exacerbates the NaN-gradient problem. Similar to the solution you suggested, I have introduced an epsilon parameter in flow_loss (with default value 1e-8) to prevent tf.sqrt(0).
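For the record, here is a condensed sketch of what the corrected loss computes (illustrative only, assuming flows of shape [batch, 2, H, W] and symmetric padding; the actual implementation is in the commit linked above):

import tensorflow as tf

def flow_loss_sketch(flows, epsilon=1e-8):
    # pad H and W so every pixel has 4 neighbors, even at the border
    padded = tf.pad(flows, [[0, 0], [0, 0], [1, 1], [1, 1]], mode='SYMMETRIC')
    neighbors = [
        padded[:, :, 2:, 1:-1],   # neighbor below
        padded[:, :, :-2, 1:-1],  # neighbor above
        padded[:, :, 1:-1, 2:],   # neighbor to the right
        padded[:, :, 1:-1, :-2],  # neighbor to the left
    ]
    loss = 0.
    for n in neighbors:
        diff = flows - n
        # per pixel: sum the squared u and v differences, take the square
        # root, then sum over all pixels; epsilon keeps tf.sqrt away from
        # 0, where its gradient would produce NaN
        loss += tf.reduce_sum(tf.sqrt(tf.reduce_sum(diff ** 2, 1) + epsilon), (1, 2))
    return loss  # one scalar per image in the batch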

I am closing this issue; thank you for pointing this out. Feel free to reopen if anything looks fishy!

anianruoss commented 6 years ago

Perfect, thank you for your help!