w4nderlust opened this issue 3 years ago
@w4nderlust Here is the implementation of _compute_2d_sparsemax. You can see that the implementation actually has specific checks for nan values, so this is not a bug.
The nan values appear because the large inputs provided are outside the "reliably usable" precision range for float32. You could cast your inputs to float64, and then the calculations will still work:
import tensorflow as tf
from tensorflow_addons.activations import sparsemax
single_precision_inputs = tf.constant([1.3e+8, 1.5e+8], dtype=tf.float32)
double_precision_inputs = tf.constant([1.3e+8, 1.5e+8], dtype=tf.float64)
print(f'Sparsemax for single: {sparsemax(single_precision_inputs)}')
print(f'Sparsemax for double: {sparsemax(double_precision_inputs)}')
Sparsemax for single: [nan nan]
Sparsemax for double: [0. 1.]
Just note that using double precision (i.e. float64) inside an actual model is typically significantly slower than using float32, so I would recommend modifying the model architecture or training setup in such a way that the values do not get too large, e.g. by using normalization techniques or reducing the learning rate.
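As a rough illustration of that suggestion (a sketch only; the layer sizes and the choice of LayerNormalization are assumptions, not something from this thread), the values feeding into sparsemax can be normalized inside the model so they stay well within float32 range:
import tensorflow as tf
import tensorflow_addons as tfa
# Sketch: keep the pre-activation values small before applying sparsemax.
inputs = tf.keras.Input(shape=(16,))
x = tf.keras.layers.Dense(8)(inputs)
x = tf.keras.layers.LayerNormalization()(x)  # roughly zero mean, unit variance
outputs = tf.keras.layers.Activation(tfa.activations.sparsemax)(x)
model = tf.keras.Model(inputs, outputs)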
Thanks for the clarification, @aaronmondal! Appreciated.
My personal take, from the user's perspective, is that there should probably be a saturation point above which the same value is returned rather than NaN. It is difficult to anticipate the range of values that multiple layers within a model will produce, so having a function I can trust never to return NaN (as NaNs propagate very quickly), and that instead returns a default, perhaps with a warning, would be very nice.
One can add normalization or clipping, but that is a modeling decision; in fact, adding batch norm to a layer is what caused the issue for me in the first place.
Moreover, the implementation of softmax doesn't suffer from this issue:
import tensorflow as tf
from tensorflow.nn import softmax
single_precision_inputs = tf.constant([1.3e+8, 1.5e+8], dtype=tf.float32)
double_precision_inputs = tf.constant([1.3e+8, 1.5e+8], dtype=tf.float64)
print(f'Softmax for double: {softmax(double_precision_inputs)}')
print(f'Softmax for single: {softmax(single_precision_inputs)}')
Returns:
Softmax for double: [0. 1.]
Softmax for single: [0. 1.]
So maybe the same mechanism adopted in softmax for preventing the issue could be used for sparsemax too? Or maybe softmax could be used as a fallback in those circumstances? Not sure.
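For reference, the mechanism softmax implementations typically rely on is subtracting the per-row maximum from the logits before exponentiating, which leaves the result unchanged but keeps intermediate values small. Sparsemax is likewise invariant to adding a constant to all inputs, so the same shift can already be applied manually; shifted_sparsemax below is a hypothetical helper, not an existing op:
import tensorflow as tf
from tensorflow_addons.activations import sparsemax
# Hypothetical helper (not part of tensorflow_addons): shift the logits so
# their maximum is 0 before calling sparsemax. The shift does not change the
# result, but it keeps intermediate values within float32's usable range.
def shifted_sparsemax(logits, axis=-1):
    return sparsemax(logits - tf.reduce_max(logits, axis=axis, keepdims=True), axis=axis)
print(shifted_sparsemax(tf.constant([1.3e+8, 1.5e+8], dtype=tf.float32)))
# Expected output: [0. 1.]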
The origin of this behaviour was in https://github.com/tensorflow/tensorflow/pull/21183
Ok I think I see where the problem is.
A normalization was removed in the PR that @bhack pointed out. It may be possible to significantly increase the range over which sparsemax operates by putting that back in.
A reduce_mean would make calculations slightly slower, but I assume the usable range would expand (roughly, very roughly :D) from something like |cumsum(inputs)| < B to |max(inputs) - min(inputs)| < B, where B is a value above which the calculations collapse.
If we were to change this, I also assume that we would need specific checks for nan/inf/-inf inputs, since reduce_mean probably doesn't handle those well. These checks would have to run on every call, so this would mean even slower execution.
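To make the range argument above concrete, here is a small sketch (an assumption about what such a normalization could look like, not the actual implementation) of centering the inputs with reduce_mean; after the shift, only the spread of the values matters, not their absolute magnitude:
import tensorflow as tf
# Centering the large inputs from the earlier example turns them into small,
# well-conditioned values before the rest of the sparsemax computation.
logits = tf.constant([[1.3e+8, 1.5e+8]], dtype=tf.float32)
centered = logits - tf.reduce_mean(logits, axis=-1, keepdims=True)
print(centered)  # roughly [[-1.e+07  1.e+07]]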
@w4nderlust TBH I don't think expanding the range is worth the slowdown and increased code complexity. Since a "fix" would essentially just be a normalization step inside the sparsemax implementation anyway, I think updating the docs with a note like "If you experience numerical instabilities with this op, consider normalizing your inputs." would be the best way to handle this.
I understand, @aaronmondal.
However, I am not using sparsemax for one specific model; my project is itself a framework. That means I cannot be sure what inputs users will provide, so I have to be careful and defensive.
For this reason, I see two options. One of them would be adding a separate op, something like a "slow_sparsemax", which has an increased range. I believe this has two advantages: it allows you to point to it in the docs, saying "if you are experiencing numerical instability, please use slow_sparsemax", and it also allows people like me who are building frameworks to build on top of sparsemax with a more reliable building block offering a different speed/range tradeoff.
Either way, I greatly appreciate your help.
/cc @rmlarsen, who was the reviewer of https://github.com/tensorflow/tensorflow/pull/21183.
System information
Describe the bug
For very large values, sparsemax returns NaN.
Code to reproduce the issue
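A minimal snippet, based on the example discussed in the thread above:
import tensorflow as tf
from tensorflow_addons.activations import sparsemax
# Very large float32 inputs make sparsemax return NaN.
print(sparsemax(tf.constant([1.3e+8, 1.5e+8], dtype=tf.float32)))
# [nan nan]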