w4nderlust opened this issue 3 years ago
@w4nderlust Here is the implementation of _compute_2d_sparsemax. You can see that the implementation actually has specific checks for nan values, so this is not a bug.
The nan values appear because the large inputs provided are outside the "reliably usable" precision range for float32. You could cast your inputs to float64, and then the calculations will still work:
import tensorflow as tf
from tensorflow_addons.activations import sparsemax
single_precision_inputs = tf.constant([1.3e+8, 1.5e+8], dtype=tf.float32)
double_precision_inputs = tf.constant([1.3e+8, 1.5e+8], dtype=tf.float64)
print(f'Sparsemax for single: {sparsemax(single_precision_inputs)}')
print(f'Sparsemax for double: {sparsemax(double_precision_inputs)}')
Sparsemax for single: [nan nan]
Sparsemax for double: [0. 1.]
Just note that using double precision (i.e. float64) inside an actual model is typically significantly slower than using float32, so I would recommend modifying the model architecture or training setup in such a way that the values do not get too large, e.g. by using normalization techniques or reducing the learning rate.
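As a rough illustration of that suggestion (a sketch only; the layer sizes and the choice of LayerNormalization are assumptions, not something from this thread), the values feeding into sparsemax can be normalized inside the model so they stay well within float32 range:
import tensorflow as tf
import tensorflow_addons as tfa
# Sketch: keep the pre-activation values small before applying sparsemax.
inputs = tf.keras.Input(shape=(16,))
x = tf.keras.layers.Dense(8)(inputs)
x = tf.keras.layers.LayerNormalization()(x)  # roughly zero mean, unit variance
outputs = tf.keras.layers.Activation(tfa.activations.sparsemax)(x)
model = tf.keras.Model(inputs, outputs)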
Thanks for the clarification, @aaronmondal! Appreciated.
My personal take, from the user's perspective, is that there should probably be a saturation point above which the same value is returned rather than NaN. It is difficult to anticipate the range of values that multiple layers within a model will produce, so having a function I can trust never to return NaN (as NaNs propagate very quickly), and that instead returns a default, perhaps with a warning, would be very nice.
One can add normalization or clipping, but that is a modeling decision; in fact, adding batch norm to a layer is what caused the issue for me in the first place.
Moreover, the implementation of softmax doesn't suffer from this issue:
import tensorflow as tf
from tensorflow.nn import softmax
single_precision_inputs = tf.constant([1.3e+8, 1.5e+8], dtype=tf.float32)
double_precision_inputs = tf.constant([1.3e+8, 1.5e+8], dtype=tf.float64)
print(f'Softmax for double: {softmax(double_precision_inputs)}')
print(f'Softmax for single: {softmax(single_precision_inputs)}')
Returns:
Softmax for double: [0. 1.]
Softmax for single: [0. 1.]
So maybe the same mechanism adopted in softmax for preventing the issue could be used for sparsemax too? Or maybe softmax could be used as a fallback in those circumstances? Not sure.
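For reference, the mechanism softmax implementations typically rely on is subtracting the per-row maximum from the logits before exponentiating, which leaves the result unchanged but keeps intermediate values small. Sparsemax is likewise invariant to adding a constant to all inputs, so the same shift can already be applied manually; shifted_sparsemax below is a hypothetical helper, not an existing op:
import tensorflow as tf
from tensorflow_addons.activations import sparsemax
# Hypothetical helper (not part of tensorflow_addons): shift the logits so
# their maximum is 0 before calling sparsemax. The shift does not change the
# result, but it keeps intermediate values within float32's usable range.
def shifted_sparsemax(logits, axis=-1):
    return sparsemax(logits - tf.reduce_max(logits, axis=axis, keepdims=True), axis=axis)
print(shifted_sparsemax(tf.constant([1.3e+8, 1.5e+8], dtype=tf.float32)))
# Expected output: [0. 1.]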
The origin of this behaviour was in https://github.com/tensorflow/tensorflow/pull/21183
Ok I think I see where the problem is.
A normalization was removed in the PR that @bhack pointed out. It may be possible to significantly increase the range over which sparsemax operates by putting that back in.
A reduce_mean would make calculations slightly slower, but I assume the usable range would expand (roughly, very roughly :D) from something like |cumsum(inputs)| < B to |max(inputs) - min(inputs)| < B, where B is a value above which the calculations collapse.
If we were to change this, I also assume that we would need specific checks for nan/inf/-inf inputs, since reduce_mean probably doesn't handle those well. These checks would have to run on every call, so this would mean even slower execution.
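To make the range argument above concrete, here is a small sketch (an assumption about what such a normalization could look like, not the actual implementation) of centering the inputs with reduce_mean; after the shift, only the spread of the values matters, not their absolute magnitude:
import tensorflow as tf
# Centering the large inputs from the earlier example turns them into small,
# well-conditioned values before the rest of the sparsemax computation.
logits = tf.constant([[1.3e+8, 1.5e+8]], dtype=tf.float32)
centered = logits - tf.reduce_mean(logits, axis=-1, keepdims=True)
print(centered)  # roughly [[-1.e+07  1.e+07]]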
@w4nderlust TBH I don't think expanding the range is worth the slowdown and increased code complexity. Since a "fix" would essentially just be a normalization step inside the sparsemax implementation anyway, I think updating the docs with a note like "If you experience numerical instabilities with this op, consider normalizing your inputs." would be the best way to handle this.
I understand, @aaronmondal.
However, I am not using sparsemax for one specific model; my project is itself a framework. That means I cannot be sure what inputs users will provide, so I have to be careful and defensive.
For this reason, I see two options. One of them would be adding a separate op, something like a "slow_sparsemax", which has an increased range. I believe this has two advantages: it allows you to point to it in the docs, saying "if you are experiencing numerical instability, please use slow_sparsemax", and it also allows people like me who are building frameworks to build on top of sparsemax with a more reliable building block offering a different speed/range tradeoff.
Either way, I greatly appreciate your help.
/cc @rmlarsen, who was the reviewer of https://github.com/tensorflow/tensorflow/pull/21183.
System information
Describe the bug
For very large values, sparsemax returns NaN.
Code to reproduce the issue
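A minimal snippet, based on the example discussed in the thread above:
import tensorflow as tf
from tensorflow_addons.activations import sparsemax
# Very large float32 inputs make sparsemax return NaN.
print(sparsemax(tf.constant([1.3e+8, 1.5e+8], dtype=tf.float32)))
# [nan nan]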