beyarkay opened this issue 1 year ago
@beyarkay, I tried to reproduce the mentioned code on TensorFlow v2.12. Kindly find the gist of it here, and could you please confirm that the output was similar at your end as well? Thank you!
Yes, I can confirm that the programs are the same. Both of them:
Do you know if the coercion from logits to probabilities is intended behaviour or not? It seems like a mistake to me, but I'm not sure about the history of the function (and git blame wasn't particularly enlightening).
@tilakrayal any chance of an update on this issue before the weekend?
Thanks for reporting the issue.
AFAIK, we don't have an explicit check/validation that the y_pred probabilities sum to 1, but the user is expected to provide probability-distribution values. In general use cases, for example when the softmax activation function is used in your model, it produces a probability distribution.
Consider the example below:
import tensorflow as tf
import numpy as np
y_true = [[0, 1, 0], [0, 0, 1]]
y_logits = [[0.15, 0.05, 0], [0.01, 0.08, 0.01]]
y_pred = tf.keras.layers.Softmax()(np.array(y_logits)).numpy()
print(y_pred)
# [[0.36159232 0.32718226 0.3112254 ]
#  [0.32546702 0.34906602 0.32546702]]
The issue I have is more that I can pass data to tf.keras.losses.categorical_crossentropy which is obviously incorrect, but no errors are raised.
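For example, this sketch (with made-up logits) runs without complaint and quietly returns a different answer from the one that was intended:
import tensorflow as tf

y_true = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Made-up raw logits: the rows clearly don't sum to 1, so they aren't probabilities.
y_logits = [[2.0, 1.0, 0.1], [0.5, 3.0, 0.2]]

# Forgetting from_logits=True: the values are silently renormalized and a
# finite loss comes back, with no error or warning.
silently_wrong = tf.keras.losses.categorical_crossentropy(y_true, y_logits)

# What was actually intended:
intended = tf.keras.losses.categorical_crossentropy(y_true, y_logits, from_logits=True)

print(silently_wrong.numpy(), intended.numpy())  # two different answers, no complaint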
If you ran a store that accepts USD and I tried to pay you in INR, you would check the currency and tell me to go exchange my money. You definitely wouldn't take 100 INR, tell yourself that it's the same as 100 USD because I forgot to tell you that I'm giving you INR, and then let me buy whatever I want that's worth 100 USD.
But right now, keras happily takes my input, sees that the values don't sum to 1 (so they definitely aren't valid probabilities), and nonetheless coerces them into probabilities because the user didn't check the box that says "these aren't probabilities". And then the user doesn't even get a warning that the coercion was made.
People make mistakes. I can understand not wanting to add checks to the library that the user really should be doing themselves, since that would just cause the complexity of the codebase to explode. But when the code takes the input and forces it to sum to 1, it does the worst of both worlds: it doesn't check that the user's input was correct, but it also succeeds when the user's input was incorrect.
Would it be possible to remove the line at https://github.com/keras-team/keras/blob/master/keras/backend.py#L5589 and raise an error if the probabilities do not sum to 1? I'm happy to write the PR myself; I just want to know that it won't be shot down if I do so.
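For the sake of discussion, the check I have in mind could look something like this sketch (check_sums_to_one is a hypothetical helper, not existing keras code, and tf.debugging.assert_near works both eagerly and inside a graph):
import tensorflow as tf

def check_sums_to_one(output, axis=-1, atol=1e-3):
    # Hypothetical replacement for the silent renormalization: fail loudly instead.
    row_sums = tf.reduce_sum(output, axis=axis)
    tf.debugging.assert_near(
        row_sums, tf.ones_like(row_sums), atol=atol,
        message="y_pred is not a probability distribution; did you forget from_logits=True?")
    return output

check_sums_to_one(tf.constant([[0.3, 0.7]]))   # passes
# check_sums_to_one(tf.constant([[2.0, 1.0]])) # raises InvalidArgumentError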
Hi @beyarkay, while waiting for the team's response, let me provide some input.
I think this requires much more effort than adding a simple assertion. Here are some of the errors you can get:
1) OperatorNotAllowedInGraphError: using a symbolic tf.Tensor as a Python bool is not allowed
2) Using a symbolic tf.Tensor as a Python bool is not allowed
3) RaggedTensor object has no len()
So checking for and catching mistakes in graph mode is a challenging problem, and RaggedTensor function calls work differently (as far as I know).
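To illustrate, here is a rough sketch (toy functions, not production code): a plain Python truth test on a symbolic tensor fails during tracing, while the tf.debugging assert ops are built into the graph and do work:
import tensorflow as tf

@tf.function  # traced in graph mode, so y_pred is a symbolic tensor
def naive_check(y_pred):
    # Evaluating a symbolic tensor as a Python bool fails during tracing
    # with the kind of error listed above.
    if bool(tf.reduce_sum(y_pred) != 1.0):
        raise ValueError("y_pred does not sum to 1")
    return y_pred

try:
    naive_check(tf.constant([0.2, 0.3, 0.4]))
except Exception as e:
    print(type(e).__name__)  # e.g. OperatorNotAllowedInGraphError

@tf.function
def graph_safe_check(y_pred):
    # tf.debugging.assert_near becomes part of the graph, so it does work here.
    tf.debugging.assert_near(tf.reduce_sum(y_pred, axis=-1), 1.0,
                             message="y_pred rows must sum to 1")
    return y_pred

graph_safe_check(tf.constant([0.2, 0.3, 0.5]))  # passes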
Hi @Frightera, thanks for raising the subtleties that I missed. I agree that the proof-of-concept assert I had would not be good for a production-ready system. There absolutely are errors that will need to be handled, but I wanted to know from the keras team whether they would merge a PR that solved this before I spent the time implementing a battle-ready fix for this issue.
The reason I'm hesitant is that the original code had a comment explicitly describing this behaviour:
# Adjust the predictions so that the probability of
# each class for every sample adds up to 1
# This is needed to ensure that the cross entropy is
# computed correctly.
output = output / tf.reduce_sum(output, axis, True)
This makes me think that it might be a feature, and not a bug. While I'm not sure what the feature would be, I'm eager to hear from the keras team about whether or not this coercion is intended behaviour.
Once I get confirmation that they would (theoretically) merge a fix, I'll start work on a keras-native way to check that the user isn't passing logits in without specifying from_logits=True.
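As a starting point, and ignoring the graph-mode complications @Frightera raised, the heuristic I'm imagining might be something like this sketch (warn_if_probably_logits is a hypothetical helper, not part of keras):
import warnings
import numpy as np

def warn_if_probably_logits(y_pred, atol=1e-3):
    # Heuristic: values outside [0, 1], or rows far from summing to 1, suggest
    # the caller passed logits without setting from_logits=True.
    y = np.asarray(y_pred, dtype=np.float64)
    if (y < 0).any() or (y > 1).any() or not np.allclose(y.sum(axis=-1), 1.0, atol=atol):
        warnings.warn("y_pred does not look like a probability distribution; "
                      "did you mean to pass from_logits=True?")
    return y_pred

warn_if_probably_logits([[2.0, 1.0, 0.1]])  # warns
warn_if_probably_logits([[0.3, 0.7]])       # stays silent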
Hi @beyarkay, thank you so much for reporting this. The Keras team has looked into it and will not go ahead with this at this point in time, as this is not a bug, as mentioned.
Thanks for the response, could you clarify why this is not considered to be unexpected behaviour?
@beyarkay, based on the comment here https://github.com/keras-team/tf-keras/issues/47 and here https://github.com/keras-team/tf-keras/issues/47, this is not considered a bug.
It seems like @Frightera's comment just means that the implementation will be difficult, and has nothing to do with whether this is a bug or not.
And then for this comment:
AFAIK, We don't have explicit check/validation for y_pred probabilities sum to 1, but it is expected to provide the probability distribution values.
I don't understand why this means the coercion isn't a bug. Either the library should validate that the inputs to a function are correct, or it should do no validation but make it clear to the caller that they need to check the inputs themselves.
Keras seems to be doing the worst of both worlds, where good inputs are pointlessly put through an operation to ensure they sum to 1 (even though they should already sum to 1) and bad inputs are silently coerced to sum to 1.
Would the maintainers be open to removing this line
output = output / tf.reduce_sum(output, axis, True)
so that bad inputs will raise an error and good inputs will be slightly faster to process?
Would the maintainers be open to removing this line
output = output / tf.reduce_sum(output, axis, True)
so that bad inputs will raise an error and good inputs will be slightly faster to process?
This might break some edge cases where one might want to calculate the cross-entropy using just part of the post-softmax probability vector (instead of the usual calculation on the entire softmax output) i.e. the probabilities in that vector do not sum up to 1.
Normalizing the vector is still required to calculate the proper cross-entropy, and assuming the inputs are logits simply because they don't sum up to 1 will also result in a wrong result.
I'm not sure about the implementation of something like this, but maybe checking for the presence of a softmax activation during compilation followed by a warning if it appears incompatible with the loss might be another possible solution?
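For concreteness, here is a rough sketch of that partial-vector case (the numbers are made up); the renormalization in the backend is exactly what makes it give the intended cross-entropy:
import tensorflow as tf

# Made-up 5-class softmax output, but only classes 2-4 are scored.
full_probs = tf.constant([[0.05, 0.10, 0.50, 0.30, 0.05]])
subset = full_probs[:, 2:]                    # rows sum to 0.85, not 1
y_true = tf.constant([[1.0, 0.0, 0.0]])

# Renormalizing the subset is what gives the "proper" cross-entropy here:
renorm = subset / tf.reduce_sum(subset, axis=-1, keepdims=True)
manual = -tf.reduce_sum(y_true * tf.math.log(renorm), axis=-1)

# The current keras behaviour does the same renormalization internally,
# so this agrees with the manual calculation (up to epsilon clipping).
via_keras = tf.keras.losses.categorical_crossentropy(y_true, subset)
print(manual.numpy(), via_keras.numpy())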
This might break some edge cases where one might want to calculate the cross-entropy using just part of the post-softmax probability vector (instead of the usual calculation on the entire softmax output) i.e. the probabilities in that vector do not sum up to 1.
I'm not 100% sure about the math, but as far as I can tell this isn't a mathematically valid thing to do? (Please help me understand/point me in the right direction if I am wrong here).
Normalizing the vector is still required to calculate the proper cross-entropy
Apologies again, but I don't understand this. What are the circumstances where the vector does not sum to one but calculating its cross-entropy is a valid thing to do and so we would want to normalize the vector?
System information.
TensorFlow 2.10.0
Python 3.10.5
Describe the problem.
The code for cross entropy will silently coerce the output vector so that it sums to one. It is very easy to forget about the from_logits parameter (which is False by default) and mistakenly pass logits to cross entropy which don't sum to one but which get silently scaled to sum to one regardless. The code snippet in question: https://github.com/keras-team/keras/blob/master/keras/backend.py#L5589
This sort of error is easy to catch (by just checking that the output isn't outrageously far from summing to 1) and would solve a lot of common mistakes made by keras beginners.
Describe the current behavior.
If I mistakenly pass logits to cross entropy loss and forget about the from_logits parameter, I will get no error but will get incorrect results.
Describe the expected behavior.
If I pass something to cross entropy loss that's obviously not probabilities, it's probably a user error and the command should fail.
Contributing.
I'm happy to write a PR that raises an error if output sums to a value that isn't within a tolerance of 1.
Standalone code to reproduce the issue.
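A minimal sketch of the kind of snippet that shows the behaviour (made-up logits; CategoricalCrossentropy defaults to from_logits=False):
import tensorflow as tf

y_true = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y_logits = [[1.5, 0.3, -0.6], [0.2, 2.1, 0.4]]   # clearly not probabilities

# Default from_logits=False: the values are silently rescaled to sum to 1 and
# a finite loss is returned, with no error or warning.
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(y_true, y_logits).numpy())

# The same inputs with from_logits=True give the value that was intended.
cce_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(cce_logits(y_true, y_logits).numpy())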
Thanks!