huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

TF loss function output inconsistent with PyTorch one for multiple tasks #9771

Closed janjitse closed 3 years ago

janjitse commented 3 years ago

Environment info

Who can help

@jplu,

Information

Model I am using (Bert, XLNet ...): TFGPT2LMHeadModel

The problem arises when using:

The tasks I am working on is:

To reproduce

Steps to reproduce the behavior:

I was converting the perplexity calculation of fixed-length models example to TensorFlow, and ran into an inconsistency in the implementation of compute_loss compared to the implementation in the PyTorch version of the model.

For TensorFlow, when calling a model with inputs and labels (model(input_ids=input_ids, labels=labels)), no reduction is applied to the output of the SparseCategoricalCrossentropy loss function (it is called explicitly with reduction=tf.keras.losses.Reduction.NONE for all tasks), as defined in modeling_tf_utils.py. For PyTorch, the loss function CrossEntropyLoss() is called with the standard reduction (the mean), which seems a bit unexpected to me.

After modifying the code to do an explicit tf.math.reduce_mean on the outcome of the model, I was able to reproduce the PyTorch outcome exactly.

TensorFlow version:

```python
outputs = model(input_ids, labels=target_ids)
log_likelihood = tf.math.reduce_mean(outputs[0] * trg_len)
```

PyTorch version:

```python
outputs = model(input_ids, labels=target_ids)
log_likelihood = outputs[0] * trg_len
```
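To make the difference concrete, here is a minimal pure-Python sketch (framework-independent, with made-up toy logits and labels, not values from the model) showing that averaging the per-token cross-entropy losses, as returned by a reduction=NONE loss, reproduces the single scalar that a default mean reduction would return:

```python
import math

def cross_entropy(logits, label):
    """Per-token cross-entropy: -log(softmax(logits)[label])."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Toy per-token logits and target labels (hypothetical values)
logits = [[2.0, 0.5, 0.1], [0.3, 1.7, 0.2], [1.1, 0.4, 2.2]]
labels = [0, 1, 2]

# "Reduction NONE" style: one loss value per token
per_token = [cross_entropy(lg, lb) for lg, lb in zip(logits, labels)]

# "Mean reduction" style: a single scalar over all tokens
mean_loss = sum(per_token) / len(per_token)
```

The explicit tf.math.reduce_mean in the snippet above plays the role of the final division here.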

Expected behavior

Outcome of TFGPT2LMHeadModel.call(input_ids=input_ids, labels=labels) should have the same tensor shape as the outcome of GPT2LMHeadModel.forward(input_ids=input_ids, labels=labels).

jplu commented 3 years ago

Hello!

This is the expected behavior. If you want any reduction on the loss, you have to do it yourself on your side, not inside the respective compute_loss function.

janjitse commented 3 years ago

Hi, thanks for the explanation. I realized it was probably by design; it's just odd that the behavior differs so much from the PyTorch version. Is there any plan to bring those more in line in this regard? It would probably be a breaking change, and I don't have a clear overview of how much would break, even internally within the Transformers library.

jplu commented 3 years ago

Nothing is planned to align this with PyTorch, and we won't. The reason is that when training with a distribute strategy, TensorFlow doesn't allow a reduction other than NONE or SUM. Since we have our own custom trainer, we cannot apply the change you would like, as it would make training fail in such cases.
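A small plain-Python sketch (hypothetical per-example losses, no TensorFlow) of why the mean reduction is problematic under data parallelism: with uneven per-replica batches, summing the unreduced losses and dividing by the global batch size gives the true mean, while averaging each replica's local mean weights a small replica as heavily as a large one:

```python
# Per-example losses on two replicas; the second got an uneven final batch
replica_losses = [[0.2, 0.4, 0.6, 0.8], [1.0, 3.0]]
global_batch = sum(len(r) for r in replica_losses)

# Correct: each replica SUMs its unreduced losses, then divide by the
# global batch size (this is why NONE/SUM reductions are required)
correct = sum(sum(r) for r in replica_losses) / global_batch

# Wrong: a per-replica mean followed by averaging over replicas gives the
# 2-example replica the same weight as the 4-example one
naive = sum(sum(r) / len(r) for r in replica_losses) / len(replica_losses)
```

Here `correct` is 1.0 but `naive` is 1.25, so a built-in mean reduction inside compute_loss would silently skew distributed training.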

janjitse commented 3 years ago

Makes sense, I didn't think about the incompatibility of the AUTO reduction with distribute strategies and the custom trainer.

I'll try to make a small patch over the weekend with an update to the documentation in the docstrings, as it's currently not in line with the actual (and intended) output.

jplu commented 3 years ago

That would be awesome! Thanks!