Hello!
This is the expected behavior: if you want any reduction on the loss, you have to do it yourself on your side, not inside the respective compute_loss function.
Hi, thanks for the explanation. I realized it was probably by design, it's just odd that it differs so much in behavior from the PyTorch version. Is there any plan to bring those more in line in this regard? It is probably a breaking change, and I don't have a clear overview of how much would break, even internally within the Transformers library.
Nothing is planned to align this with PyTorch, and we won't. The reason is that when training with a distribution strategy, TensorFlow doesn't allow a reduction other than `NONE` or `SUM`. Since we have our own custom trainer, we cannot apply the change you would like, as it would make it fail in such cases.
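For reference, the pattern TensorFlow documents for custom training loops under a distribution strategy looks roughly like the sketch below. This is not the actual Transformers trainer code; `compute_loss` and `global_batch_size` here are illustrative names.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # The Keras default reduction (AUTO) raises an error when the loss is called
    # inside strategy.run, so per-example losses are requested instead.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE
    )

def compute_loss(labels, logits, global_batch_size):
    per_example = loss_fn(labels, logits)  # one loss value per example/token
    # Scale by the *global* batch size so gradients aggregate correctly across replicas.
    return tf.nn.compute_average_loss(per_example, global_batch_size=global_batch_size)
```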
Makes sense, I didn't think about the incompatibility of the AUTO reduction with distribution strategies and the custom trainer.
I'll try to make a small patch over the weekend with an update to the documentation in the docstrings, as it's currently not in line with the actual (and intended) output.
That would be awesome! Thanks!
Environment info
transformers version: 4.3.0.dev0

Who can help
@jplu
Information
Model I am using (Bert, XLNet ...): TFGPT2LMHeadModel
To reproduce
Steps to reproduce the behavior:
I was converting the example on perplexity calculation of fixed-length models to TensorFlow, and ran into an inconsistency in the implementation of compute_loss compared to the PyTorch version of the model.
For TensorFlow, when calling a model with inputs and labels (model(input_ids=input_ids, labels=labels)), no reduction is applied to the output of the SparseCategoricalCrossentropy loss function (i.e. it is called explicitly with reduction=tf.keras.losses.Reduction.NONE for all tasks), as defined in modeling_tf_utils.py. For PyTorch, the loss function CrossEntropyLoss() is called with the standard reduction (just the mean), which seems a bit unexpected to me.
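For concreteness, a minimal sketch of the shape difference (my own illustration, assuming the pretrained "gpt2" checkpoint and a toy input; exact shapes depend on the tokenized length):

```python
import tensorflow as tf
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
text = "Hello world, this is a test."

# TensorFlow: compute_loss uses SparseCategoricalCrossentropy(reduction=NONE),
# so the returned loss holds one value per predicted token.
tf_ids = tokenizer(text, return_tensors="tf")["input_ids"]
tf_loss = TFGPT2LMHeadModel.from_pretrained("gpt2")(input_ids=tf_ids, labels=tf_ids)[0]
print(tf_loss.shape)  # per-token losses, e.g. (sequence_length - 1,)

# PyTorch: CrossEntropyLoss with the default mean reduction returns a scalar.
pt_ids = tokenizer(text, return_tensors="pt")["input_ids"]
pt_loss = GPT2LMHeadModel.from_pretrained("gpt2")(input_ids=pt_ids, labels=pt_ids)[0]
print(pt_loss.shape)  # torch.Size([]) -- a scalar

# Averaging the TF per-token losses recovers the PyTorch value (up to numerics).
print(float(tf.reduce_mean(tf_loss)), float(pt_loss))
```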
After modifying the code to do an explicit tf.math.reduce_mean on the outcome of the model, I was able to reproduce the PyTorch outcome exactly.
TensorFlow version:

```python
outputs = model(input_ids, labels=target_ids)
log_likelihood = tf.math.reduce_mean(outputs[0] * trg_len)
```

PyTorch version:

```python
outputs = model(input_ids, labels=target_ids)
log_likelihood = outputs[0] * trg_len
```
Expected behavior
The outcome of TFGPT2LMHeadModel.call(input_ids=input_ids, labels=labels) should have the same tensor shape as the outcome of GPT2LMHeadModel.forward(input_ids=input_ids, labels=labels).