google-deepmind / language_modeling_is_compression


log-loss vs cross-entropy #5

Closed: atiorh closed this issue 12 months ago

atiorh commented 1 year ago

Hi @anianruoss, great work and thanks for sharing the code to reproduce the experiments!

I am having trouble understanding why the Transformer models are not trained with the standard softmax cross-entropy loss (normalized across the vocabulary), but are instead trained to maximize the unnormalized logit of the observed/correct tokens. That objective does not enforce a high (vocabulary-normalized) likelihood for the correct tokens, since the average unnormalized logit value can simply drift higher under this loss. The paper also describes the loss as follows: "However, Eq. (2) is exactly the same objective used to train current foundation models, i.e., the log-loss" (Section 2), but current foundation models (autoregressive LMs) do use standard softmax cross-entropy, unless I am grossly mistaken (example).
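
For concreteness, here is a minimal sketch (not the repository's code; shapes and names are assumed) contrasting the two objectives in question: the standard softmax cross-entropy, which scores the normalized log-probability of the correct token, and a loss that only maximizes the correct token's raw logit.

```python
import jax
import jax.numpy as jnp


def softmax_cross_entropy(logits, targets):
  """Standard log-loss: negative normalized log-probability of the target."""
  # logits: [batch, seq_len, vocab_size], targets: [batch, seq_len] (int).
  log_probs = jax.nn.log_softmax(logits, axis=-1)
  target_log_probs = jnp.take_along_axis(log_probs, targets[..., None], axis=-1)
  return -target_log_probs.mean()


def unnormalized_logit_objective(logits, targets):
  """Maximizes the raw logit of the target, with no normalization across the vocabulary."""
  target_logits = jnp.take_along_axis(logits, targets[..., None], axis=-1)
  return -target_logits.mean()
```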

On the other hand, in the actual compression code, I do see that you take the exp of the trained Transformer's outputs (interpreting them as log-probabilities) and normalize them across the vocabulary to build the corresponding CDF.
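
For reference, a small sketch (with assumed shapes and names, not the repository's exact code) of that normalization step: exponentiate the per-token outputs, normalize them into a probability distribution over the vocabulary, and accumulate them into the CDF consumed by the arithmetic coder.

```python
import jax.numpy as jnp


def outputs_to_cdf(log_outputs):
  """Turns one step's outputs of shape [vocab_size] into a CDF of length vocab_size + 1."""
  # Interpret the outputs as (unnormalized) log-probabilities: exponentiate
  # with a max-shift for numerical stability, then normalize across the vocab.
  probs = jnp.exp(log_outputs - log_outputs.max())
  probs = probs / probs.sum()
  # The cumulative sum defines the interval boundaries used by arithmetic coding.
  return jnp.concatenate([jnp.zeros((1,)), jnp.cumsum(probs)])
```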

Could you please help me resolve this conflict?

anianruoss commented 12 months ago

Yes, thank you for raising this issue. We use the log-loss in the paper but forgot to add it to the open-sourced code. This is now fixed.