Hi @anianruoss, great work and thanks for sharing the code to reproduce the experiments!
I am having trouble understanding why the Transformer models are not trained with standard softmax cross-entropy loss (i.e., with normalization across the vocabulary), but are instead trained to maximize the unnormalized logit of the observed/correct tokens. That objective does not enforce a high (vocabulary-normalized) likelihood for the correct tokens, since the loss can also be reduced simply by letting the average unnormalized logit drift upward. The paper also describes the loss as: “However, Eq. (2) is exactly the same objective used to train current foundation models, i.e., the log-loss” (Section 2), but current foundation models (autoregressive LMs) do use standard softmax cross entropy, unless I am grossly mistaken (example).
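To make sure we are talking about the same distinction, here is a minimal sketch (toy logits and my own variable names, not taken from your code) of the two objectives I have in mind:

```python
import jax
import jax.numpy as jnp

# Hypothetical logits for 2 positions over a vocabulary of 4 tokens.
logits = jnp.array([[2.0, 1.0, 0.5, -1.0],
                    [0.3, 4.0, -2.0, 1.5]])
targets = jnp.array([0, 1])  # observed/correct tokens

# (a) Standard softmax cross entropy: normalize across the vocabulary first.
log_probs = jax.nn.log_softmax(logits, axis=-1)
ce_loss = -jnp.mean(jnp.take_along_axis(log_probs, targets[:, None], axis=-1))

# (b) Maximizing the raw (unnormalized) logit of the correct token.
raw_loss = -jnp.mean(jnp.take_along_axis(logits, targets[:, None], axis=-1))

# (b) can be driven down by shifting *all* logits upward, whereas (a) only
# improves when the correct token gains probability mass relative to the
# rest of the vocabulary.
```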
On the other hand, in the actual compression code I do see that you take the exp of the trained Transformer's outputs (which interprets them as implicit log-probabilities) and normalize them across the vocabulary into the corresponding CDF.
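Roughly, this is what I understand the compression path to be doing (a paraphrase under my assumptions; the function name and details below are mine, not from the repo):

```python
import numpy as np

def outputs_to_cdf(outputs: np.ndarray) -> np.ndarray:
    """Interpret model outputs as log-probabilities, normalize, and build a CDF."""
    pdf = np.exp(outputs)        # exp of the (implicitly log) outputs
    pdf = pdf / pdf.sum()        # normalize across the vocabulary
    return np.insert(np.cumsum(pdf), 0, 0.0)  # CDF with a leading 0 boundary

# The arithmetic coder then narrows its interval to
# [cdf[token], cdf[token + 1]) for each observed token.
```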
Could you please help me resolve this conflict?