I don't know if this is the culprit, but I noticed that the tutorial and I both use bf16, and in bf16 the following two quantities don't agree:

```python
torch.einsum("bse,bse->bs", prob_dist, logits) - torch.sum(prob_dist * logits, dim=-1)
```

The difference is non-zero:
```
tensor([[ 0.0000,  0.1250, -0.1250,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.1250,  0.0000, ...0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000]], device='cuda:0',
       dtype=torch.bfloat16)
```
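For what it's worth, the mismatch reproduces outside of TRL. A minimal sketch (the shapes and values are made up; only the dtype matters):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(2, 16, 32768).to(torch.bfloat16)
prob_dist = torch.nn.functional.softmax(logits, dim=-1)

# Mathematically identical reductions, but the accumulation/rounding differs in bf16.
a = torch.einsum("bse,bse->bs", prob_dist, logits)
b = torch.sum(prob_dist * logits, dim=-1)
print((a - b).abs().max())  # typically non-zero in bf16

# After upcasting to float32, the two agree to within float32 rounding.
a32 = torch.einsum("bse,bse->bs", prob_dist.float(), logits.float())
b32 = torch.sum(prob_dist.float() * logits.float(), dim=-1)
print((a32 - b32).abs().max())
```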
Following this previous PR, it might be worthwhile to consider upcasting the tensors before computing logged quantities.
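Concretely, the kind of upcast I have in mind looks something like this (the function name and the logsumexp-based formula are my assumptions about roughly how the logged entropy is computed, not a copy of the trainer code):

```python
import torch

def mean_entropy_for_logging(logits: torch.Tensor) -> torch.Tensor:
    # Upcast once; every statistic derived from these stays in float32.
    logits = logits.float()
    prob_dist = torch.nn.functional.softmax(logits, dim=-1)
    # H(p) = logsumexp(logits) - sum_i p_i * logits_i, which is >= 0 exactly
    # and stays non-negative to within float32 rounding.
    entropy = torch.logsumexp(logits, dim=-1) - torch.sum(prob_dist * logits, dim=-1)
    return entropy.mean()
```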
But I don't know if this explains how the entropy is becoming negative...
On another PPOv2 run, I again observe negative entropy:
### System Info

`transformers` version: 4.44.0

### Information

### Tasks

- An officially supported task in the `examples` folder

### Reproduction
In TRL's PPOv2Trainer TLDR example, run the default command:
### Expected behavior
Entropy for a discrete distribution (such as a language model's next-token distribution) must be non-negative, since H(p) = -sum_i p_i log(p_i) and every term -p_i log(p_i) is >= 0 when 0 <= p_i <= 1. However, when I run the official example, the entropy can be negative:
I don't think I'm making a mistake because this negative entropy also appears in the official documentation. Specifically, look early in training, at maybe 20k episodes:
The documentation describes `objective/entropy` as "The mean entropy of the policy, indicating the randomness of the actions chosen by the policy." If this is incorrect and some other quantity is computed instead, then perhaps the documentation needs to be updated?