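The current DPO reward model returns a hardcoded value of -11 for empty or very short completions. The value is a log-probability: exp(-11) ≈ 1.67e-5, and -11 was chosen as the nearest integer whose exponential is below the uniform probability across all logits (1/50257 ≈ 2e-5). See #138 for reference.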
...
# Check if completion is empty or too short
if completion.strip() == '' or len(completion) <= 5:
    return -11 # exp(-11)=1.67e-5 < 2e-5=1/50257 (typical vocab size)
...
The vocab size of 50257 is assumed to be typical, but it can differ for other models / tokenizers.
Ideally, this value would be calculated automatically from the model, e.g. from 1 / model.vocab_size (in log space, since the return value is a log-probability), rather than being a hard-coded number.
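A minimal sketch of how the penalty could be derived instead of hardcoded. The function name `empty_completion_penalty` is hypothetical and not part of openvalidators, and how `vocab_size` is obtained (for example from the model config or the tokenizer length) is left to the caller as an assumption.

```python
import math


def empty_completion_penalty(vocab_size: int) -> int:
    """Return the largest integer log-probability below log(1 / vocab_size).

    Hypothetical helper: for vocab_size=50257 this reproduces the
    currently hardcoded -11.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    # log(1 / vocab_size) == -log(vocab_size); flooring rounds toward
    # negative infinity, so exp(result) stays below the uniform probability.
    return math.floor(-math.log(vocab_size))


if __name__ == "__main__":
    for size in (50257, 32000, 100):
        penalty = empty_completion_penalty(size)
        print(size, penalty, math.exp(penalty) < 1 / size)
```

Flooring keeps the integer convention of the current code. If an integer is not required, returning `-math.log(vocab_size)` directly would give the uniform log-probability itself.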
The hardcoded return quoted above is in openvalidators/reward/dpo.py.