Adjust vocab size calculation of DPO model to be dynamic

The current DPO model returns a hardcoded value of -11 (a log-probability, i.e. a probability of exp(-11)) for empty or very short completions. It was chosen as the nearest integer below the log of the uniform probability across all logits, ln(1/50257) ≈ -10.82. See #138 for reference.

Relevant code from openvalidators/reward/dpo.py:

...
        # Check if completion is 
        if completion.strip() == '' or len(completion) <= 5:
            return -11 # exp(-11)=1.67e-5 < 2e-5=1/50257 (typical vocab size)
...

The value 50257 is used as a typical vocab size, but it can differ for other models / tokenizers. Ideally, this value would be calculated automatically from the model, e.g. from 1 / model.vocab_size, rather than hard-coded.
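
A minimal sketch of how the penalty could be derived from the vocab size instead of being hard-coded. The helper name `empty_completion_penalty` is hypothetical and not part of openvalidators, and the sketch assumes the caller can supply the vocab size (for example from the model config or tokenizer):

```python
import math


def empty_completion_penalty(vocab_size: int) -> int:
    """Return the largest integer log-probability at or below ln(1 / vocab_size).

    Hypothetical helper for illustration; it mirrors the reasoning behind
    the hard-coded -11 in dpo.py but takes the vocab size as an argument.
    """
    if vocab_size <= 0:
        raise ValueError("vocab_size must be a positive integer")
    # ln(1 / V) == -ln(V); flooring rounds down to the next integer.
    return math.floor(-math.log(vocab_size))


# With the vocab size currently assumed in dpo.py, the result matches
# the hard-coded constant.
penalty = empty_completion_penalty(50257)
print(penalty)
```

For a vocab size of 50257 this reproduces the current constant, since ln(1/50257) ≈ -10.82 floors to -11. How the vocab size is obtained in dpo.py (model config versus tokenizer length) is left open here.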

opentensor / validators #141