simon900314 opened this issue 3 months ago
Hi, thanks for the issue!
We use this function to tokenize the different representations; it adds bos/eos automatically if `add_special_tokens` is True.
However, for the base model, I just checked and the default is False if this parameter is not explicitly given in the model config, so I believe you are right!
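For context, here is a minimal sketch of how `add_special_tokens` changes tokenizer output, assuming a Llama-style tokenizer whose config prepends a bos token (the checkpoint name is a placeholder, not a real model):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; any Llama-style tokenizer behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-base-model")

# With add_special_tokens=True the tokenizer inserts whatever special
# tokens its config requests (e.g. a leading bos); with False it
# returns the raw token ids only.
with_specials = tokenizer("The capital of France is", add_special_tokens=True)["input_ids"]
without = tokenizer("The capital of France is", add_special_tokens=False)["input_ids"]
print(with_specials)  # e.g. [bos_id, ...]
print(without)        # e.g. [...], no bos/eos added
```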
cc @NathanHB for a double check
It seems you are right, yes. I will double check on my side. Thanks for the catch!
So when using `BaseModel()` to load an HF-format model, do I need to delete the `[:-1]` in `request.tokenized_context + request.tokenized_continuation[:-1] for request in batch`?
I also discovered this problem because the last token of the original choice is missing from the `prepared_batch` fed into the model. But when I deleted `[:-1]` and ran exactly the same evaluation, I didn't get the expected results: the accuracy curve stays at random guessing instead of trending upward as before.
Is there anything else I should modify? Thank you!
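For reference, a minimal sketch of the standard teacher-forcing loglikelihood computation may explain why removing `[:-1]` alone breaks things: the logits at input position i predict token i+1, so the last continuation token never needs to be fed to the model, and the downstream slicing that extracts continuation logprobs assumes that shorter input. Names below are illustrative, not lighteval's actual code:

```python
import torch

def continuation_logprob(model, context_ids, continuation_ids):
    """Sum of logprobs of continuation_ids given context_ids for an
    HF causal LM (illustrative sketch, not lighteval's actual code)."""
    # Feed context + continuation minus its last token: the logits at
    # position i predict token i+1, so the final input token already
    # yields the logprob of the last continuation token.
    input_ids = torch.tensor([context_ids + continuation_ids[:-1]])
    with torch.no_grad():
        logits = model(input_ids).logits[0]          # (seq_len, vocab)
    logprobs = torch.log_softmax(logits, dim=-1)
    # Positions C-1 .. C+K-2 predict the K continuation tokens.
    start = len(context_ids) - 1
    targets = torch.tensor(continuation_ids).unsqueeze(1)
    return logprobs[start:start + len(continuation_ids)].gather(1, targets).sum()
```

If you only delete `[:-1]` without also shifting the positions used to read the continuation logprobs, every scored token is off by one, which would produce exactly the random-guess accuracy you observed.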
Hi,
In:
https://github.com/huggingface/lighteval/blob/a98210fd3a2d1e8bface1c32b72ebd5017173a4c/src/lighteval/models/base_model.py#L842
```python
if single_token:
    inputs = [request.tokenized_context for request in batch]
else:
    inputs = [
        request.tokenized_context + request.tokenized_continuation[:-1]
        for request in batch
    ]  # The last token (an eos) doesn't need to be given to the model
```
However, I cannot find where we strictly enforce that `tokenized_continuation` ends with an eos token.
If that assumption doesn't hold, then the eval results are not correct, especially for short tokenized continuations.
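To make the concern concrete, here is the token arithmetic for a one-token continuation with no trailing eos (token ids are hypothetical):

```python
context = [12, 34, 56]   # hypothetical tokenized context
continuation = [99]      # one-token answer, no trailing eos appended

inputs = context + continuation[:-1]   # -> [12, 34, 56]
# Token 99 is never part of the model input; its logprob must be read
# off the logits at the last context position. If the code instead
# assumed a trailing eos was dropped, the scored token would be wrong.
```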