huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Handling last token in tokenized_continuation #203

Open simon900314 opened 3 months ago

simon900314 commented 3 months ago

Hi,

In:

https://github.com/huggingface/lighteval/blob/a98210fd3a2d1e8bface1c32b72ebd5017173a4c/src/lighteval/models/base_model.py#L842

    if single_token:
        inputs = [request.tokenized_context for request in batch]
    else:
        inputs = [
            request.tokenized_context + request.tokenized_continuation[:-1]
            for request in batch
        ]  # The last token (an eos) doesn't need to be given to the model

However, I cannot find where we strictly enforce that "tokenized_continuation" ends with an eos token.

If that is the case, then the eval results are not correct, especially for short tokenized_continuation sequences.

clefourrier commented 2 months ago

Hi, thanks for the issue!

We use this function to tokenize the different representations, which will add bos/eos automatically if add_special_tokens is True.

However, for the base model, I just checked and the default is False if this parameter is not explicitly given in the model config, so I believe you are right!
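
A quick way to see what a given tokenizer actually does here is to compare the two settings directly. A minimal sketch, assuming a Hugging Face tokenizer; gpt2 is only an example checkpoint, and which special tokens get added (bos, eos, both, or none) depends entirely on the tokenizer's configuration:

```python
from transformers import AutoTokenizer

# gpt2 is only an example checkpoint; substitute the model you evaluate with.
tok = AutoTokenizer.from_pretrained("gpt2")

text = "The capital of France is Paris"
with_special = tok(text, add_special_tokens=True).input_ids
without_special = tok(text, add_special_tokens=False).input_ids

# If the two lists are identical, this tokenizer adds neither bos nor eos,
# which is exactly the case where the [:-1] slicing strips a real token.
print(with_special)
print(without_special)
print(tok.eos_token, tok.eos_token_id)
```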

cc @NathanHB for a double check

NathanHB commented 2 months ago

It seems you are right, yes. I will double check on my side. Thanks for the catch!

JefferyChen453 commented 3 weeks ago

> Hi, thanks for the issue!
>
> We use this function to tokenize the different representations, which will add bos/eos automatically if add_special_tokens is True.
>
> However, for the base model, I just checked and the default is False if this parameter is not explicitly given in the model config, so I believe you are right!
>
> cc @NathanHB for a double check

So when using BaseModel() to load an HF-format model, do I need to delete the `[:-1]` in `request.tokenized_context + request.tokenized_continuation[:-1] for request in batch`?

I also discovered this problem because the last token of the original choice is missing from the `prepared_batch` fed into the model. But when I deleted the `[:-1]` and ran exactly the same evaluation, I didn't get the expected results: the accuracy curve stays at random-guess level instead of trending upward as before.
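
For reference, a toy illustration of what I mean by the missing last token; the token ids below are made up and only show the effect of the slicing on a short continuation with no trailing eos:

```python
# Made-up token ids, purely to illustrate the slicing quoted from base_model.py.
tokenized_context = [11, 22, 33]        # the prompt
tokenized_continuation = [44, 55]       # a two-token choice, no trailing eos

with_slice = tokenized_context + tokenized_continuation[:-1]
# -> [11, 22, 33, 44]  : the last choice token (55) never reaches the model

without_slice = tokenized_context + tokenized_continuation
# -> [11, 22, 33, 44, 55]

print(with_slice)
print(without_slice)
```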

Is there anything else I should modify? Thank you!