Closed by shan18 2 months ago
See https://github.com/jzhang38/EasyContext/issues/8
In short, an answer is only counted as correct when the argmax of the output logits at every token position in the answer span matches the corresponding answer token.
(Imagine the model predicts the first answer token wrong but the remaining tokens come out correct because of teacher forcing: the answer is still counted as incorrect.)
This is called PPL-based eval and is used to save memory and latency, since only one forward pass is needed.
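For concreteness, here is a minimal sketch of that scoring scheme (not the repo's actual code; it assumes a Hugging Face-style causal LM whose output exposes .logits, and 1-D token tensors prompt_ids and answer_ids):

```python
import torch

@torch.no_grad()
def ppl_style_eval(model, prompt_ids, answer_ids):
    # Append the gold answer tokens to the prompt (teacher forcing),
    # so a single forward pass scores every answer position at once.
    input_ids = torch.cat([prompt_ids, answer_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits.squeeze(0)

    # Logits at position i predict the token at position i + 1, so the
    # answer span is predicted by the positions just before it.
    start = prompt_ids.shape[-1] - 1
    preds = logits[start : start + answer_ids.shape[-1]].argmax(dim=-1)

    # Counted as correct only if *every* predicted token matches.
    return bool((preds == answer_ids).all())
```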
I see. Thanks a lot for the explanation.
Hi,
In eval_needle.py, I see that the answer_ids are being appended to the input prompt: https://github.com/jzhang38/EasyContext/blob/d6a7f2d74b08fc8049ec4a8146ef245051a669e3/eval_needle.py#L40
Could you please help me understand why this was implemented this way?
Wouldn't that make the model generate output in teacher-forcing mode instead of doing autoregressive decoding?
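(For contrast, the autoregressive alternative hinted at here would look roughly like the sketch below; this is an illustration, not EasyContext code, and assumes the same Hugging Face-style model and token tensors as in the earlier sketch. It needs one forward pass per generated token, which is exactly the cost the PPL-based eval avoids.)

```python
# Autoregressive alternative (sketch): greedily decode token by token
# and compare the generated span against the gold answer. Each new
# token costs an extra forward pass over the long context.
output_ids = model.generate(
    prompt_ids.unsqueeze(0),
    max_new_tokens=answer_ids.shape[-1],
    do_sample=False,  # greedy decoding
)
generated = output_ids[0, prompt_ids.shape[-1]:]
is_correct = bool((generated == answer_ids).all())
```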