Open vipulraheja opened 2 months ago
IIRC, the harness simply does not check whether the context fits within the model's max length (the few-shot context is built here and used there); only the gold prediction must fit within the max length.
We have decided to print a warning when the context is too long for the max length, since the model is then likely to run into non-trivial issues. However, the bugs you're getting are not normal; I'll look into them.
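The check described above can be sketched as a small helper. This is a hypothetical illustration, not lighteval's actual code: the function name `check_context_fits` and its parameters are assumptions made for the example, showing the idea of warning when the few-shot context (plus any generation budget) exceeds the model's max length instead of silently truncating.

```python
import warnings

def check_context_fits(prompt_token_ids, max_length, generation_size=0):
    """Warn when a tokenized prompt will not fit in the model's context window.

    Hypothetical helper mirroring the behavior described in the thread: the
    harness only guarantees the gold prediction fits, so an oversized few-shot
    context should at least trigger a warning.
    """
    total = len(prompt_token_ids) + generation_size
    if total > max_length:
        warnings.warn(
            f"Context ({total} tokens) exceeds the model's max length "
            f"({max_length}); outputs may be truncated or unreliable."
        )
        return False
    return True
```

For example, a 9000-token prompt against an 8192-token window would trigger the warning and return `False`, which is roughly the situation reported below for DROP 3-shot.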
Evaluating Llama-3-8B on DROP with the standard configuration (3-shot, as reported for Llama 3) throws a warning suggesting that the input size is greater than the maximum context size allowed by the model:
Here is the command I use:
I am able to reproduce this even after progressively reducing the batch size to 1.
Log:
The process then either stays stuck indefinitely until manually killed, or crashes as follows:
Note: the following traceback happened even after reducing the batch size to 1 with `--override_batch_size`.

Running the same evaluation directly in `lm-evaluation-harness` does not throw any such warning and proceeds at a reasonable speed.

Env:
- transformers version: 4.39.3
- Platform: Ubuntu 20.04.6 LTS
- Python version: 3.11.9
- Huggingface_hub version: 0.22.2
- Safetensors version: 0.4.2
- Accelerate version: 0.29.2
- Lighteval version: 0.4.0.dev0