Open vipulraheja opened 2 months ago
IIRC, the harness simply does not check whether the context fits within the model's max length (the few-shot context is built here and used there); only the gold prediction must fit within the max length.
We have decided to print a warning when the context is too long for the max length, since the model is then likely to run into non-trivial issues. However, the bugs you're getting are not normal; I'll look into them.
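The check described above can be sketched as a small helper. This is a hypothetical illustration, not lighteval's actual code: the function name `check_context_fits` and its parameters are assumptions made for the example, showing the idea of warning when the few-shot context (plus any generation budget) exceeds the model's max length instead of silently truncating.

```python
import warnings

def check_context_fits(prompt_token_ids, max_length, generation_size=0):
    """Warn when a tokenized prompt will not fit in the model's context window.

    Hypothetical helper mirroring the behavior described in the thread: the
    harness only guarantees the gold prediction fits, so an oversized few-shot
    context should at least trigger a warning.
    """
    total = len(prompt_token_ids) + generation_size
    if total > max_length:
        warnings.warn(
            f"Context ({total} tokens) exceeds the model's max length "
            f"({max_length}); outputs may be truncated or unreliable."
        )
        return False
    return True
```

For example, a 9000-token prompt against an 8192-token window would trigger the warning and return `False`, which is roughly the situation reported below for DROP 3-shot.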
Evaluating Llama-3-8B on DROP with the standard configuration (3-shot, as reported for Llama 3) throws a warning suggesting that the input size is greater than the maximum context size allowed by the model:
Here is the command I use:
I am able to reproduce this even after progressively reducing the batch size to 1.
Log:
The process then either stays stuck indefinitely until manually killed, or crashes as follows:
Note: the following traceback happened even after reducing the batch size to 1 with `--override_batch_size`.

Running the same evaluation directly in `lm-evaluation-harness` does not throw any such warning and proceeds at a reasonable speed.

Env:
- transformers version: 4.39.3
- Platform: Ubuntu 20.04.6 LTS
- Python version: 3.11.9
- Huggingface_hub version: 0.22.2
- Safetensors version: 0.4.2
- Accelerate version: 0.29.2
- Lighteval version: 0.4.0.dev0