
lm-evaluation-harness broken on master #3292

Open pcmoritz opened 8 months ago

pcmoritz commented 8 months ago

Since https://github.com/vllm-project/vllm/pull/3065 was merged, the eval suite https://github.com/EleutherAI/lm-evaluation-harness has been broken.

Repro (this should be run on 2 A100s or H100s to make sure the Mixtral model fits into GPU memory):

# First install vllm from master via https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source

# Then clone and install https://github.com/EleutherAI/lm-evaluation-harness
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

# Now run the evaluation harness
lm_eval --model vllm --model_args pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=2 --tasks mmlu --num_fewshot 5
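The same run can also be launched from Python through the harness's simple_evaluate entry point (the function visible in the traceback below). A sketch, assuming a recent lm-eval that re-exports it at the package top level:

# Python equivalent of the CLI invocation above; simple_evaluate is the
# same entry point that appears in the traceback below.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=2",
    tasks=["mmlu"],
    num_fewshot=5,
)

Both routes go through the same loglikelihood code path.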

This fails with

  File "/home/ray/anaconda3/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/ray/default/lm-evaluation-harness/lm_eval/__main__.py", line 318, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/ray/default/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/ray/default/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate
    results = evaluate(
  File "/home/ray/default/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/ray/default/lm-evaluation-harness/lm_eval/evaluator.py", line 368, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/ray/default/lm-evaluation-harness/lm_eval/api/model.py", line 321, in loglikelihood
    return self._loglikelihood_tokens(new_reqs)
  File "/home/ray/default/lm-evaluation-harness/lm_eval/models/vllm_causallms.py", line 379, in _loglikelihood_tokens
    answer = self._parse_logprobs(
  File "/home/ray/default/lm-evaluation-harness/lm_eval/models/vllm_causallms.py", line 416, in _parse_logprobs
    continuation_logprobs = sum(
TypeError: unsupported operand type(s) for +: 'int' and 'Logprob'
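The TypeError arises because https://github.com/vllm-project/vllm/pull/3065 changed the values in vLLM's per-token logprob dictionaries from plain floats to Logprob objects, which carry the float in a .logprob attribute, so summing the dictionary values directly no longer works. A minimal sketch of the kind of shim a caller needs (illustrative names, not the harness's actual code):

# Accept both the old API (plain float values) and the new one
# (Logprob objects exposing the float as .logprob).
def logprob_to_float(value):
    return getattr(value, "logprob", value)

# Illustrative use when summing a continuation's logprobs:
# continuation_logprobs = sum(
#     logprob_to_float(logprob_dict[token])
#     for token, logprob_dict in zip(continuation_tokens, per_token_logprobs)
# )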

The API breakage is fixed in https://github.com/EleutherAI/lm-evaluation-harness/pull/1549, but even with that fix the harness is extremely slow (about 40x slower than before), so it is not really feasible to run:

Running loglikelihood requests:   0%|                  [...]               | 32/56168 [22:52<668:47:47, 42.89s/it]

Being able to run the evaluation harness in a timely manner is crucial so we can ensure model performance doesn't degrade.

baberabb commented 8 months ago

I think this is because, without a specified batch size, the harness defaults to a batch size of 1. It should be fixed if you use --batch_size auto, which lets the harness take advantage of vLLM's continuous batching.
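Applied to the repro command above, that is:

lm_eval --model vllm --model_args pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=2 --tasks mmlu --num_fewshot 5 --batch_size auto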

Sshubam commented 1 month ago

@pcmoritz did you solve this? I'm facing a similar issue.