Open ducanh-ho2296 opened 2 months ago
I have the same question as you when using lm-eval to evaluate LLMs. Have you solved it yet? My command and the error are as follows:
lm-eval --model vllm --model_args pretrained=/home/T3090U1/CZ/model/Qwen1.5-7B-Chat/,dtype=auto,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,max_model_len=4096 --tasks=leaderboard --batch_size=auto --output_path=/home/T3090U1/CZ/work3/output
error:
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/T3090U1/anaconda3/envs/work3/bin/lm-eval", line 8, in <module>
[rank0]:     sys.exit(cli_evaluate())
[rank0]:   File "/home/T3090U1/CZ/work3/lm_eval/__main__.py", line 369, in cli_evaluate
[rank0]:     results = evaluator.simple_evaluate(
[rank0]:   File "/home/T3090U1/CZ/work3/lm_eval/utils.py", line 395, in _wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/T3090U1/CZ/work3/lm_eval/evaluator.py", line 277, in simple_evaluate
[rank0]:     results = evaluate(
[rank0]:   File "/home/T3090U1/CZ/work3/lm_eval/utils.py", line 395, in _wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/T3090U1/CZ/work3/lm_eval/evaluator.py", line 444, in evaluate
[rank0]:     resps = getattr(lm, reqtype)(cloned_reqs)
[rank0]:   File "/home/T3090U1/CZ/work3/lm_eval/api/model.py", line 370, in loglikelihood
[rank0]:     return self._loglikelihood_tokens(new_reqs, disable_tqdm=disable_tqdm)
[rank0]:   File "/home/T3090U1/CZ/work3/lm_eval/models/vllm_causallms.py", line 415, in _loglikelihood_tokens
[rank0]:     outputs = self._model_generate(requests=inputs, generate=False)
[rank0]:   File "/home/T3090U1/CZ/work3/lm_eval/models/vllm_causallms.py", line 248, in _model_generate
[rank0]:     outputs = self.model.generate(
[rank0]:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/utils.py", line 1036, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 348, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 715, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 1223, in step
[rank0]:     outputs = self.model_executor.execute_model(
[rank0]:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 78, in execute_model
[rank0]:     driver_outputs = self._driver_execute_model(execute_model_req)
[rank0]:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 162, in _driver_execute_model
[rank0]:     return self.driver_worker.execute_model(execute_model_req)
[rank0]:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
[rank0]:     output = self.model_runner.execute_model(
[rank0]:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/worker/model_runner_base.py", line 125, in _wrapper
[rank0]:     pickle.dump(dumped_inputs, filep)
[rank0]:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 563, in __reduce__
[rank0]:     raise RuntimeError("LLMEngine should not be pickled!")
[rank0]: RuntimeError: LLMEngine should not be pickled!
```
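Side note on reading that traceback: the last two frames suggest the pickling error is secondary. `execute_model` apparently raised first, vLLM's dump-inputs-on-error wrapper in `model_runner_base.py` then tried to pickle the failing inputs, and that pickle hit `LLMEngine.__reduce__`, which deliberately refuses to be pickled. A rough paraphrase of that interaction (not the actual vLLM source) is:

```python
# Rough paraphrase of the last frames above, NOT the actual vLLM source:
# the wrapper only pickles inputs because something else already failed,
# so "LLMEngine should not be pickled!" most likely hides the real error.
import functools
import pickle


def dump_inputs_on_error(fn):
    @functools.wraps(fn)
    def _wrapper(self, *args, **kwargs):
        try:
            return fn(self, *args, **kwargs)
        except Exception:
            # Corresponds to pickle.dump(dumped_inputs, filep) in model_runner_base.py:
            # the dumped objects (indirectly) reference the engine, so pickling fails.
            with open("/tmp/vllm_input_dump.pkl", "wb") as filep:
                pickle.dump((args, kwargs), filep)
            raise
    return _wrapper


class LLMEngine:
    # Mirrors llm_engine.py line 563 in the traceback: the engine blocks pickling.
    def __reduce__(self):
        raise RuntimeError("LLMEngine should not be pickled!")
```

If that reading is right, the underlying failure happens somewhere inside `execute_model` with `tensor_parallel_size=2`, and the pickle guard just obscures it.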
Sorry, but there is still no answer to this question.
My current environment
Model Input Dumps
No response
🐛 Describe the bug
I'm serving a LoRA adapter dynamically. The model named
meta-llama/Meta-Llama-3.1-8B-Instruct
is now running in a Kubernetes Pod with a single A100 GPU. After that, I used the lm-evaluation-harness framework
https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#model-apis-and-inference-servers
to benchmark the model.

OUTPUT:
I would like to know whether this is a bug in vLLM where requests are not queued and end up overloading the vLLM server, or whether the error comes from somewhere else. Could anyone help me with this case? Thank you very much in advance!
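For reference, the "Model APIs and Inference Servers" section of the harness README linked above drives an already-running OpenAI-compatible server through its local-completions backend. Below is a minimal sketch of such a run via the harness's Python API; the endpoint URL, task, and concurrency settings are illustrative assumptions, not the exact invocation that produced the error above.

```python
# Illustrative sketch only: benchmarking an already-running, OpenAI-compatible
# vLLM server with lm-evaluation-harness. The base_url, task, and concurrency
# values are assumptions; adjust them to the actual deployment.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=meta-llama/Meta-Llama-3.1-8B-Instruct,"
        "base_url=http://localhost:8000/v1/completions,"
        "num_concurrent=1,"   # bounds how many requests hit the server at once
        "max_retries=3"
    ),
    tasks=["gsm8k"],
)
print(results["results"])
```

If overload is the concern, lowering num_concurrent (and raising max_retries) on the harness side is one way to bound the request rate while investigating whether the error originates in vLLM's request handling.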