Closed · ggbetz closed this issue 4 days ago
I got:

Please use Flashinfer backend for models with logits_soft_cap (i.e., Gemma-2).
Otherwise, the output might be wrong. Set Flashinfer backend by export
VLLM_ATTENTION_BACKEND=FLASHINFER. (type=value_error)
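The fix the error message asks for can be sketched as a shell snippet (the variable name and value are taken directly from the message; the verification step is just an illustration):

```shell
# Select the FlashInfer attention backend before launching vLLM, as the
# error message instructs (required for models with logits_soft_cap,
# such as Gemma-2):
export VLLM_ATTENTION_BACKEND=FLASHINFER

# Confirm the variable is set in the current session:
echo "$VLLM_ATTENTION_BACKEND"
```

Note that the variable must be exported in the same environment (e.g., the same container shell) that launches the vLLM process.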
We might consider re-running the evals for Gemma 1.
I've added FlashInfer to our Docker container, but I still get an error when trying to run and evaluate Gemma 2:
INFO 08-01 10:22:08 selector.py:79] Using Flashinfer backend.
WARNING 08-01 10:22:08 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
INFO 08-01 10:22:08 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 08-01 10:23:11 model_runner.py:255] Loading model weights took 4.9975 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/bin/cot-eval", line 8, in <module>
[rank0]: sys.exit(main())
[rank0]: File "/workspace/cot-eval/src/cot_eval/__main__.py", line 149, in main
[rank0]: llm = VLLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/pydantic/v1/main.py", line 339, in __init__
[rank0]: values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/pydantic/v1/main.py", line 1050, in validate_model
[rank0]: input_data = validator(cls_, input_data)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/langchain_core/utils/pydantic.py", line 146, in wrapper
[rank0]: return func(cls, values)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/langchain_community/llms/vllm.py", line 89, in validate_environment
[rank0]: values["client"] = VLLModel(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 149, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 414, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 353, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 76, in determine_num_available_blocks
[rank0]: return self.driver_worker.determine_num_available_blocks()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 173, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 874, in profile_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1221, in execute_model
[rank0]: model_input.attn_metadata.begin_forward()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flashinfer.py", line 132, in begin_forward
[rank0]: self.prefill_wrapper.begin_forward(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 791, in begin_forward
[rank0]: self._wrapper.begin_forward(
[rank0]: RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257
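The failing check compares the length of `paged_kv_indptr` against `batch_size + 1`: FlashInfer expects a CSR-style offset array with one entry per sequence plus a leading zero, and here it received 1 entry where 257 (i.e., a batch of 256) were expected. A minimal illustration of that shape invariant (illustrative only, not vLLM or FlashInfer code; the helper name is hypothetical):

```python
def build_paged_kv_indptr(pages_per_seq):
    """Build a CSR-style offset array over the KV-cache pages of a batch.

    For a batch of N sequences the result has N + 1 entries: a leading 0
    followed by the cumulative page counts. This is the invariant behind
    the failed CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1).
    """
    indptr = [0]
    for num_pages in pages_per_seq:
        indptr.append(indptr[-1] + num_pages)
    return indptr

# For the batch size in the traceback (257 = 256 + 1):
batch_size = 256
indptr = build_paged_kv_indptr([1] * batch_size)
assert len(indptr) == batch_size + 1  # 257 entries, as the check expects
```

The traceback thus points at vLLM handing FlashInfer a degenerate (length-1) indptr during the profiling run, which is what the linked upstream issue and PR address.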
We'll probably have to wait for the next vLLM release. See:
https://github.com/flashinfer-ai/flashinfer/issues/362
https://github.com/vllm-project/vllm/pull/7008