liuxingbin opened 2 months ago
I changed the device in vllm/model_executor/model_loader/openvino.py to 'GPU'. It fails with:
[rank0]: RuntimeError: Exception from src/inference/src/cpp/core.cpp:104:
[rank0]: Exception from src/inference/src/dev/plugin.cpp:53:
[rank0]: Exception from src/plugins/intel_gpu/src/plugin/program_builder.cpp:246:
[rank0]: Operation: PagedAttentionExtension_39914 of type PagedAttentionExtension(extension) is not supported
Hi @liuxingbin, Intel GPU support via OpenVINO is added in this PR: https://github.com/vllm-project/vllm/pull/8192. Please try it out.
Hi, I tried the PR, but a new error occurred. I used openvino-gpu to run qwen2-0.5b. It fails with:
Traceback (most recent call last):
File "/workspace/vllm/vllm/worker/openvino_worker.py", line 302, in determine_num_available_blocks
kv_cache_size = self.profile_run()
File "/workspace/vllm/vllm/worker/openvino_worker.py", line 549, in profile_run
model_profile_run()
File "/workspace/vllm/vllm/worker/openvino_worker.py", line 538, in model_profile_run
self.model_runner.execute_model(seqs,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/vllm/vllm/worker/openvino_model_runner.py", line 340, in execute_model
hidden_states = model_executable(**execute_model_kwargs)
File "/usr/local/lib/python3.10/dist-packages/nncf/torch/dynamic_graph/wrappers.py", line 146, in wrapped
return module_call(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/vllm/vllm/model_executor/model_loader/openvino.py", line 164, in forward
self.ov_request.wait()
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:245:
Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54:
Caught exception: Check '!exceed_allocatable_mem_size' failed at src/plugins/intel_gpu/src/runtime/ocl/ocl_engine.cpp:139:
[GPU] Exceeded max size of memory object allocation: requested 19914555392 bytes, but max alloc size supported by device is 1073741824 bytes. Please try to reduce batch size or use lower precision.
19914555392 bytes is about 18.5 GiB, which is strange. I tried a few workarounds, but none of them solved my problem. Any solution or hint?
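For reference, the arithmetic behind the error message can be checked directly (the two sizes below are copied from the traceback above):

```python
# Sanity-check the sizes reported in the [GPU] allocation error above.
requested = 19_914_555_392   # bytes the profile run tried to allocate
max_alloc = 1_073_741_824    # max single allocation the device supports (1 GiB)

print(f"requested: {requested / 2**30:.1f} GiB")        # -> 18.5 GiB
print(f"max alloc: {max_alloc / 2**30:.1f} GiB")        # -> 1.0 GiB
print(f"oversubscription: {requested / max_alloc:.1f}x")  # -> 18.5x
```

So the single requested buffer exceeds the device's per-allocation limit by roughly 18.5x, which is why the plugin rejects it regardless of total GPU memory.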
FYI: I use the GPU version of vLLM to run qwen2-1.5b, which uses ~8 GB according to nvidia-smi.
Hi @liuxingbin, can you share how you are running vLLM? Did you try setting a lower max_model_len value? We assume there is enough GPU memory to run max_model_len tokens at once, so if the model has a large max_model_len value, it can fail with an insufficient-GPU-memory error.
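To see why the maximum model length matters, a rough back-of-the-envelope KV-cache estimate helps. The sketch below uses hypothetical Qwen2-0.5B-like shape parameters (layer count, KV heads, head size, fp16 defaults are my assumptions, not values from this thread); the point is only that the profile-run allocation scales linearly with the context length:

```python
def kv_cache_bytes(max_model_len: int,
                   num_layers: int = 24,     # assumed layer count
                   num_kv_heads: int = 2,    # assumed GQA KV heads
                   head_dim: int = 64,       # assumed head size
                   dtype_bytes: int = 2) -> int:  # fp16
    """Rough per-sequence KV-cache size: two tensors (K and V) per layer,
    each of shape [num_kv_heads, max_model_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * max_model_len

# Halving the context length halves the KV-cache allocation.
print(kv_cache_bytes(32768))  # -> 402653184 (~0.375 GiB)
print(kv_cache_bytes(16384))  # -> 201326592
```

Lowering max_model_len therefore shrinks the profile-run allocation proportionally, which is why it is the first knob to try when the GPU rejects an oversized buffer.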
I changed the available GPU memory here, which solved my problem.
Hi, I successfully created the OpenVINO env. I am wondering how to use the Intel GPU?
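If your build includes the PR linked above, the OpenVINO backend selects its device through the VLLM_OPENVINO_DEVICE environment variable (device names follow OpenVINO's convention: CPU, GPU, GPU.0, ...). A minimal sketch, assuming that variable is supported in your build:

```shell
# Select the Intel GPU for vLLM's OpenVINO backend (assumes the
# VLLM_OPENVINO_DEVICE variable from PR #8192 is available in your build).
export VLLM_OPENVINO_DEVICE=GPU
# Then start vLLM as usual, e.g.:
# python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-0.5B-Instruct
```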