vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: How to use intel-gpu in openvino #7418

Open liuxingbin opened 2 months ago

liuxingbin commented 2 months ago

Anything you want to discuss about vllm.

Hi, I successfully created the OpenVINO environment. How can I use the Intel GPU with it?

liuxingbin commented 2 months ago

I changed the target device in vllm/model_executor/model_loader/openvino.py to 'GPU'.
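For context, a sketch of the kind of change involved (assuming the loader compiles the model through OpenVINO's `Core.compile_model`; the exact call site in openvino.py may differ):

```python
import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU'] once the Intel GPU driver is set up

# In vllm/model_executor/model_loader/openvino.py the model is compiled for a
# target device; the edit amounts to passing "GPU" instead of "CPU", roughly:
# compiled_model = core.compile_model(model, "GPU")
```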

It then fails with:

```
[rank0]: RuntimeError: Exception from src/inference/src/cpp/core.cpp:104:
[rank0]: Exception from src/inference/src/dev/plugin.cpp:53:
[rank0]: Exception from src/plugins/intel_gpu/src/plugin/program_builder.cpp:246:
[rank0]: Operation: PagedAttentionExtension_39914 of type PagedAttentionExtension(extension) is not supported
```

ilya-lavrenov commented 1 month ago

Hi @liuxingbin, Intel GPU support via OpenVINO was added in PR https://github.com/vllm-project/vllm/pull/8192. Please try it out.

liuxingbin commented 1 month ago

Hi, I tried the PR, but a new error occurred. I used the OpenVINO GPU backend to run Qwen2-0.5B, and it fails with:

```
Traceback (most recent call last):
  File "/workspace/vllm/vllm/worker/openvino_worker.py", line 302, in determine_num_available_blocks
    kv_cache_size = self.profile_run()
  File "/workspace/vllm/vllm/worker/openvino_worker.py", line 549, in profile_run
    model_profile_run()
  File "/workspace/vllm/vllm/worker/openvino_worker.py", line 538, in model_profile_run
    self.model_runner.execute_model(seqs,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/vllm/worker/openvino_model_runner.py", line 340, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nncf/torch/dynamic_graph/wrappers.py", line 146, in wrapped
    return module_call(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/vllm/model_executor/model_loader/openvino.py", line 164, in forward
    self.ov_request.wait()
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:245:
Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54:
Caught exception: Check '!exceed_allocatable_mem_size' failed at src/plugins/intel_gpu/src/runtime/ocl/ocl_engine.cpp:139:
[GPU] Exceeded max size of memory object allocation: requested 19914555392 bytes, but max alloc size supported by device is 1073741824 bytes.Please try to reduce batch size or use lower precision.
```

19914555392 bytes is about 18.5 GiB, which seems far too large for a 0.5B model. I tried a few workarounds, but none of them solved the problem. Any solution or hint?
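For reference, checking the numbers from the error message:

```python
requested = 19_914_555_392  # bytes requested for a single allocation
max_alloc = 1_073_741_824   # per-buffer limit reported by the device

print(requested / 2**30)  # ~18.55 -> about 18.5 GiB in one buffer
print(max_alloc / 2**30)  # 1.0   -> this device caps single allocations at 1 GiB
```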

FYI: with the CUDA build of vLLM I can run Qwen2-1.5B, which uses about 8 GB according to nvidia-smi.

sshlyapn commented 1 month ago

Hi @liuxingbin, can you share how you are running vLLM? Did you try setting a lower `max_model_len` value? We assume there is enough GPU memory to process `max_model_len` tokens at once, so if the model has a large default `max_model_len`, profiling can fail due to insufficient GPU memory.
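For example, something like this (a sketch; `max_model_len` is the standard vLLM engine argument, and the model name is just an example):

```python
from vllm import LLM, SamplingParams

# Cap the context length so the profiling run does not try to allocate
# KV cache / activations for the model's full default context at once.
llm = LLM(model="Qwen/Qwen2-0.5B-Instruct", max_model_len=2048)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```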

liuxingbin commented 1 month ago

I changed the available GPU memory here, which solved my problem.
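For readers hitting the same error, a hedged sketch of capping the OpenVINO backend's memory use via environment variables (`VLLM_OPENVINO_KVCACHE_SPACE` and `VLLM_OPENVINO_DEVICE` appear in vLLM's OpenVINO docs; the exact code location the commenter edited is not shown in this thread):

```python
import os

# Assumption: these variables come from vLLM's OpenVINO backend documentation;
# they may not match the exact setting the commenter changed.
os.environ["VLLM_OPENVINO_DEVICE"] = "GPU"       # target the Intel GPU plugin
os.environ["VLLM_OPENVINO_KVCACHE_SPACE"] = "8"  # KV cache budget, in GB

# ...then construct vllm.LLM (or launch the API server) as usual.
```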