wangshuai09 / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: How to build on Ascend NPU #1

Open wangshuai09 opened 1 week ago

wangshuai09 commented 1 week ago
  1. Pull the dev docker image:
    docker pull ascendai/pytorch:2.1.0-ubuntu22.04
  2. Start and enter the container:
    docker run -p 2022:22 --name test-vllm \
      --device /dev/davinci0 --device /dev/davinci_manager \
      --device /dev/devmm_svm --device /dev/hisi_hdc \
      -v /usr/local/dcmi:/usr/local/dcmi \
      -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
      -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
      -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
      -itd ascendai/pytorch:2.1.0-ubuntu22.04 bash
  3. Download the vllm project and switch to the NPU branch:
    yum install git
    pip uninstall torch_npu
    git clone https://github.com/wangshuai09/vllm
    cd vllm
    git checkout npu_support
  4. Install vllm:
    VLLM_TARGET_DEVICE=npu pip install -e .
  5. Test a model:
    python examples/offline_inference_npu.py
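Step 5 runs an offline-inference smoke test. If you want to script a similar check yourself, here is a minimal sketch assuming the standard vLLM Python API (`LLM`, `SamplingParams`); the import guard lets the snippet load even on a machine where the NPU build from step 4 is not installed:

```python
# Minimal offline-inference smoke test, a sketch of what
# examples/offline_inference_npu.py presumably exercises.
def build_prompts():
    # temperature=0 gives deterministic greedy decoding,
    # which keeps the smoke test reproducible.
    return ["San Francisco is a"], {"temperature": 0.0, "max_tokens": 20}

try:
    from vllm import LLM, SamplingParams  # requires the build from step 4
except ImportError:
    LLM = None  # vLLM not installed in this environment

def run_smoke_test():
    prompts, params = build_prompts()
    if LLM is None:
        return None  # skip gracefully when vLLM is unavailable
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, SamplingParams(**params))
    return [o.outputs[0].text for o in outputs]

if __name__ == "__main__":
    print(run_smoke_test())
```

If the install succeeded, the returned list should contain a short continuation of the prompt, similar to the server test later in this thread.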


beardog6 commented 6 days ago

Does this currently only support offline inference? The OpenAI server endpoint fails at startup:

INFO 09-12 10:08:00 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-12 10:08:00 selector.py:161] Using ASCEND_TORCH backend.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/root/miniconda3/envs/Python310/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/miniconda3/envs/Python310/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/vllm/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/vllm/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/vllm/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/vllm/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/vllm/vllm/engine/llm_engine.py", line 324, in __init__
    self.model_executor = executor_class(
  File "/vllm/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/vllm/vllm/executor/gpu_executor.py", line 38, in _init_executor
    self.driver_worker = self._create_worker()
  File "/vllm/vllm/executor/gpu_executor.py", line 105, in _create_worker
    return create_worker(**self._get_create_worker_kwargs(
  File "/vllm/vllm/executor/gpu_executor.py", line 24, in create_worker
    wrapper.init_worker(**kwargs)
  File "/vllm/vllm/worker/worker_base.py", line 449, in init_worker
    self.worker = worker_class(*args, **kwargs)
  File "/vllm/vllm/worker/worker.py", line 99, in __init__
    self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
  File "/vllm/vllm/worker/model_runner.py", line 888, in __init__
    self.attn_state = self.attn_backend.get_state_cls()(
  File "/vllm/vllm/attention/backends/abstract.py", line 43, in get_state_cls
    raise NotImplementedError
NotImplementedError
ERROR 09-12 10:08:02 api_server.py:188] RPCServer process died before responding to readiness probe

wangshuai09 commented 5 days ago

@beardog6 This is still in active development and these features have not been debugged yet. Collaboration is welcome; the development branch is npu_support.
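For context on the traceback above: `get_state_cls()` is declared on the abstract attention backend (`abstract.py`, line 43 in the trace) and raises `NotImplementedError` until a device backend overrides it, which is why the server path dies while building the model runner. A stripped-down illustration of the pattern (the class names here are hypothetical stand-ins, not vLLM's actual classes):

```python
# Sketch of the abstract-backend pattern behind the NotImplementedError.
class AttentionBackend:
    """Abstract base: declares the interface but provides no state class."""

    @staticmethod
    def get_state_cls():
        raise NotImplementedError  # mirrors abstract.py line 43 in the trace


class AscendTorchBackend(AttentionBackend):
    # get_state_cls() not overridden yet -> the server path fails here.
    pass


class AttentionState:
    """Hypothetical per-step state object the engine would instantiate."""


class FixedAscendTorchBackend(AttentionBackend):
    @staticmethod
    def get_state_cls():
        # Overriding the hook is what removes the NotImplementedError.
        return AttentionState
```

In other words, the offline path works because it never reaches this hook, while the OpenAI server path instantiates the state class during engine construction.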

wangshuai09 commented 1 day ago

@beardog6 Is this the scenario you tested?

# start server
vllm serve facebook/opt-125m

# request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "temperature": 0
    }'

# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live.  I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}
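The response above can also be unpacked programmatically. A small sketch that parses the exact JSON shown, pulling out the generated text and the token accounting (note `finish_reason` is "length" because the 20-token `max_tokens` budget was exhausted):

```python
import json

# The completion response returned by the server test above.
raw = '{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live.  I\'ve lived in San Francisco for a few years now and I\'ve","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}'

resp = json.loads(raw)
text = resp["choices"][0]["text"]       # the generated continuation
usage = resp["usage"]                   # prompt/completion/total token counts

print(text)
print(usage)
```

Checking `usage` is a quick way to confirm the server really decoded on-device: prompt_tokens + completion_tokens must equal total_tokens (5 + 20 = 25 here).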
beardog6 commented 15 hours ago

Yes, though my launch parameters were somewhat different. @wangshuai09

wangshuai09 commented 11 hours ago

> Yes, though my launch parameters were somewhat different. @wangshuai09

My test above passes. You can pull the latest code and check whether it runs with your parameters.