wangshuai09 opened 1 week ago
Is only offline inference supported for now? The OpenAI-compatible server fails on startup with the error below:

```
INFO 09-12 10:08:00 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-12 10:08:00 selector.py:161] Using ASCEND_TORCH backend.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/root/miniconda3/envs/Python310/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/miniconda3/envs/Python310/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/vllm/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/vllm/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/vllm/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/vllm/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/vllm/vllm/engine/llm_engine.py", line 324, in __init__
    self.model_executor = executor_class(
  File "/vllm/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/vllm/vllm/executor/gpu_executor.py", line 38, in _init_executor
    self.driver_worker = self._create_worker()
  File "/vllm/vllm/executor/gpu_executor.py", line 105, in _create_worker
    return create_worker(**self._get_create_worker_kwargs(
  File "/vllm/vllm/executor/gpu_executor.py", line 24, in create_worker
    wrapper.init_worker(**kwargs)
  File "/vllm/vllm/worker/worker_base.py", line 449, in init_worker
    self.worker = worker_class(*args, **kwargs)
  File "/vllm/vllm/worker/worker.py", line 99, in __init__
    self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
  File "/vllm/vllm/worker/model_runner.py", line 888, in __init__
    self.attn_state = self.attn_backend.get_state_cls()(
  File "/vllm/vllm/attention/backends/abstract.py", line 43, in get_state_cls
    raise NotImplementedError
NotImplementedError
ERROR 09-12 10:08:02 api_server.py:188] RPCServer process died before responding to readiness probe
```
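The `NotImplementedError` comes from the abstract `AttentionBackend.get_state_cls()` in `vllm/attention/backends/abstract.py`, which each concrete attention backend must override. A minimal sketch of that pattern follows; the names `AscendTorchBackend` and `AscendAttentionState` are hypothetical stand-ins, not the actual `npu_support` branch code:

```python
# Sketch of vLLM's attention-backend pattern: the abstract base raises
# NotImplementedError, so a backend that does not override get_state_cls()
# fails exactly as in the traceback above. Names marked below are
# hypothetical, not real vLLM/npu_support identifiers.
from typing import Type


class CommonAttentionState:
    """Placeholder for vLLM's per-backend attention state."""


class AttentionBackend:
    @staticmethod
    def get_state_cls() -> Type["CommonAttentionState"]:
        # This is what the server path hits for the ASCEND_TORCH backend.
        raise NotImplementedError


class AscendAttentionState(CommonAttentionState):
    """NPU-specific attention state (hypothetical name)."""


class AscendTorchBackend(AttentionBackend):
    """NPU backend (hypothetical name)."""

    @staticmethod
    def get_state_cls() -> Type[CommonAttentionState]:
        # Overriding this is what the NPU backend would need so that
        # model_runner's `self.attn_backend.get_state_cls()(...)` succeeds.
        return AscendAttentionState
```

Until the NPU backend implements this override, any path that calls `get_state_cls()` (the OpenAI server's model-runner init does; the offline path apparently does not) will die with this error.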
@beardog6 This is still under active development and these features have not been debugged yet. Contributions are welcome; the development branch is npu_support.
@beardog6 Is this the scenario you tested?
```shell
# start server
vllm serve facebook/opt-125m

# request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "temperature": 0
    }'
```
Output:

```json
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live. I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}
```
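For reference, the same request can be made from Python. This is a sketch equivalent to the curl command above (URL, headers, and body copied from it; the helper function names are my own, and sending requires the third-party `requests` package and a running server):

```python
"""Python equivalent of the curl request above, for a locally running
`vllm serve facebook/opt-125m`."""
import json


def build_completion_request(model: str, prompt: str,
                             max_tokens: int = 20,
                             temperature: float = 0.0):
    """Assemble URL, headers, and JSON body for POST /v1/completions."""
    url = "http://localhost:8000/v1/completions"
    headers = {"Content-Type": "application/json"}
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return url, headers, json.dumps(payload)


def send_completion_request(model: str, prompt: str) -> str:
    """POST the request and return choices[0].text (needs a live server)."""
    import requests  # third-party; only needed when actually sending
    url, headers, body = build_completion_request(model, prompt)
    resp = requests.post(url, headers=headers, data=body)
    return resp.json()["choices"][0]["text"]
```

With the server up, `send_completion_request("facebook/opt-125m", "San Francisco is a")` should return the same completion text as the curl output above.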
Yes, the launch parameters are different. @wangshuai09
My test above passes. You can pull the latest code and check whether it works with your parameters.
```shell
docker pull ascendai/pytorch:2.1.0-ubuntu22.04
docker run -p 2022:22 --name test-vllm \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -itd ascendai/pytorch:2.1.0-ubuntu22.04 bash

# inside the container: build vLLM for NPU, then run the offline example
VLLM_TARGET_DEVICE=npu pip install -e .
python examples/offline_inference_npu.py
```