vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: support for Cambricon MLU #9649

Open a120092009 opened 1 month ago

a120092009 commented 1 month ago

🚀 The feature, motivation and pitch

I am a developer from Cambricon, an AI chip vendor in China. We have already supported vLLM 0.6.1.post2 on Cambricon MLU internally. We wish to contribute the MLU adaptation code to the vLLM project, and the pull request (PR) will be ready in November. Additionally, we welcome contributions from other developers.

Alternatives

No response

Additional context

No response


BANGzys commented 1 month ago

Hello, I get an error when starting a large model with vLLM. The rough cause is that vLLM looks for NVIDIA CUDA by default, so how can I run it on the Cambricon cards in this machine? Error:

(vllm2) [root@localhost envs]# vllm serve /home/models/qwen2-72b-instruct-int4/ --tensor-parallel-size 4 --host 0.0.0.0 --port 8000
WARNING 10-22 18:22:03 _custom_ops.py:19] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
INFO 10-22 18:22:07 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 10-22 18:22:07 api_server.py:529] args: Namespace(subparser='serve', model_tag='/home/models/qwen2-72b-instruct-int4/', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/home/models/qwen2-72b-instruct-int4/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7fec8402fac0>)
INFO 10-22 18:22:07 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/bd7d65b0-3358-496e-a3c5-e8c5a4d62c39 for IPC Path.
INFO 10-22 18:22:07 api_server.py:179] Started engine process with PID 101689
Traceback (most recent call last):
  File "/root/anaconda3/envs/vllm2/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/vllm/scripts.py", line 195, in main
    args.dispatch_function(args)
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/vllm/scripts.py", line 41, in serve
    uvloop.run(run_server(args))
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/root/anaconda3/envs/vllm2/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/root/anaconda3/envs/vllm2/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 184, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 902, in create_engine_config
    device_config = DeviceConfig(device=self.device)
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/vllm/config.py", line 1091, in __init__
    raise RuntimeError("Failed to infer device type")
RuntimeError: Failed to infer device type
WARNING 10-22 18:22:10 _custom_ops.py:19] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vllm2/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/vllm2/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 135, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 902, in create_engine_config
    device_config = DeviceConfig(device=self.device)
  File "/root/anaconda3/envs/vllm2/lib/python3.10/site-packages/vllm/config.py", line 1091, in __init__
    raise RuntimeError("Failed to infer device type")
RuntimeError: Failed to infer device type
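For context, the failure happens while the engine config is being built: with --device auto (the default, as the args dump shows), vLLM 0.6.3.post1 tries to infer the device type, and on a machine with no NVIDIA runtime it raises "Failed to infer device type" before any model loading starts, because stock vLLM has no Cambricon MLU backend to fall back to. A minimal reproduction sketch, assuming the same vLLM 0.6.3.post1 install as in the log; DeviceConfig and the call site are taken from the traceback, while the device_type attribute name is an assumption about that version:

# Minimal sketch, assuming vLLM 0.6.3.post1 as in the log above.
# DeviceConfig(device="auto") is the same call made by create_engine_config()
# (vllm/engine/arg_utils.py line 902 in the traceback).
from vllm.config import DeviceConfig

try:
    cfg = DeviceConfig(device="auto")
    # device_type is assumed to hold the inferred backend name ("cuda", "cpu", ...).
    print("inferred device:", cfg.device_type)
except RuntimeError as err:
    # Expected on a CUDA-less machine that only has MLU cards and upstream vLLM:
    # no supported backend is detected, so serving stops here.
    print("device inference failed:", err)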

a120092009 commented 4 weeks ago


Adapting vLLM to MLU requires Cambricon's latest SDK. Please contact Cambricon's technical support team through your purchase channel.