vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: #7077

Closed. yanghengjian closed this issue 3 months ago.

yanghengjian commented 3 months ago

Your current environment

vllm 0.5.3.post1, vllm-flash-attn 2.5.9.post1

🐛 Describe the bug

```
(agiclass) root@autodl-container-c9174bac52-9e557856:~# cd /root/autodl-tmp/project/vllm/vllm/entrypoints/openai
(agiclass) root@autodl-container-c9174bac52-9e557856:~/autodl-tmp/project/vllm/vllm/entrypoints/openai# python api_server.py --model /root/autodl-tmp/llm/Qwen2-72B-Instruct-GPTQ-Int4 --quantization gptq --tensor_parallel_size 4 --port 6006
INFO 08-02 18:15:14 api_server.py:309] vLLM API server version 0.5.3.post1
INFO 08-02 18:15:14 api_server.py:310] args: Namespace(host=None, port=6006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/root/autodl-tmp/llm/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='gptq', rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-02 18:15:14 gptq_marlin.py:91] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
WARNING 08-02 18:15:14 config.py:246] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-02 18:15:14 config.py:715] Defaulting to use mp for distributed inference
INFO 08-02 18:15:14 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/root/autodl-tmp/llm/Qwen2-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/root/autodl-tmp/llm/Qwen2-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/root/autodl-tmp/llm/Qwen2-72B-Instruct-GPTQ-Int4, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-02 18:15:15 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=13359) INFO 08-02 18:15:15 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=13360) INFO 08-02 18:15:15 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=13359) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=13359) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/agiclass/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=13359) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=13359) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/agiclass/lib/python3.10/site-packages/vllm/worker/worker.py", line 123, in init_device
(VllmWorkerProcess pid=13359) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226]     torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=13359) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/agiclass/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
(VllmWorkerProcess pid=13359) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226]     torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=13359) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/agiclass/lib/python3.10/site-packages/torch/cuda/__init__.py", line 279, in _lazy_init
(VllmWorkerProcess pid=13359) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226]     raise RuntimeError(
(VllmWorkerProcess pid=13359) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=13360) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method [traceback identical to pid=13359]
(VllmWorkerProcess pid=13361) INFO 08-02 18:15:15 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=13361) ERROR 08-02 18:15:15 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method [traceback identical to pid=13359]
```
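The traceback shows each tensor-parallel worker being forked after CUDA has already been initialized in the parent process (running api_server.py directly from inside the source tree makes this easy to trigger). The sketch below is a minimal, unofficial workaround rather than a confirmed fix for this issue: it forces vLLM's worker processes to use the spawn start method via the VLLM_WORKER_MULTIPROC_METHOD environment variable (present in recent vLLM releases; treat its availability in 0.5.3.post1 as an assumption) and keeps the entry point behind an `if __name__ == "__main__"` guard, which spawn requires.

```python
# Hedged sketch, not an official fix for this issue.
# Assumption: vLLM reads VLLM_WORKER_MULTIPROC_METHOD and "spawn" avoids
# inheriting the parent's CUDA context in the tensor-parallel workers.
import os

# Must be set before the engine creates its worker processes.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams


def main():
    llm = LLM(
        model="/root/autodl-tmp/llm/Qwen2-72B-Instruct-GPTQ-Int4",
        quantization="gptq",
        tensor_parallel_size=4,
    )
    params = SamplingParams(max_tokens=32)
    out = llm.generate(["Hello"], params)
    print(out[0].outputs[0].text)


if __name__ == "__main__":
    # Required with spawn: otherwise every child re-imports and
    # re-executes this script on startup.
    main()
```

For the OpenAI-compatible server, launching via the installed package (for example `python -m vllm.entrypoints.openai.api_server ...`) instead of running api_server.py from inside the checked-out source tree is another commonly suggested variant of the same idea.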


The problem occurs when using embeddings.

Jerry-jwz commented 3 months ago

I ran into the same problem. How did you solve it?

Zhihuihuo commented 3 months ago

I hit the same problem when attempting to launch the Qwen2-72B model with 4 GPU cards. How can it be solved?

Environment: torch 2.3.1, transformers 4.43.3, transformers-stream-generator 0.0.4, vllm 0.5.3.post1, vllm-flash-attn 2.5.9.post1

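For the multi-GPU launch described above, another hedged option is to route the tensor-parallel workers through Ray instead of the fork-based multiprocessing executor. The sketch assumes `ray` is installed and that the engine accepts `distributed_executor_backend="ray"` (available in recent vLLM releases); it is an illustration, not a confirmed resolution of this issue.

```python
# Minimal sketch: use the Ray executor for tensor parallelism instead of
# vLLM's fork-based multiprocessing executor. Paths are hypothetical.
from vllm import LLM, SamplingParams


def main():
    llm = LLM(
        model="/path/to/Qwen2-72B-Instruct-GPTQ-Int4",  # hypothetical local path
        quantization="gptq",
        tensor_parallel_size=4,
        distributed_executor_backend="ray",
    )
    out = llm.generate(["ping"], SamplingParams(max_tokens=8))
    print(out[0].outputs[0].text)


if __name__ == "__main__":
    main()
```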

tangtianyi19980709 commented 3 months ago

Hi, did you solve this problem?

LSC527 commented 3 months ago

Same problem with vllm==0.5.4.