vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Fail to use CUDA with multiprocessing (llama_3_8b) #10800


yliu2702 commented 4 days ago

Your current environment

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

No response

🐛 Describe the bug

When I load the LLM as:

```python
llm = LLM(
    model=model_id,
    tokenizer=model_id,
    download_dir=cache_dir,
    dtype='half',
    tensor_parallel_size=2,
    gpu_memory_utilization=0.75,
    enable_lora=False,
)
```

I get the error `RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method`. I tried loading llama_3_8b with Hugging Face, and generation completes fine on 2 GPUs; I am only trying vLLM to speed up the generation process. Can anyone help me with this error? Thanks a lot!

Best, Yi Nov 30th, 2024
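The error message itself points at the 'spawn' start method. Below is a minimal sketch of one possible workaround, assuming a vLLM version that reads the `VLLM_WORKER_MULTIPROC_METHOD` environment variable; the model id `meta-llama/Meta-Llama-3-8B` and the cache path are placeholders.

```python
import os

# Ask vLLM to start its tensor-parallel workers with 'spawn' instead of 'fork'.
# Set this before vLLM spins up its workers.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM

if __name__ == "__main__":
    model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder, adjust as needed
    cache_dir = "/path/to/cache"             # placeholder, adjust as needed

    llm = LLM(
        model=model_id,
        tokenizer=model_id,
        download_dir=cache_dir,
        dtype="half",
        tensor_parallel_size=2,
        gpu_memory_utilization=0.75,
        enable_lora=False,
    )
```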


jeejeelee commented 3 days ago

Could you please provide your running script? In general, this kind of issue occurs when CUDA has already been initialized in the parent process.

yliu2702 commented 3 days ago

I am assigned 2 GPUs (or more if I request them). For example, `n_devices = torch.cuda.device_count() if torch.cuda.is_available() else 1` gives `n_devices = 2`. But I still can't load the vLLM model. Can you explain more? Thank you!

jeejeelee commented 3 days ago

Calling `torch.cuda.is_available()` initializes CUDA and leads to the error you mentioned above. You can remove it and try again.
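For example (a sketch not taken from the thread; `visible_gpu_count` is a hypothetical helper), the device count can be read from `CUDA_VISIBLE_DEVICES` instead of `torch.cuda`, which leaves CUDA uninitialized in the parent process:

```python
import os

def visible_gpu_count(default: int = 1) -> int:
    """Hypothetical helper: count GPUs from CUDA_VISIBLE_DEVICES without touching torch.cuda."""
    devices = os.environ.get("CUDA_VISIBLE_DEVICES")
    if not devices:
        # Variable unset or empty: fall back to a caller-provided default.
        return default
    return len([d for d in devices.split(",") if d.strip()])

n_devices = visible_gpu_count(default=2)  # e.g. 2 when the scheduler assigns two GPUs
```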

Liuqh12 commented 1 day ago

`torch.cuda.get_device_capability('cuda:0')` should also be avoided, since it initializes CUDA as well.
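If the compute capability really is needed before constructing the `LLM`, one possible alternative (a sketch, assuming the `nvidia-ml-py`/`pynvml` package is available) is to query it through NVML rather than `torch.cuda`, which does not initialize CUDA in the parent process:

```python
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
pynvml.nvmlShutdown()

print(f"GPU 0 compute capability: {major}.{minor}")
```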