vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Fail to use CUDA with multiprocessing (llama_3_8b) #10800


yliu2702 commented 4 days ago

Your current environment

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

No response

🐛 Describe the bug

When I load the LLM as:

```python
llm = LLM(
    model=model_id,
    tokenizer=model_id,
    download_dir=cache_dir,
    dtype='half',
    tensor_parallel_size=2,
    gpu_memory_utilization=0.75,
    enable_lora=False,
)
```

I get the error `RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method`. I tried loading llama_3_8b with Hugging Face, and generation completes fine on 2 GPUs; I am only trying vLLM to speed up the generation process. Can anyone help me with this error? Thanks a lot!

Best, Yi Nov 30th, 2024
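The error message itself points at the 'spawn' start method. Below is a minimal sketch of one possible workaround, assuming a vLLM version that reads the `VLLM_WORKER_MULTIPROC_METHOD` environment variable; the model id `meta-llama/Meta-Llama-3-8B` and the cache path are placeholders.

```python
import os

# Ask vLLM to start its tensor-parallel workers with 'spawn' instead of 'fork'.
# Set this before vLLM spins up its workers.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM

if __name__ == "__main__":
    model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder, adjust as needed
    cache_dir = "/path/to/cache"             # placeholder, adjust as needed

    llm = LLM(
        model=model_id,
        tokenizer=model_id,
        download_dir=cache_dir,
        dtype="half",
        tensor_parallel_size=2,
        gpu_memory_utilization=0.75,
        enable_lora=False,
    )
```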


jeejeelee commented 3 days ago

Could you please provide your running script? In general, this kind of issue occurs when CUDA has already been initialized in the parent process.

yliu2702 commented 3 days ago

I am assigned 2 GPUs (or more if I request them). For example, `n_devices = torch.cuda.device_count() if torch.cuda.is_available() else 1` gives `n_devices = 2`. But I still can't load the vLLM model. Can you explain more? Thank you!

jeejeelee commented 3 days ago

Calling `torch.cuda.is_available()` initializes CUDA and leads to the error you mentioned above. You can remove it and try again.
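For example (a sketch not taken from the thread; `visible_gpu_count` is a hypothetical helper), the device count can be read from `CUDA_VISIBLE_DEVICES` instead of `torch.cuda`, which leaves CUDA uninitialized in the parent process:

```python
import os

def visible_gpu_count(default: int = 1) -> int:
    """Hypothetical helper: count GPUs from CUDA_VISIBLE_DEVICES without touching torch.cuda."""
    devices = os.environ.get("CUDA_VISIBLE_DEVICES")
    if not devices:
        # Variable unset or empty: fall back to a caller-provided default.
        return default
    return len([d for d in devices.split(",") if d.strip()])

n_devices = visible_gpu_count(default=2)  # e.g. 2 when the scheduler assigns two GPUs
```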

Liuqh12 commented 1 day ago

`torch.cuda.get_device_capability('cuda:0')` should also be avoided, since it initializes CUDA as well.
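If the compute capability really is needed before constructing the `LLM`, one possible alternative (a sketch, assuming the `nvidia-ml-py`/`pynvml` package is available) is to query it through NVML rather than `torch.cuda`, which does not initialize CUDA in the parent process:

```python
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
pynvml.nvmlShutdown()

print(f"GPU 0 compute capability: {major}.{minor}")
```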