vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: ray error when tp>=2 #6090

Closed. Jimmy-Lu closed this issue 3 months ago.

Jimmy-Lu commented 3 months ago

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.31

Python version: 3.9.19 (main, May  6 2024, 19:43:03)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-144-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe

Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7542 32-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         1615.105
CPU max MHz:                     2900.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        5799.77
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        32 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-31,64-95
NUMA node1 CPU(s):               32-63,96-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Vulnerable
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] transformers              4.42.3                   pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     0-31,64-95      0               N/A
GPU1    SYS      X      SYS     32-63,96-127    1               N/A
NIC0    SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

πŸ› Describe the bug

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM with tensor parallelism across two GPUs.
llm = LLM(model="/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/", tensor_parallel_size=2)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

If I just run python vllm_test.py as above, I get:

python vllm_test.py
2024-07-03 11:58:35,813 INFO worker.py:1724 -- Started a local Ray instance.
INFO 07-03 11:58:36 config.py:623] Defaulting to use mp for distributed inference
INFO 07-03 11:58:36 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/', speculative_config=None, tokenizer='/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-03 11:58:42,096 INFO worker.py:1724 -- Started a local Ray instance.
[2024-07-03 11:58:43,413 E 35338 35338] core_worker.cc:215: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
ERROR 07-03 11:58:43 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 35338 died, exit code: 1
INFO 07-03 11:58:43 multiproc_worker_utils.py:123] Killing local vLLM worker processes

Then I run ray stop and get:

ray stop
Stopped only 0 out of 135 Ray processes within the grace period 16 seconds. Set `-v` to see more details. Remaining processes [psutil.Process(pid=28716, name='raylet', status='zombie', started='11:58:34'), psutil.Process(pid=29027, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28960, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28884, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28875, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28857, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28923, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28848, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28989, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28968, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28992, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=29025, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=29024, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28995, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28971, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28917, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28930, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28855, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28858, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28849, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28919, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28910, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28932, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28965, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28924, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28851, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28871, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28842, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28882, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28926, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28873, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28982, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28983, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28994, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28864, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28973, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28887, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28963, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28878, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28876, name='ray::IDLE', status='terminated', 
started='11:58:35'), psutil.Process(pid=28985, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28997, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28867, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28976, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28933, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28966, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=29000, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28880, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28956, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28869, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28978, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28969, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28861, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28959, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28874, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28856, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28844, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28847, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28792, name='python', status='terminated', started='11:58:34'), psutil.Process(pid=28922, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28988, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28991, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28967, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28913, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28935, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28970, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28794, name='python', status='terminated', started='11:58:34'), psutil.Process(pid=28854, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28845, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28990, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=29026, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28915, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28862, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28928, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28920, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28717, name='python', status='zombie', started='11:58:34'), psutil.Process(pid=28898, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28931, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28999, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28979, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28841, name='ray::IDLE', status='terminated', started='11:58:34'), 
psutil.Process(pid=28883, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28881, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28872, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28981, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28863, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28885, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28929, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28961, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28962, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28972, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28670, name='python', status='zombie', started='11:58:34'), psutil.Process(pid=28865, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28974, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=29028, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28888, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28964, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28911, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28879, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28870, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28852, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28916, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28840, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28957, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28860, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28993, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28996, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=29029, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28918, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28975, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28843, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28859, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28921, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28986, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=29022, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28912, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28925, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28850, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28853, name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28914, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28927, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28984, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28846, 
name='ray::IDLE', status='terminated', started='11:58:34'), psutil.Process(pid=28987, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28866, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28671, name='python', status='zombie', started='11:58:34'), psutil.Process(pid=28886, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28877, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28998, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28868, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28977, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28934, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28980, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28958, name='ray::IDLE', status='terminated', started='11:58:35'), psutil.Process(pid=28508, name='gcs_server', status='zombie', started='11:58:32')] will be forcefully terminated.
You can also use `--force` to forcefully terminate processes or set higher `--grace-period` to wait longer time for proper termination.
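
For cleanup, the leftover Ray processes can be terminated with the flags the message above mentions, for example:

ray stop --force                 # terminate the remaining Ray processes immediately
ray stop -v --grace-period 60    # or wait longer and print per-process details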

If I first run ray start --head --num-gpus 2, Ray starts correctly. But when I then run python vllm_test.py, I get:

2024-07-03 12:00:47,549 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.233.75.126:6379...
2024-07-03 12:00:47,558 INFO worker.py:1724 -- Connected to Ray cluster.
INFO 07-03 12:00:48 config.py:623] Defaulting to use mp for distributed inference
INFO 07-03 12:00:48 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/', speculative_config=None, tokenizer='/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-03 12:00:50,833 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.233.75.126:6379...
2024-07-03 12:00:50,842 INFO worker.py:1724 -- Connected to Ray cluster.
INFO 07-03 12:00:50 config.py:623] Defaulting to use mp for distributed inference
INFO 07-03 12:00:50 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/', speculative_config=None, tokenizer='/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/miniconda3/envs/vllm/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/root/miniconda3/envs/vllm/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/root/miniconda3/envs/vllm/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/root/miniconda3/envs/vllm/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/root/miniconda3/envs/vllm/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/root/miniconda3/envs/vllm/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/root/miniconda3/envs/vllm/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/cephfs/swlu/code/test_scripts/vllm_test.py", line 11, in <module>
    llm = LLM(model="/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/", tensor_parallel_size=2)
  File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 144, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 363, in from_engine_args
    engine = cls(
  File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 223, in __init__
    self.model_executor = executor_class(
  File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 48, in _init_executor
    self.workers = [
  File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 49, in <listcomp>
    ProcessWorkerWrapper(
  File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 162, in __init__
    self.process.start()
  File "/root/miniconda3/envs/vllm/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/root/miniconda3/envs/vllm/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/root/miniconda3/envs/vllm/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/root/miniconda3/envs/vllm/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/root/miniconda3/envs/vllm/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/root/miniconda3/envs/vllm/lib/python3.9/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/root/miniconda3/envs/vllm/lib/python3.9/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
ERROR 07-03 12:00:52 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 46705 died, exit code: 1
INFO 07-03 12:00:52 multiproc_worker_utils.py:123] Killing local vLLM worker processes
^Z
[5]+  Stopped                 python vllm_test.py
Jimmy-Lu commented 3 months ago

Ray version is 2.9.0.

youkaichao commented 3 months ago

please see https://github.com/vllm-project/vllm/issues/5637
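
A minimal sketch of the repro script wrapped in a main guard, following the fix discussed in that issue (same model path and prompts as above), so that spawned worker processes do not re-execute the module-level code:

from vllm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model="/localssd/swlu/Qwen1.5-MoE-A2.7B-Chat/", tensor_parallel_size=2)
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

# The guard keeps multiprocessing's spawn start method from re-running this module in each worker.
if __name__ == '__main__':
    main()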

Jimmy-Lu commented 3 months ago

Adding if __name__ == '__main__': works for me, but at the end there are still some logs:

Processed prompts: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:00<00:00, 18.26it/s, est. speed input: 100.47 toks/s, output: 292.28 toks/s]
Prompt: 'Hello, my name is', Generated text: " Kees. I am a very passionate and professional photographer.\nI've always been"
Prompt: 'The president of the United States is', Generated text: ' the head of state and the government. A. ι”™θ―― B.'
Prompt: 'The capital of France is', Generated text: ' a city that is full of history, culture and beauty. It is a city'
Prompt: 'The future of AI is', Generated text: ' bright and it will be a part of our lives in some way or the other'
*** SIGTERM received at time=1719989502 on cpu 71 ***
PC: @     0x7f74170f2374  (unknown)  pthread_cond_wait@@GLIBC_2.3.2
    @     0x7f7416dda090  (unknown)  (unknown)
    @ ... and at least 1 more frames
[2024-07-03 14:51:42,304 E 17965 10670] logging.cc:440: *** SIGTERM received at time=1719989502 on cpu 71 ***
[2024-07-03 14:51:42,304 E 17965 10670] logging.cc:440: PC: @     0x7f74170f2374  (unknown)  pthread_cond_wait@@GLIBC_2.3.2
[2024-07-03 14:51:42,304 E 17965 10670] logging.cc:440:     @     0x7f7416dda090  (unknown)  (unknown)
[2024-07-03 14:51:42,304 E 17965 10670] logging.cc:440:     @ ... and at least 1 more frames
INFO 07-03 14:51:46 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Are these ok?

youkaichao commented 3 months ago

These are some benign errors that @njhill should be working on. They should not affect your inference task, I think.

Jimmy-Lu commented 3 months ago

Yes, no effect. Just curious.