vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Unable to load Llama-3.1-70B-Instruct using either `vllm serve` or `vllm-openai` docker #10156

Closed: SMAntony closed this issue 3 weeks ago

SMAntony commented 3 weeks ago

Your current environment

The output of `python collect_env.py`

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04 LTS (x86_64)
GCC version: (Ubuntu 13.2.0-23ubuntu4) 13.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.10.15 (main, Nov 8 2024, 11:29:01) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-1018-aws-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.0.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G

Nvidia driver version: 555.42.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R32
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 0
BogoMIPS: 5599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 768 KiB (24 instances)
L1i cache: 768 KiB (24 instances)
L2 cache: 12 MiB (24 instances)
L3 cache: 96 MiB (6 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-47            0               N/A
GPU1    PHB      X      PHB     PHB     0-47            0               N/A
GPU2    PHB     PHB      X      PHB     0-47            0               N/A
GPU3    PHB     PHB     PHB      X      0-47            0               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

LD_LIBRARY_PATH=/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/cv2/../../lib64:
CUDA_MODULE_LOADING=LAZY
```

Model Input Dumps

No response

🐛 Describe the bug

Tried to run the model using the vllm pip package (version 0.6.3.post1). Command that I ran:

vllm serve $model --host 0.0.0.0 --port 8010 --dtype bfloat16 --enforce-eager --gpu-memory-utilization 0.95 --api-key RnD12345 --max-model-len 4098 --tensor-parallel-size 4 --quantization bitsandbytes --load-format bitsandbytes 

Tried using the Docker image as well. `docker image inspect` output:

[
    {
        "Id": "sha256:9de570dfcdfca4effabe006779215326faf1f812c1d522652c5010801b8e6d78",
        "RepoTags": [
            "vllm/vllm-openai:latest"
        ],
        "RepoDigests": [
            "vllm/vllm-openai@sha256:facbbd4a92c1754675b239a5f22a281ed3aa8bde64662db8919d85d670673aa7"
        ],
        "Parent": "",
        "Comment": "buildkit.dockerfile.v0",
        "Created": "2024-10-17T10:54:46.210678506-07:00",
        "DockerVersion": "",
        "Author": "",
        "Config": {
            "Hostname": "",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "NVARCH=x86_64",
                "NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536",
                "NV_CUDA_CUDART_VERSION=12.4.127-1",
                "NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-4",
                "CUDA_VERSION=12.4.1",
                "LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
                "NVIDIA_VISIBLE_DEVICES=all",
                "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
                "DEBIAN_FRONTEND=noninteractive",
                "VLLM_USAGE_SOURCE=production-docker-image"
            ],
            "Cmd": null,
            "Image": "",
            "Volumes": null,
            "WorkingDir": "/vllm-workspace",
            "Entrypoint": [
                "python3",
                "-m",
                "vllm.entrypoints.openai.api_server"
            ],
            "OnBuild": null,
            "Labels": {
                "maintainer": "NVIDIA CORPORATION <cudatools@nvidia.com>",
                "org.opencontainers.image.ref.name": "ubuntu",
                "org.opencontainers.image.version": "22.04"
            }
        },
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 10433172732,
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/6434a00d5febd5b0e5c5f649ba730d6eecb98569a1ca91d070b6d91e76413c74/diff:/var/lib/docker/overlay2/dd0e5c6c529ddf94791c0f64d2ad36ec5d38cfc0a20062e11b12bd532d7c8d1c/diff:/var/lib/docker/overlay2/59dc2d24ef6995a6d7cd14e5d48cd0f8794e011f90ad682edbd71e7167a7ef96/diff:/var/lib/docker/overlay2/4d3d040ed53719f59f0c9dd300e4e05cc7f71dc19fb62468b5838174533909d0/diff:/var/lib/docker/overlay2/fc4f6caf8e0c462529b2209e60fac58cc1bbaa56a5dec19411f21d6b611fed10/diff:/var/lib/docker/overlay2/62d1f4e9911d8387c12c7386bb89aa91d689e5a5e7a50af3a1ae1a3ead2aac22/diff:/var/lib/docker/overlay2/0ea99b169b186ada6653ac6dda2617f17fa3da9112e6ee339937fff09ca69751/diff:/var/lib/docker/overlay2/8821136cefb26d4a59cc1bc75b3c7c57d2c65d5415e6f9e52994098a92969757/diff:/var/lib/docker/overlay2/3b814768ba4a7e39ee721e9c8365a82843ae46575a18d7ee6ce416d7f6c27784/diff:/var/lib/docker/overlay2/577f5d8f7131ef41ab3edc0d0be71053be04108b59ddcde1d768801bc928d3f0/diff:/var/lib/docker/overlay2/70edcf83f3b2e679e35246cc963136beddb591a7fecf966a8ff386aca74baf78/diff:/var/lib/docker/overlay2/3d8c594fc1d94e07c114d452c861daf69920d0f6c03c0988fb4c20254811ff27/diff",
                "MergedDir": "/var/lib/docker/overlay2/cd24f65cdf9dad054219845152d5e2490a94ca494c3f65abace5358179f813f1/merged",
                "UpperDir": "/var/lib/docker/overlay2/cd24f65cdf9dad054219845152d5e2490a94ca494c3f65abace5358179f813f1/diff",
                "WorkDir": "/var/lib/docker/overlay2/cd24f65cdf9dad054219845152d5e2490a94ca494c3f65abace5358179f813f1/work"
            },
            "Name": "overlay2"
        },
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:e0a9f5911802534ba097660206feabeb0247a81e409029167b30e2e1f2803b57",
                "sha256:47654eeadbc543f1dd44ebb41c6ca0954b9f1813efabdcd46b0e7f17ac4e9fd1",
                "sha256:efe2b79b53de08e199a2ce107d83adc0cbdff94f605e03f5ef51f3df3ae31cfd",
                "sha256:46d54736d31f4bfeb749544e22e8611dba11128bdb2cebb820a3b452b50e7d52",
                "sha256:809d3bb9c80fb3d31d4c061ba0b38ba4e83b6329e33c2cb2bbf27251a8e527c6",
                "sha256:c1773f613c662964cd11fa206779f38c92fb00e3a835e7931070285cfd68aadd",
                "sha256:e37cbea0d17fcd8906d4136f2eab993b6eef803eb8141933f596e7b6d78c6636",
                "sha256:e819f72bf6d3ec7ed9a85b73f5b1af866fe7765b1d0a7a9908cbca1fde836038",
                "sha256:0f1574f66fa6175a23113c89ce6fb60f321c30c428ccccf2f4061c42375fa766",
                "sha256:387174fca07f20756620f4cac2d43d9bfd5eb1fb2fba3d3196e37d8302251d38",
                "sha256:ff45bc8aa7f980af223369d59806c1a27fd6032bd82734be46f860ad2e7f8325",
                "sha256:1ad2df56152854745a66ec7b3fb7547ab6f07013d11ec2d91da353cbddd3d8d0",
                "sha256:5749f2a270664189ce6f8f41e9909053b6f3fcf8e7c4a9ac156c68a188d69b02"
            ]
        },
        "Metadata": {
            "LastTagTime": "0001-01-01T00:00:00Z"
        }
    }
]

Docker command used:

docker run --name llm_dist --gpus all \
    -v $volume:/data \
    -p 8010:80 \
    --ipc=host \
    --runtime=nvidia \
    vllm/vllm-openai:latest \
    --model $model \
    --dtype bfloat16 \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --tensor-parallel-size 4

vLLM is stuck at loading in both cases. `nvidia-smi` output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1B.0 Off |                    0 |
|  0%   31C    P0             65W /  300W |     516MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    Off |   00000000:00:1C.0 Off |                    0 |
|  0%   31C    P0             63W /  300W |     524MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    Off |   00000000:00:1D.0 Off |                    0 |
|  0%   32C    P0             67W /  300W |     524MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   31C    P0             66W /  300W |     484MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

vLLM log:

WARNING 11-08 11:49:58 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-08 11:49:58 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=36253) INFO 11-08 11:50:03 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=36252) INFO 11-08 11:50:03 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=36254) INFO 11-08 11:50:03 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-08 11:50:04 utils.py:1008] Found nccl from library libnccl.so.2
INFO 11-08 11:50:04 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=36252) INFO 11-08 11:50:04 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=36253) INFO 11-08 11:50:04 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=36254) INFO 11-08 11:50:04 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=36252) INFO 11-08 11:50:04 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=36253) INFO 11-08 11:50:04 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=36254) INFO 11-08 11:50:04 pynccl.py:63] vLLM is using nccl==2.20.5


DarkLight1337 commented 3 weeks ago

Please go through the troubleshooting guide and see if it can help resolve your issue.

SMAntony commented 3 weeks ago

Has anyone tried loading a 70B model on AWS (g5.12xlarge)?

SMAntony commented 3 weeks ago

> Please go through the troubleshooting guide and see if it can help resolve your issue.

It does not help my situation sadly :(

DarkLight1337 commented 3 weeks ago

cc @youkaichao

youkaichao commented 3 weeks ago

The troubleshooting guide helps you collect more information, and you should provide that information so we can help with debugging.

Meanwhile, you are running a Llama 70B model (which requires at least 140 GB of GPU memory), but you only have 4 A10G GPUs (only 92 GB of memory in total).

SMAntony commented 3 weeks ago

> The troubleshooting guide helps you collect more information, and you should provide that information so we can help with debugging.
>
> Meanwhile, you are running a Llama 70B model (which requires at least 140 GB of GPU memory), but you only have 4 A10G GPUs (only 92 GB of memory in total).

I am using bitsandbytes quantization, so if I am not wrong, that should decrease the GPU memory requirement to around 70 GB, which is enough to fit the 70B model.
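
As a rough back-of-the-envelope check (an editor's sketch that counts only the model weights and ignores the KV cache, activations, and per-process CUDA overhead):

$$
\begin{aligned}
\text{bf16 weights:} &\quad 70 \times 10^{9}\ \text{params} \times 2\ \text{bytes} \approx 140\ \text{GB} \\
\text{8-bit quantized:} &\quad 70 \times 10^{9} \times 1\ \text{byte} \approx 70\ \text{GB} \\
\text{4-bit quantized:} &\quad 70 \times 10^{9} \times 0.5\ \text{byte} \approx 35\ \text{GB} \\
\text{available:} &\quad 4 \times {\sim}23\ \text{GB} \approx 92\ \text{GB}
\end{aligned}
$$

So roughly 70 GB of 8-bit weights would fit in 92 GB only with little headroom for the KV cache, while 4-bit weights leave considerably more room.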

SMAntony commented 3 weeks ago

(screenshot attached)

Unfortunately there is no workaround mentioned.

youkaichao commented 3 weeks ago

Hit Ctrl+C to see where the process is executing.

youkaichao commented 3 weeks ago

You can also add `--load-format dummy` to skip weight loading from disk, to isolate the root cause.

SMAntony commented 3 weeks ago

I see, I really appreciate your help. Will share results asap.

SMAntony commented 3 weeks ago

> You can also add `--load-format dummy` to skip weight loading from disk, to isolate the root cause.

Still stuck when loading dummy weights.

Output from the vLLM Docker container:

INFO 11-12 00:34:53 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 11-12 00:34:53 api_server.py:529] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/models--meta-llama--llama-3.1-70b-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='dummy', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 11-12 00:34:53 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/318c6aef-9ee2-40c5-aef6-8410a465f6f7 for IPC Path.
INFO 11-12 00:34:53 api_server.py:179] Started engine process with PID 36
INFO 11-12 00:34:56 config.py:905] Defaulting to use mp for distributed inference
WARNING 11-12 00:34:56 arg_utils.py:957] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
WARNING 11-12 00:34:56 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-12 00:34:56 config.py:1021] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 11-12 00:35:00 config.py:905] Defaulting to use mp for distributed inference
WARNING 11-12 00:35:00 arg_utils.py:957] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
WARNING 11-12 00:35:00 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-12 00:35:00 config.py:1021] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 11-12 00:35:00 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/data/models--meta-llama--llama-3.1-70b-instruct', speculative_config=None, tokenizer='/data/models--meta-llama--llama-3.1-70b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.DUMMY, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models--meta-llama--llama-3.1-70b-instruct, num_scheduler_steps=1, chunked_prefill_enabled=True multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 11-12 00:35:00 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-12 00:35:00 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=148) INFO 11-12 00:35:00 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=150) INFO 11-12 00:35:00 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=149) INFO 11-12 00:35:00 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=149) INFO 11-12 00:35:02 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=149) INFO 11-12 00:35:02 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=148) INFO 11-12 00:35:02 utils.py:1008] Found nccl from library libnccl.so.2
INFO 11-12 00:35:02 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=148) INFO 11-12 00:35:02 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 11-12 00:35:02 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=150) INFO 11-12 00:35:02 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=150) INFO 11-12 00:35:02 pynccl.py:63] vLLM is using nccl==2.20.5

Nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:38:00.0 Off |                    0 |
| N/A   48C    P0             31W /   72W |     382MiB /  23034MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      Off |   00000000:3A:00.0 Off |                    0 |
| N/A   47C    P0             30W /   72W |     386MiB /  23034MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L4                      Off |   00000000:3C:00.0 Off |                    0 |
| N/A   47C    P0             30W /   72W |     386MiB /  23034MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L4                      Off |   00000000:3E:00.0 Off |                    0 |
| N/A   46C    P0             30W /   72W |     366MiB /  23034MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Ctrl+C output:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 190, in build_async_engine_client_from_engine_args
    await mp_engine_client.setup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 231, in setup
    response = await self._wait_for_server_rpc(socket)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 334, in _wait_for_server_rpc
    return await self._send_get_data_rpc_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 265, in _send_get_data_rpc_request
    if await socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/loop.pyx", line 476, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 206, in build_async_engine_client_from_engine_args
    engine_process.join(4)
  File "/usr/lib/python3.12/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/popen_fork.py", line 40, in wait
    if not wait([self.sentinel], timeout):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 1136, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
KeyboardInterrupt

youkaichao commented 3 weeks ago

> I am using bitsandbytes quant

I don't see this information in the output.

And the GPU utilization is 100%, so it is doing something there.

If you really want to figure it out, see https://docs.vllm.ai/en/latest/getting_started/debugging.html#enable-more-logging and set `export VLLM_TRACE_FUNCTION=1`. You will see which functions are taking time from the trace log file.

SMAntony commented 3 weeks ago

I think we cannot initialize dummy weights with bitsandbytes, because I got this error:

ValueError: BitsAndBytes quantization and QLoRA adapter only support 'bitsandbytes' load format, but got dummy

SMAntony commented 3 weeks ago

With regard to the GPU volatile utilization being 100%: yes, I did see that, but even when given time it just stays like that without any progress in loading the model.

SMAntony commented 3 weeks ago

> I am using bitsandbytes quant
>
> I don't see this information in the output.
>
> And the GPU utilization is 100%, so it is doing something there.
>
> If you really want to figure it out, see https://docs.vllm.ai/en/latest/getting_started/debugging.html#enable-more-logging and set `export VLLM_TRACE_FUNCTION=1`. You will see which functions are taking time from the trace log file.

I have set all the debug environment variables from the debugging guide, but I still get the same traceback as above.

Ctrl+C output after setting the debug vars:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 190, in build_async_engine_client_from_engine_args
    await mp_engine_client.setup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 231, in setup
    response = await self._wait_for_server_rpc(socket)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 334, in _wait_for_server_rpc
    return await self._send_get_data_rpc_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 265, in _send_get_data_rpc_request
    if await socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/loop.pyx", line 476, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 206, in build_async_engine_client_from_engine_args
    engine_process.join(4)
  File "/usr/lib/python3.12/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/popen_fork.py", line 40, in wait
    if not wait([self.sentinel], timeout):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 1136, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
KeyboardInterrupt

SMAntony commented 3 weeks ago

Hey, I think I got something. I ran the test script provided in the debugging guide. Here's the output; does this tell you anything?

[rank1]:[E1112 09:49:50.022718554 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank0]:[E1112 09:49:50.029035281 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
[rank1]:[E1112 09:49:50.032579059 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1112 09:49:50.032589219 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
ip-172-31-31-233:5801:5850 [1] NCCL INFO [Service thread] Connection closed by localRank 1
[rank3]:[E1112 09:49:50.093793947 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
[rank3]:[E1112 09:49:50.093943580 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/ubuntu/work/llm_dist/test.py", line 9, in <module>
[rank1]:     assert value == dist.get_world_size()
[rank1]: AssertionError
[rank2]:[E1112 09:49:50.117030933 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
[rank2]:[E1112 09:49:50.117165526 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
ip-172-31-31-233:5800:5848 [0] NCCL INFO [Service thread] Connection closed by localRank 0
ip-172-31-31-233:5800:5833 [0] NCCL INFO comm 0x5fa13981ee20 rank 0 nranks 4 cudaDev 0 busId 38000 - Abort COMPLETE
[rank0]:[E1112 09:49:50.173368240 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1112 09:49:50.173388880 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1112 09:49:50.173394501 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/work/llm_dist/test.py", line 9, in <module>
[rank0]:     assert value == dist.get_world_size()
[rank0]: AssertionError
[rank0]:[E1112 09:49:50.178292192 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cee1f777f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7cedd0dc88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cedd0dcf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7cedd0dd16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7cee1e8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x7cee2069ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7cee20729c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cee1f777f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7cedd0dc88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cedd0dcf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7cedd0dd16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7cee1e8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x7cee2069ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7cee20729c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cee1f777f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7cedd0a5aa84 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xecdb4 (0x7cee1e8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9ca94 (0x7cee2069ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7cee20729c3c in /lib/x86_64-linux-gnu/libc.so.6)

ip-172-31-31-233:5803:5849 [3] NCCL INFO [Service thread] Connection closed by localRank 3
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/ubuntu/work/llm_dist/test.py", line 9, in <module>
[rank3]:     assert value == dist.get_world_size()
[rank3]: AssertionError
ip-172-31-31-233:5802:5847 [2] NCCL INFO [Service thread] Connection closed by localRank 2
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/ubuntu/work/llm_dist/test.py", line 9, in <module>
[rank2]:     assert value == dist.get_world_size()
[rank2]: AssertionError
ip-172-31-31-233:5801:5829 [0] NCCL INFO comm 0x5e65c4561d00 rank 1 nranks 4 cudaDev 1 busId 3a000 - Abort COMPLETE
[rank1]:[E1112 09:49:50.605962855 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1112 09:49:50.605984516 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1112 09:49:50.605990606 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1112 09:49:50.607099134 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7add9a763f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7add4b9c88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7add4b9cf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7add4b9d16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7add994ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x7add9b29ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7add9b329c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7add9a763f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7add4b9c88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7add4b9cf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7add4b9d16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7add994ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x7add9b29ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7add9b329c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7add9a763f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7add4b65aa84 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xecdb4 (0x7add994ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9ca94 (0x7add9b29ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7add9b329c3c in /lib/x86_64-linux-gnu/libc.so.6)

ip-172-31-31-233:5803:5827 [0] NCCL INFO comm 0x5b3a1f5a7060 rank 3 nranks 4 cudaDev 3 busId 3e000 - Abort COMPLETE
[rank3]:[E1112 09:49:50.743894047 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1112 09:49:50.743921998 ProcessGroupNCCL.cpp:621] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1112 09:49:50.743936648 ProcessGroupNCCL.cpp:627] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1112 09:49:50.745029175 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70371885ef86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7036c99c88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7036c99cf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7036c99d16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7037174ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x70371949ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x703719529c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70371885ef86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7036c99c88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7036c99cf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7036c99d16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7037174ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x70371949ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x703719529c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70371885ef86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7036c965aa84 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xecdb4 (0x7037174ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9ca94 (0x70371949ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x703719529c3c in /lib/x86_64-linux-gnu/libc.so.6)

ip-172-31-31-233:5802:5831 [0] NCCL INFO comm 0x62611899bcc0 rank 2 nranks 4 cudaDev 2 busId 3c000 - Abort COMPLETE
[rank2]:[E1112 09:49:50.782271879 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1112 09:49:50.782294969 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1112 09:49:50.782301189 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1112 09:49:50.783340025 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79482cb53f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7947dddc88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7947dddcf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7947dddd16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x79482b8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x79482d69ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x79482d729c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79482cb53f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7947dddc88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7947dddcf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7947dddd16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x79482b8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x79482d69ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x79482d729c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79482cb53f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7947dda5aa84 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xecdb4 (0x79482b8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9ca94 (0x79482d69ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x79482d729c3c in /lib/x86_64-linux-gnu/libc.so.6)

W1112 09:49:51.606000 128040271825792 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5801 closing signal SIGTERM
W1112 09:49:51.606000 128040271825792 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5802 closing signal SIGTERM
W1112 09:49:51.607000 128040271825792 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5803 closing signal SIGTERM
E1112 09:49:54.791000 128040271825792 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 5800) of binary: /home/ubuntu/work/llm_dist/.venv/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/work/llm_dist/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
test.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-12_09:49:51
  host      : ip-172-31-31-233.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 5800)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 5800
=====================================================

SMAntony commented 3 weeks ago

Hi @youkaichao and @DarkLight1337, it works now after reinstalling CUDA and the NVIDIA drivers with the latest versions. I also reset vLLM. Thank you for your help and guidance.

I was able to solve it using the test.py provided with the debugging docs.
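
For reference, below is a minimal sketch of the kind of multi-GPU NCCL all-reduce sanity check that the test.py in the debugging docs performs, reconstructed from the `assert value == dist.get_world_size()` and `ALLREDUCE(NumelIn=128)` entries in the log above. It is only an approximation; the authoritative script is in the vLLM debugging guide (https://docs.vllm.ai/en/latest/getting_started/debugging.html). If this check hangs or the assertion fails, the problem lies in the GPU driver / NCCL setup rather than in vLLM itself, which is consistent with the driver reinstall fixing the issue.

```python
# sanity_check.py -- minimal sketch of a multi-GPU NCCL all-reduce check
# (an approximation of the test.py from the vLLM debugging docs)
# Run with: torchrun --nproc_per_node=4 sanity_check.py
import torch
import torch.distributed as dist

# torchrun sets RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT for us
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# all-reduce a small tensor of ones across all ranks
data = torch.ones(128, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()

# after summing ones from every rank, each element equals the world size
value = data.mean().item()
assert value == dist.get_world_size(), f"expected {dist.get_world_size()}, got {value}"

print(f"rank {dist.get_rank()}: NCCL all-reduce OK")
dist.destroy_process_group()
```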

youkaichao commented 3 weeks ago

glad it works.