vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Unable to load Llama-3.1-70B-Instruct using either `vllm serve` or `vllm-openai` docker #10156

Closed: SMAntony closed this issue 3 weeks ago

SMAntony commented 3 weeks ago

Your current environment

The output of `python collect_env.py`

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04 LTS (x86_64)
GCC version: (Ubuntu 13.2.0-23ubuntu4) 13.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.10.15 (main, Nov 8 2024, 11:29:01) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-1018-aws-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.0.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G

Nvidia driver version: 555.42.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R32
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 0
BogoMIPS: 5599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 768 KiB (24 instances)
L1i cache: 768 KiB (24 instances)
L2 cache: 12 MiB (24 instances)
L3 cache: 96 MiB (6 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-47            0               N/A
GPU1    PHB      X      PHB     PHB     0-47            0               N/A
GPU2    PHB     PHB      X      PHB     0-47            0               N/A
GPU3    PHB     PHB     PHB      X      0-47            0               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

LD_LIBRARY_PATH=/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/cv2/../../lib64:
CUDA_MODULE_LOADING=LAZY
```

Model Input Dumps

No response

🐛 Describe the bug

Tried to run the model using the vllm pip package (version 0.6.3.post1). Command that I ran:

vllm serve $model --host 0.0.0.0 --port 8010 --dtype bfloat16 --enforce-eager --gpu-memory-utilization 0.95 --api-key RnD12345 --max-model-len 4098 --tensor-parallel-size 4 --quantization bitsandbytes --load-format bitsandbytes 

Tried using the Docker image as well. `docker image inspect` output:

[
    {
        "Id": "sha256:9de570dfcdfca4effabe006779215326faf1f812c1d522652c5010801b8e6d78",
        "RepoTags": [
            "vllm/vllm-openai:latest"
        ],
        "RepoDigests": [
            "vllm/vllm-openai@sha256:facbbd4a92c1754675b239a5f22a281ed3aa8bde64662db8919d85d670673aa7"
        ],
        "Parent": "",
        "Comment": "buildkit.dockerfile.v0",
        "Created": "2024-10-17T10:54:46.210678506-07:00",
        "DockerVersion": "",
        "Author": "",
        "Config": {
            "Hostname": "",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "NVARCH=x86_64",
                "NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536",
                "NV_CUDA_CUDART_VERSION=12.4.127-1",
                "NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-4",
                "CUDA_VERSION=12.4.1",
                "LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
                "NVIDIA_VISIBLE_DEVICES=all",
                "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
                "DEBIAN_FRONTEND=noninteractive",
                "VLLM_USAGE_SOURCE=production-docker-image"
            ],
            "Cmd": null,
            "Image": "",
            "Volumes": null,
            "WorkingDir": "/vllm-workspace",
            "Entrypoint": [
                "python3",
                "-m",
                "vllm.entrypoints.openai.api_server"
            ],
            "OnBuild": null,
            "Labels": {
                "maintainer": "NVIDIA CORPORATION <cudatools@nvidia.com>",
                "org.opencontainers.image.ref.name": "ubuntu",
                "org.opencontainers.image.version": "22.04"
            }
        },
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 10433172732,
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/6434a00d5febd5b0e5c5f649ba730d6eecb98569a1ca91d070b6d91e76413c74/diff:/var/lib/docker/overlay2/dd0e5c6c529ddf94791c0f64d2ad36ec5d38cfc0a20062e11b12bd532d7c8d1c/diff:/var/lib/docker/overlay2/59dc2d24ef6995a6d7cd14e5d48cd0f8794e011f90ad682edbd71e7167a7ef96/diff:/var/lib/docker/overlay2/4d3d040ed53719f59f0c9dd300e4e05cc7f71dc19fb62468b5838174533909d0/diff:/var/lib/docker/overlay2/fc4f6caf8e0c462529b2209e60fac58cc1bbaa56a5dec19411f21d6b611fed10/diff:/var/lib/docker/overlay2/62d1f4e9911d8387c12c7386bb89aa91d689e5a5e7a50af3a1ae1a3ead2aac22/diff:/var/lib/docker/overlay2/0ea99b169b186ada6653ac6dda2617f17fa3da9112e6ee339937fff09ca69751/diff:/var/lib/docker/overlay2/8821136cefb26d4a59cc1bc75b3c7c57d2c65d5415e6f9e52994098a92969757/diff:/var/lib/docker/overlay2/3b814768ba4a7e39ee721e9c8365a82843ae46575a18d7ee6ce416d7f6c27784/diff:/var/lib/docker/overlay2/577f5d8f7131ef41ab3edc0d0be71053be04108b59ddcde1d768801bc928d3f0/diff:/var/lib/docker/overlay2/70edcf83f3b2e679e35246cc963136beddb591a7fecf966a8ff386aca74baf78/diff:/var/lib/docker/overlay2/3d8c594fc1d94e07c114d452c861daf69920d0f6c03c0988fb4c20254811ff27/diff",
                "MergedDir": "/var/lib/docker/overlay2/cd24f65cdf9dad054219845152d5e2490a94ca494c3f65abace5358179f813f1/merged",
                "UpperDir": "/var/lib/docker/overlay2/cd24f65cdf9dad054219845152d5e2490a94ca494c3f65abace5358179f813f1/diff",
                "WorkDir": "/var/lib/docker/overlay2/cd24f65cdf9dad054219845152d5e2490a94ca494c3f65abace5358179f813f1/work"
            },
            "Name": "overlay2"
        },
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:e0a9f5911802534ba097660206feabeb0247a81e409029167b30e2e1f2803b57",
                "sha256:47654eeadbc543f1dd44ebb41c6ca0954b9f1813efabdcd46b0e7f17ac4e9fd1",
                "sha256:efe2b79b53de08e199a2ce107d83adc0cbdff94f605e03f5ef51f3df3ae31cfd",
                "sha256:46d54736d31f4bfeb749544e22e8611dba11128bdb2cebb820a3b452b50e7d52",
                "sha256:809d3bb9c80fb3d31d4c061ba0b38ba4e83b6329e33c2cb2bbf27251a8e527c6",
                "sha256:c1773f613c662964cd11fa206779f38c92fb00e3a835e7931070285cfd68aadd",
                "sha256:e37cbea0d17fcd8906d4136f2eab993b6eef803eb8141933f596e7b6d78c6636",
                "sha256:e819f72bf6d3ec7ed9a85b73f5b1af866fe7765b1d0a7a9908cbca1fde836038",
                "sha256:0f1574f66fa6175a23113c89ce6fb60f321c30c428ccccf2f4061c42375fa766",
                "sha256:387174fca07f20756620f4cac2d43d9bfd5eb1fb2fba3d3196e37d8302251d38",
                "sha256:ff45bc8aa7f980af223369d59806c1a27fd6032bd82734be46f860ad2e7f8325",
                "sha256:1ad2df56152854745a66ec7b3fb7547ab6f07013d11ec2d91da353cbddd3d8d0",
                "sha256:5749f2a270664189ce6f8f41e9909053b6f3fcf8e7c4a9ac156c68a188d69b02"
            ]
        },
        "Metadata": {
            "LastTagTime": "0001-01-01T00:00:00Z"
        }
    }
]

Docker command used:

docker run --name llm_dist --gpus all \
    -v $volume:/data \
    -p 8010:80 \
    --ipc=host \
    --runtime=nvidia \
    vllm/vllm-openai:latest \
    --model $model \
    --dtype bfloat16 \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --tensor-parallel-size 4

vLLM is stuck at loading in both cases. `nvidia-smi` output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1B.0 Off |                    0 |
|  0%   31C    P0             65W /  300W |     516MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    Off |   00000000:00:1C.0 Off |                    0 |
|  0%   31C    P0             63W /  300W |     524MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    Off |   00000000:00:1D.0 Off |                    0 |
|  0%   32C    P0             67W /  300W |     524MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   31C    P0             66W /  300W |     484MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

vLLM log:

WARNING 11-08 11:49:58 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-08 11:49:58 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=36253) INFO 11-08 11:50:03 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=36252) INFO 11-08 11:50:03 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=36254) INFO 11-08 11:50:03 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-08 11:50:04 utils.py:1008] Found nccl from library libnccl.so.2
INFO 11-08 11:50:04 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=36252) INFO 11-08 11:50:04 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=36253) INFO 11-08 11:50:04 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=36254) INFO 11-08 11:50:04 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=36252) INFO 11-08 11:50:04 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=36253) INFO 11-08 11:50:04 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=36254) INFO 11-08 11:50:04 pynccl.py:63] vLLM is using nccl==2.20.5


DarkLight1337 commented 3 weeks ago

Please go through the troubleshooting guide and see if it can help resolve your issue.

SMAntony commented 3 weeks ago

Has anyone tried loading a 70B model on AWS (g5.12xlarge)?

SMAntony commented 3 weeks ago

> Please go through the troubleshooting guide and see if it can help resolve your issue.

It does not help my situation sadly :(

DarkLight1337 commented 3 weeks ago

cc @youkaichao

youkaichao commented 3 weeks ago

The troubleshooting guide helps you collect more information, and you should provide that information so we can help with debugging.

Meanwhile, you are running a Llama 70B model (which requires at least 140 GB of GPU memory), but you only have 4 A10G GPUs (only 92 GB of memory in total).

SMAntony commented 3 weeks ago

> The troubleshooting guide helps you collect more information, and you should provide that information so we can help with debugging.
>
> Meanwhile, you are running a Llama 70B model (which requires at least 140 GB of GPU memory), but you only have 4 A10G GPUs (only 92 GB of memory in total).

I am using bitsandbytes quantization, so if I am not wrong, that should decrease the GPU memory requirement to around 70 GB, which is enough to fit the 70B model.
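
As a rough back-of-the-envelope check (an editor's sketch that counts only the model weights and ignores the KV cache, activations, and per-process CUDA overhead):

$$
\begin{aligned}
\text{bf16 weights:} &\quad 70 \times 10^{9}\ \text{params} \times 2\ \text{bytes} \approx 140\ \text{GB} \\
\text{8-bit quantized:} &\quad 70 \times 10^{9} \times 1\ \text{byte} \approx 70\ \text{GB} \\
\text{4-bit quantized:} &\quad 70 \times 10^{9} \times 0.5\ \text{byte} \approx 35\ \text{GB} \\
\text{available:} &\quad 4 \times {\sim}23\ \text{GB} \approx 92\ \text{GB}
\end{aligned}
$$

So roughly 70 GB of 8-bit weights would fit in 92 GB only with little headroom for the KV cache, while 4-bit weights leave considerably more room.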

SMAntony commented 3 weeks ago

(screenshot attached)

Unfortunately there is no workaround mentioned.

youkaichao commented 3 weeks ago

Hit Ctrl+C to see where the process is executing.

youkaichao commented 3 weeks ago

You can also add `--load-format dummy` to skip weight loading from disk, to isolate the root cause.

SMAntony commented 3 weeks ago

I see, I really appreciate your help. Will share results asap.

SMAntony commented 3 weeks ago

> You can also add `--load-format dummy` to skip weight loading from disk, to isolate the root cause.

Still stuck when loading dummy weights.

Output from the vLLM Docker container:

INFO 11-12 00:34:53 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 11-12 00:34:53 api_server.py:529] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/models--meta-llama--llama-3.1-70b-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='dummy', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 11-12 00:34:53 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/318c6aef-9ee2-40c5-aef6-8410a465f6f7 for IPC Path.
INFO 11-12 00:34:53 api_server.py:179] Started engine process with PID 36
INFO 11-12 00:34:56 config.py:905] Defaulting to use mp for distributed inference
WARNING 11-12 00:34:56 arg_utils.py:957] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
WARNING 11-12 00:34:56 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-12 00:34:56 config.py:1021] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 11-12 00:35:00 config.py:905] Defaulting to use mp for distributed inference
WARNING 11-12 00:35:00 arg_utils.py:957] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
WARNING 11-12 00:35:00 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-12 00:35:00 config.py:1021] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 11-12 00:35:00 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/data/models--meta-llama--llama-3.1-70b-instruct', speculative_config=None, tokenizer='/data/models--meta-llama--llama-3.1-70b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.DUMMY, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models--meta-llama--llama-3.1-70b-instruct, num_scheduler_steps=1, chunked_prefill_enabled=True multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 11-12 00:35:00 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-12 00:35:00 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=148) INFO 11-12 00:35:00 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=150) INFO 11-12 00:35:00 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=149) INFO 11-12 00:35:00 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=149) INFO 11-12 00:35:02 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=149) INFO 11-12 00:35:02 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=148) INFO 11-12 00:35:02 utils.py:1008] Found nccl from library libnccl.so.2
INFO 11-12 00:35:02 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=148) INFO 11-12 00:35:02 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 11-12 00:35:02 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=150) INFO 11-12 00:35:02 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=150) INFO 11-12 00:35:02 pynccl.py:63] vLLM is using nccl==2.20.5

Nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:38:00.0 Off |                    0 |
| N/A   48C    P0             31W /   72W |     382MiB /  23034MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      Off |   00000000:3A:00.0 Off |                    0 |
| N/A   47C    P0             30W /   72W |     386MiB /  23034MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L4                      Off |   00000000:3C:00.0 Off |                    0 |
| N/A   47C    P0             30W /   72W |     386MiB /  23034MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L4                      Off |   00000000:3E:00.0 Off |                    0 |
| N/A   46C    P0             30W /   72W |     366MiB /  23034MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Ctrl+C output:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 190, in build_async_engine_client_from_engine_args
    await mp_engine_client.setup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 231, in setup
    response = await self._wait_for_server_rpc(socket)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 334, in _wait_for_server_rpc
    return await self._send_get_data_rpc_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 265, in _send_get_data_rpc_request
    if await socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/loop.pyx", line 476, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 206, in build_async_engine_client_from_engine_args
    engine_process.join(4)
  File "/usr/lib/python3.12/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/popen_fork.py", line 40, in wait
    if not wait([self.sentinel], timeout):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 1136, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
KeyboardInterrupt

youkaichao commented 3 weeks ago

> I am using bitsandbytes quant

I don't see this information in the output.

And the GPU utilization is 100%, so it is doing something there.

If you really want to figure it out, see https://docs.vllm.ai/en/latest/getting_started/debugging.html#enable-more-logging and set `export VLLM_TRACE_FUNCTION=1`. You will see which functions are taking time from the trace log file.

SMAntony commented 3 weeks ago

I think we cannot initialize dummy weights with bitsandbytes, because I got this error:

ValueError: BitsAndBytes quantization and QLoRA adapter only support 'bitsandbytes' load format, but got dummy

SMAntony commented 3 weeks ago

With regard to the GPU volatile utilization being 100%: yes, I did see that, but even when given time it just stays like that without any progress in loading the model.

SMAntony commented 3 weeks ago

> I am using bitsandbytes quant
>
> I don't see this information in the output.
>
> And the GPU utilization is 100%, so it is doing something there.
>
> If you really want to figure it out, see https://docs.vllm.ai/en/latest/getting_started/debugging.html#enable-more-logging and set `export VLLM_TRACE_FUNCTION=1`. You will see which functions are taking time from the trace log file.

I have set all the debug environment variables from the debugging guide, but I still get the same traceback as above.

Ctrl+C output after setting the debug vars:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 190, in build_async_engine_client_from_engine_args
    await mp_engine_client.setup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 231, in setup
    response = await self._wait_for_server_rpc(socket)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 334, in _wait_for_server_rpc
    return await self._send_get_data_rpc_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 265, in _send_get_data_rpc_request
    if await socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/loop.pyx", line 476, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 206, in build_async_engine_client_from_engine_args
    engine_process.join(4)
  File "/usr/lib/python3.12/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/popen_fork.py", line 40, in wait
    if not wait([self.sentinel], timeout):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 1136, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
KeyboardInterrupt

SMAntony commented 3 weeks ago

Hey, I think I got something. I ran the test script provided in the debugging guide. Here's the output; does this tell you anything?

[rank1]:[E1112 09:49:50.022718554 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank0]:[E1112 09:49:50.029035281 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
[rank1]:[E1112 09:49:50.032579059 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1112 09:49:50.032589219 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
ip-172-31-31-233:5801:5850 [1] NCCL INFO [Service thread] Connection closed by localRank 1
[rank3]:[E1112 09:49:50.093793947 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
[rank3]:[E1112 09:49:50.093943580 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/ubuntu/work/llm_dist/test.py", line 9, in <module>
[rank1]:     assert value == dist.get_world_size()
[rank1]: AssertionError
[rank2]:[E1112 09:49:50.117030933 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
[rank2]:[E1112 09:49:50.117165526 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
ip-172-31-31-233:5800:5848 [0] NCCL INFO [Service thread] Connection closed by localRank 0
ip-172-31-31-233:5800:5833 [0] NCCL INFO comm 0x5fa13981ee20 rank 0 nranks 4 cudaDev 0 busId 38000 - Abort COMPLETE
[rank0]:[E1112 09:49:50.173368240 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1112 09:49:50.173388880 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1112 09:49:50.173394501 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/work/llm_dist/test.py", line 9, in <module>
[rank0]:     assert value == dist.get_world_size()
[rank0]: AssertionError
[rank0]:[E1112 09:49:50.178292192 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cee1f777f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7cedd0dc88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cedd0dcf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7cedd0dd16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7cee1e8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x7cee2069ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7cee20729c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cee1f777f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7cedd0dc88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cedd0dcf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7cedd0dd16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7cee1e8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x7cee2069ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7cee20729c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cee1f777f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7cedd0a5aa84 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xecdb4 (0x7cee1e8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9ca94 (0x7cee2069ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7cee20729c3c in /lib/x86_64-linux-gnu/libc.so.6)

ip-172-31-31-233:5803:5849 [3] NCCL INFO [Service thread] Connection closed by localRank 3
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/ubuntu/work/llm_dist/test.py", line 9, in <module>
[rank3]:     assert value == dist.get_world_size()
[rank3]: AssertionError
ip-172-31-31-233:5802:5847 [2] NCCL INFO [Service thread] Connection closed by localRank 2
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/ubuntu/work/llm_dist/test.py", line 9, in <module>
[rank2]:     assert value == dist.get_world_size()
[rank2]: AssertionError
ip-172-31-31-233:5801:5829 [0] NCCL INFO comm 0x5e65c4561d00 rank 1 nranks 4 cudaDev 1 busId 3a000 - Abort COMPLETE
[rank1]:[E1112 09:49:50.605962855 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1112 09:49:50.605984516 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1112 09:49:50.605990606 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1112 09:49:50.607099134 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7add9a763f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7add4b9c88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7add4b9cf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7add4b9d16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7add994ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x7add9b29ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7add9b329c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7add9a763f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7add4b9c88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7add4b9cf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7add4b9d16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7add994ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x7add9b29ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7add9b329c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7add9a763f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7add4b65aa84 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xecdb4 (0x7add994ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9ca94 (0x7add9b29ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7add9b329c3c in /lib/x86_64-linux-gnu/libc.so.6)

ip-172-31-31-233:5803:5827 [0] NCCL INFO comm 0x5b3a1f5a7060 rank 3 nranks 4 cudaDev 3 busId 3e000 - Abort COMPLETE
[rank3]:[E1112 09:49:50.743894047 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1112 09:49:50.743921998 ProcessGroupNCCL.cpp:621] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1112 09:49:50.743936648 ProcessGroupNCCL.cpp:627] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1112 09:49:50.745029175 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70371885ef86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7036c99c88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7036c99cf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7036c99d16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7037174ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x70371949ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x703719529c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70371885ef86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7036c99c88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7036c99cf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7036c99d16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7037174ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x70371949ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x703719529c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70371885ef86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7036c965aa84 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xecdb4 (0x7037174ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9ca94 (0x70371949ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x703719529c3c in /lib/x86_64-linux-gnu/libc.so.6)

ip-172-31-31-233:5802:5831 [0] NCCL INFO comm 0x62611899bcc0 rank 2 nranks 4 cudaDev 2 busId 3c000 - Abort COMPLETE
[rank2]:[E1112 09:49:50.782271879 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1112 09:49:50.782294969 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1112 09:49:50.782301189 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1112 09:49:50.783340025 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79482cb53f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7947dddc88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7947dddcf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7947dddd16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x79482b8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x79482d69ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x79482d729c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79482cb53f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7947dddc88d2 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7947dddcf313 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7947dddd16fc in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x79482b8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9ca94 (0x79482d69ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x79482d729c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79482cb53f86 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7947dda5aa84 in /home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xecdb4 (0x79482b8ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9ca94 (0x79482d69ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x79482d729c3c in /lib/x86_64-linux-gnu/libc.so.6)

W1112 09:49:51.606000 128040271825792 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5801 closing signal SIGTERM
W1112 09:49:51.606000 128040271825792 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5802 closing signal SIGTERM
W1112 09:49:51.607000 128040271825792 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5803 closing signal SIGTERM
E1112 09:49:54.791000 128040271825792 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 5800) of binary: /home/ubuntu/work/llm_dist/.venv/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/work/llm_dist/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/work/llm_dist/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
test.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-12_09:49:51
  host      : ip-172-31-31-233.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 5800)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 5800
=====================================================

SMAntony commented 3 weeks ago

Hi @youkaichao and @DarkLight1337, it works now after reinstalling CUDA and the NVIDIA drivers with the latest versions. I also reset vLLM. Thank you for your help and guidance.

I was able to solve it using the test.py provided with the debugging docs.
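
For reference, below is a minimal sketch of the kind of multi-GPU NCCL all-reduce sanity check that the test.py in the debugging docs performs, reconstructed from the `assert value == dist.get_world_size()` and `ALLREDUCE(NumelIn=128)` entries in the log above. It is only an approximation; the authoritative script is in the vLLM debugging guide (https://docs.vllm.ai/en/latest/getting_started/debugging.html). If this check hangs or the assertion fails, the problem lies in the GPU driver / NCCL setup rather than in vLLM itself, which is consistent with the driver reinstall fixing the issue.

```python
# sanity_check.py -- minimal sketch of a multi-GPU NCCL all-reduce check
# (an approximation of the test.py from the vLLM debugging docs)
# Run with: torchrun --nproc_per_node=4 sanity_check.py
import torch
import torch.distributed as dist

# torchrun sets RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT for us
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# all-reduce a small tensor of ones across all ranks
data = torch.ones(128, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()

# after summing ones from every rank, each element equals the world size
value = data.mean().item()
assert value == dist.get_world_size(), f"expected {dist.get_world_size()}, got {value}"

print(f"rank {dist.get_rank()}: NCCL all-reduce OK")
dist.destroy_process_group()
```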

youkaichao commented 3 weeks ago

glad it works.