vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Getting subprocess.CalledProcessError: Command '['/usr/bin/gcc',] error message. #3909

Open yk287 opened 5 months ago

yk287 commented 5 months ago

Your current environment

PyTorch version: 2.0.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.31

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.0-27-cloud-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             96
On-line CPU(s) list:                0-95
Thread(s) per core:                 2
Core(s) per socket:                 24
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              85
Model name:                         Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:                           7
CPU MHz:                            2200.186
BogoMIPS:                           4400.37
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1.5 MiB
L1i cache:                          1.5 MiB
L2 cache:                           48 MiB
L3 cache:                           77 MiB
NUMA node0 CPU(s):                  0-23,48-71
NUMA node1 CPU(s):                  24-47,72-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT Host state unknown
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] torch==2.0.0+cu118
[pip3] torch-xla==2.0
[pip3] torchvision==0.15.1+cu118
[pip3] triton==2.0.0
[conda] dlenv-pytorch-2-0-gpu     1.0.20240112    py310h99bccc4_0    file:///tmp/conda-pkgs
[conda] numpy                     1.25.2                   pypi_0    pypi
[conda] torch                     2.0.0+cu118              pypi_0    pypi
[conda] torch-xla                 2.0                      pypi_0    pypi
[conda] torchvision               0.15.1+cu118             pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-23,48-71      0
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-23,48-71      0
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-23,48-71      0
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-23,48-71      0
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    24-47,72-95     1
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    24-47,72-95     1
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    24-47,72-95     1
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      24-47,72-95     1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I'm trying to serve Mixtral-8x7B in my environment using the following command:

export MODEL=mistralai/Mixtral-8x7B-Instruct-v0.1
export PATH=$PATH:/sbin

python -m vllm.entrypoints.openai.api_server \
    --model $MODEL \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --served-model-name localModel

I get the following error message:

subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/var/tmp/tmpl_nuqjvz/main.c', '-O3', '-I/home/y0k007i/inference/.venv/lib/python3.9/site-packages/triton/common/../third_party/cuda/include', '-I/usr/include/python3.9', '-I/var/tmp/tmpl_nuqjvz', '-shared', '-fPIC', '-lcuda', '-o', '/var/tmp/tmpl_nuqjvz/fused_moe_kernel.cpython-39-x86_64-linux-gnu.so', '-L/lib/x86_64-linux-gnu', '-L/lib/x86_64-linux-gnu']' returned non-zero exit status 1. [W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Can someone tell me how I might resolve this issue?

Thanks!
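The `CalledProcessError` only reports the failing command, not gcc's stderr, so the real compiler error is easy to miss. The sketch below is a hypothetical diagnostic (not part of vLLM or Triton; `probe_triton_build` is a made-up name) that imitates the kind of stub compile Triton performs and surfaces the compiler's actual message:

```python
import os
import shutil
import subprocess
import sysconfig
import tempfile


def probe_triton_build() -> str:
    """Compile a tiny C file that includes Python.h, similar to the
    launcher stubs Triton builds. Returns gcc's stderr ('' on success)."""
    if shutil.which("gcc") is None:
        return "gcc not found on PATH"
    src = os.path.join(tempfile.mkdtemp(), "probe.c")
    with open(src, "w") as f:
        f.write("#include <Python.h>\nint main(void) { return 0; }\n")
    # Where the current interpreter expects its C headers to live.
    include_dir = sysconfig.get_paths()["include"]
    result = subprocess.run(
        ["gcc", src, "-I" + include_dir, "-c", "-o", os.devnull],
        capture_output=True,
        text=True,
    )
    return result.stderr


if __name__ == "__main__":
    print(probe_triton_build() or "probe compile succeeded")
```

If this prints something like `fatal error: Python.h: No such file or directory`, the failure is in the build environment rather than in vLLM itself.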

hmellor commented 5 months ago

Can you share some more of the error? It's not clear where it came from based on the snippet you provided.

a5723797 commented 1 month ago

I encountered the same problem. vLLM runs without problems with this command:

vllm serve neuralmagic/Mistral-Nemo-Instruct-2407-FP8 --api-key ubestream --max-model-len 1024 --gpu-memory-utilization 0.9 --max-num-seqs 64 --max-num-batched-tokens 8192 --block-size 16 --enable-prefix-caching --enforce-eager

However, after changing --block-size to 8, the same error occurred:

vllm serve neuralmagic/Mistral-Nemo-Instruct-2407-FP8 --api-key ubestream --max-model-len 1024 --gpu-memory-utilization 0.9 --max-num-seqs 64 --max-num-batched-tokens 8192 --block-size 8 --enable-prefix-caching --enforce-eager

Error: subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp91qfilw7/main.c', '-O3', '-I/home/server9/venv_vllm/lib/python3.10/site-packages/triton/common/../third_party/cuda/include', '-I/usr/include/python3.10', '-I/tmp/tmp91qfilw7', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmp91qfilw7/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-L/lib/x86_64-linux-gnu', '-L/lib/x86_64-linux-gnu']' returned non-zero exit status 1.

And the error cause was:

/tmp/tmp91qfilw7/main.c:5:10: fatal error: Python.h: No such file or directory
    5 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.

Installing the python3-dev package solved the problem in my case.
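As a quick check before installing anything, the snippet below (an illustrative sketch, not part of vLLM) prints where the current interpreter expects `Python.h` and whether the file actually exists. On Debian/Ubuntu the header is provided by python3-dev (or python3.X-dev for a specific minor version):

```python
import os
import sysconfig

# Triton compiles its launcher stubs against the CPython headers, so a
# missing Python.h at this path reproduces the gcc failure above.
include_dir = sysconfig.get_paths()["include"]
print("expected header dir:", include_dir)
print("Python.h present:", os.path.exists(os.path.join(include_dir, "Python.h")))
```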

FYI