vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: RuntimeError: GET was unable to find an engine to execute this computation for llava-next model #6713

Open fdas3213 opened 1 month ago

fdas3213 commented 1 month ago

Your current environment

The output of `python collect_env.py`

Collecting environment information...
PyTorch version: 2.3.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: CBL-Mariner/Linux (x86_64)
GCC version: (GCC) 11.2.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35

Python version: 3.10.2 (main, Feb 22 2024, 00:00:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.138.1-4.cm2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB

Nvidia driver version: 525.85.12
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.9.5
/usr/lib/libcudnn_adv_infer.so.8.9.5
/usr/lib/libcudnn_adv_train.so.8.9.5
/usr/lib/libcudnn_cnn_infer.so.8.9.5
/usr/lib/libcudnn_cnn_train.so.8.9.5
/usr/lib/libcudnn_ops_infer.so.8.9.5
/usr/lib/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             256
On-line CPU(s) list:                0-255
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7763 64-Core Processor
CPU family:                         25
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 64
Socket(s):                          2
Stepping:                           1
Frequency boost:                    enabled
CPU max MHz:                        3529.0520
CPU min MHz:                        1500.0000
BogoMIPS:                           4900.05
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Virtualization:                     AMD-V
L1d cache:                          4 MiB (128 instances)
L1i cache:                          4 MiB (128 instances)
L2 cache:                           64 MiB (128 instances)
L3 cache:                           512 MiB (16 instances)
NUMA node(s):                       8
NUMA node0 CPU(s):                  0-15,128-143
NUMA node1 CPU(s):                  16-31,144-159
NUMA node2 CPU(s):                  32-47,160-175
NUMA node3 CPU(s):                  48-63,176-191
NUMA node4 CPU(s):                  64-79,192-207
NUMA node5 CPU(s):                  80-95,208-223
NUMA node6 CPU(s):                  96-111,224-239
NUMA node7 CPU(s):                  112-127,240-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] flake8==4.0.1.1
[pip3] flake8-annotations-complexity==0.0.6.2
[pip3] flake8-bugbear==20.1.4
[pip3] flake8-builtins==1.4.2
[pip3] flake8-pie==0.5.0.1
[pip3] isolation-forest-onnx==0.1.4
[pip3] mypy==1.7.1
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.24.3
[pip3] nvidia-nccl-cu11==2.20.5
[pip3] oldest-supported-numpy==2022.5.28
[pip3] onnx==1.13.0
[pip3] onnxruntime==1.14.0
[pip3] torch==2.3.1+cu118
[pip3] torchvision==0.18.1+cu118
[pip3] transformers==4.42.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X  NV12    16-31,144-159   1
GPU1    NV12     X  80-95,208-223   5

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I was trying to run basic multi-modal inference with llava-v1.6 using the code below:

import os
import time

from vllm import LLM, SamplingParams

def run_vllm_inference(image, prompt):
    # Load LLaVA-NeXT from a local path and cap the context length at 3000 tokens
    llm = LLM(model=os.path.join(model_local_path, "llava-v1.6-mistral-7b-hf"),
              max_model_len=3000)
    sampling_params = SamplingParams(temperature=0.8,
                                     top_p=0.95,
                                     max_tokens=80)
    outputs = llm.generate(
        {
            "prompt": prompt,
            "multi_modal_data": {
                "image": image
            }
        },
        sampling_params=sampling_params
    )
    generated_text = ""
    for o in outputs:
        generated_text += o.outputs[0].text
    return generated_text

prompt_questions = [
    "[INST] <image>\nWhat does the text in the image say? [/INST]",
    "[INST] <image>\nWhat is the language of text in the image? [/INST]",
    "[INST] <image>\nGenerate a short description of the image [/INST]"
]

data = []
start=time.time()

for step, image in enumerate(sampled_images):
    answers = {}

    for raw_prompt in prompt_questions:
        prompt = " ".join(raw_prompt.split("\n"))
        response = run_vllm_inference(image, prompt)
        answers[prompt] = answers.get(prompt, response)
    data.append(answers)  

timediff=time.time()-start
print(f"When inferencing using vLLM on two gpus, image/sec: {timediff/n_samples_to_generate}, time elapsed: {timediff}")

However, I am hitting the error below

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[30], [line 14]
     [13] prompt = " ".join(raw_prompt.split("\n"))
---> [14]response = run_vllm_inference(image, prompt)
     [15] answers[prompt] = answers.get(prompt, response)

Cell In[21], [line 2]
      [1] def run_vllm_inference(image, prompt):
----> [2]    llm = LLM(model=os.path.join(model_local_path, "llava-v1.6-mistral-7b-hf"),  max_model_len=3000
     [3]         sampling_params = SamplingParams(temperature=0.8,
      [4]                       top_p=0.95,
      [5]                      max_tokens=80)
      [6]     outputs = llm.generate(
      [7]        {
      [8]           "prompt": prompt,
   (...)
     [13]        sampling_params=sampling_params
     [14]

File ~/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py:150, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    [128](~/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py:128)     raise TypeError(
    [129](~/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py:129)         "There is no need to pass vision-related arguments anymore.")
...
    [455](~/.venv/lib/python3.10/site-packages/torch/nn/modules/conv.py:455)                     _pair(0), self.dilation, self.groups)
--> [456](~/.venv/lib/python3.10/site-packages/torch/nn/modules/conv.py:456) return F.conv2d(input, weight, bias, self.stride,
    [457](~/.venv/lib/python3.10/site-packages/torch/nn/modules/conv.py:457)                 self.padding, self.dilation, self.groups)

RuntimeError: GET was unable to find an engine to execute this computation

I had no issue when running inference with Hugging Face:

# batch inference across multiple GPUs
import os
import time

from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model = LlavaNextForConditionalGeneration.from_pretrained(
    os.path.join(model_local_path, "llava-v1.6-mistral-7b-hf"),
    quantization_config=quantization_config,
    device_map="auto")
model.config.pad_token_id = model.config.eos_token_id
processor = LlavaNextProcessor.from_pretrained(os.path.join(model_local_path, "llava-v1.6-mistral-7b-hf"))

def run_inference(prompt, image, model, processor):
    inputs = processor(prompt, image, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=80)
    # keep only the text after the [/INST] tag
    response = processor.batch_decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0].split("[/INST]")[-1].strip()
    return response

data = []
start=time.time()

for step, image in enumerate(sampled_images):
    answers = {}

    for raw_prompt in prompt_questions:
        prompt = " ".join(raw_prompt.split("\n"))
        response = run_inference(raw_prompt, image, model, processor)
        answers[prompt] = answers.get(prompt, response)
    data.append(answers)  

timediff=time.time()-start
DarkLight1337 commented 1 month ago

Can you double-check that the CUDA version of pytorch matches that on your machine? There might be some incompatibilities that only surface when F.conv2d is called with half-precision inputs (IIRC HuggingFace uses float32 by default).
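
For example, a quick check along these lines (a minimal sketch, assuming PyTorch is importable) shows which CUDA/cuDNN build PyTorch was compiled against, which you can compare with `nvcc --version`:

# Compare PyTorch's CUDA/cuDNN build against the system toolkit.
import torch

print(torch.__version__)               # e.g. 2.3.1+cu118
print(torch.version.cuda)              # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())  # cuDNN version PyTorch links against
print(torch.cuda.is_available())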

fdas3213 commented 1 month ago

@DarkLight1337 thanks for checking. The CUDA version of PyTorch is 11.8, which matches `nvcc --version`:


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
DarkLight1337 commented 1 month ago

Can you run the model in normal precision?

fdas3213 commented 1 month ago

@DarkLight1337 could you provide an example of how to run the model in normal precision? I just loaded the model and ran inference with the default setup, so I'm not sure where to specify the precision.

DarkLight1337 commented 1 month ago

You can use the --dtype argument as described here.
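
For instance (an illustrative sketch reusing the paths from your snippet above):

import os
from vllm import LLM

# dtype="float32" forces full precision instead of the default half precision
llm = LLM(model=os.path.join(model_local_path, "llava-v1.6-mistral-7b-hf"),
          max_model_len=3000,
          dtype="float32")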

fdas3213 commented 1 month ago

Thanks @DarkLight1337, and apologies for the late response. Specifying float or float32 gives another error.

DarkLight1337 commented 1 month ago

Thanks @DarkLight1337, and apologies for the late response. Specifying float or float32 gives another error.

Could you elaborate?

fdas3213 commented 1 month ago

Apologies for missing the error log. It is a kernel crash error:

21:10:02.243 [error] Disposing session as kernel process died ExitCode: undefined, Reason: 2024-08-01 21:06:40.427338: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
DarkLight1337 commented 1 month ago

@youkaichao any thoughts?

youkaichao commented 1 month ago

this might be relevant: https://discuss.pytorch.org/t/runtimeerror-get-was-unable-to-find-an-engine-to-execute-this-computation/193625/4

and also https://stackoverflow.com/a/76873442/9191338

and this https://github.com/ultralytics/ultralytics/issues/4060#issuecomment-1659826789
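
If it helps to narrow things down, a standalone half-precision conv2d (a minimal repro sketch, independent of vLLM; the shapes are arbitrary) should hit the same error when the cuDNN/PyTorch combination is broken:

import torch
import torch.nn.functional as F

# A bare half-precision convolution on the GPU; if cuDNN cannot pick an
# engine for this, it raises the same "unable to find an engine" RuntimeError.
x = torch.randn(1, 3, 336, 336, device="cuda", dtype=torch.float16)
w = torch.randn(8, 3, 3, 3, device="cuda", dtype=torch.float16)
print(F.conv2d(x, w).shape)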