neuralmagic / nm-vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://nm-vllm.readthedocs.io

[Bug]: When running repo hello world: RuntimeError: CUDA error: an illegal instruction was encountered #187

Closed: remiconnesson closed this issue 6 months ago

remiconnesson commented 7 months ago

Your current environment


Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-101-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 PCIe
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9334 32-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 1
Core(s) per socket:                 24
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           5399.98
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm arch_capabilities
Virtualization:                     AMD-V
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1.5 MiB (24 instances)
L1i cache:                          1.5 MiB (24 instances)
L2 cache:                           12 MiB (24 instances)
L3 cache:                           384 MiB (24 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCm Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.2.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-23    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Running the repo's hello world example, I encountered the following error:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/OpenHermes-2.5-Mistral-7B-marlin"
model = LLM(model_id, max_model_len=4096)  # crashes here: the traceback below points at this line
tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)

messages = [
    {"role": "user", "content": "What is synthetic data in machine learning?"},
]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

The script fails during LLM() construction with the following output:

INFO 04-13 12:07:28 config.py:217] The model is serialized in Marlin format. Using Marlin kernel.
INFO 04-13 12:07:28 llm_engine.py:74] Initializing an LLM engine (v0.2.0) with config: model='neuralmagic/OpenHermes-2.5-Mistral-7B-marlin', tokenizer='neuralmagic/OpenHermes-2.5-Mistral-7B-marlin', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=marlin, sparsity=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-13 12:07:29 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-13 12:07:29 selector.py:25] Using XFormers backend.
INFO 04-13 12:07:30 weight_utils.py:192] Using model weights format ['*.safetensors']
INFO 04-13 12:07:31 model_runner.py:106] Loading model weights took 3.8582 GB
INFO 04-13 12:07:32 gpu_executor.py:94] # GPU blocks: 34032, # CPU blocks: 2048
INFO 04-13 12:07:32 model_runner.py:793] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-13 12:07:32 model_runner.py:797] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Traceback (most recent call last):
  File "/home/ubuntu/experiment_00_marlin.py", line 5, in <module>
    model = LLM(model_id, max_model_len=4096)
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 121, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 198, in from_engine_args
    engine = cls(
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 112, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 40, in __init__
    self._init_cache()
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 107, in _init_cache
    self.driver_worker.warm_up_model()
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/worker/worker.py", line 167, in warm_up_model
    self.model_runner.capture_model(self.gpu_cache)
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 856, in capture_model
    graph_runner.capture(
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 917, in capture
    torch.cuda.synchronize()
  File "/home/ubuntu/marlin-pyvenv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 783, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
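
The engine log above already hints at a workaround: the crash happens inside capture_model() during CUDA graph capture, so enforcing eager mode should sidestep it. Here is a minimal sketch (same model, vLLM 0.2.0; this only avoids the crash site, it does not fix the underlying kernel bug):

import os

# Surface CUDA errors at the failing call, as the error message suggests.
# Must be set before CUDA is initialized to take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from vllm import LLM, SamplingParams

# enforce_eager=True skips the capture_model() step where the
# illegal-instruction error was raised.
model = LLM("neuralmagic/OpenHermes-2.5-Mistral-7B-marlin",
            max_model_len=4096,
            enforce_eager=True)

outputs = model.generate(
    "What is synthetic data in machine learning?",
    sampling_params=SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95),
)
print(outputs[0].outputs[0].text)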
robertgshaw2-neuralmagic commented 7 months ago

Thanks for reporting. We will look into the issue.

alexm-neuralmagic commented 6 months ago

@remiconnesson thanks for posting this issue. We were able to reproduce it on an H100 and found the cause: a problem in one of our PTX assembly snippets. The fix is pretty simple; here is the ongoing PR for it: https://github.com/vllm-project/vllm/pull/4218.

mgoin commented 6 months ago

Fix has landed, thanks for reporting!
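
For anyone who hit this on v0.2.0, a minimal sketch to verify the fix locally after upgrading to a build that includes the PR above (the exact release that first carries it is not stated in this thread, so check the release notes):

import vllm
from vllm import LLM

print(vllm.__version__)  # the crash above was observed on 0.2.0

# Re-run engine init without enforce_eager so that CUDA graph capture,
# the step that previously raised the illegal-instruction error, runs again.
model = LLM("neuralmagic/OpenHermes-2.5-Mistral-7B-marlin", max_model_len=4096)
print("CUDA graph capture completed without error")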