vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: enable_prefix_caching leads to persistent illegal memory access error #6833

Open · captify-sivakhno opened this issue 3 months ago

captify-sivakhno commented 3 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1064-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             32
On-line CPU(s) list:                0-31
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7R32
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           0
BogoMIPS:                           5599.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          512 KiB (16 instances)
L1i cache:                          512 KiB (16 instances)
L2 cache:                           8 MiB (16 instances)
L3 cache:                           64 MiB (4 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] optree==0.11.0
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.3.1
[pip3] torcheval==0.0.7
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  0-31    0       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

After running the code

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from outlines.integrations.vllm import RegexLogitsProcessor

import os
os.environ["HF_TOKEN"] = ""

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_prefix_caching=True)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

proc = RegexLogitsProcessor(r'yes|no', llm)
sampling_params = SamplingParams(temperature=0.6, top_p=0.15, max_tokens=1, logits_processors=[proc])

prompts = ["some long text up to the max model length / 20000 chars", "some long text up to the max model length / 20000 chars", ...]  # list of length 100 to 1000

formatted_prompts = []
for prompt in prompts:
    messages = [{"role": "user", "content": prompt}]
    formatted_prompts.append(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

output = llm.generate(formatted_prompts, sampling_params)

I get an error

RuntimeError: CUDA error: an illegal memory access was encountered

The error happens sporadically: sometimes the same command completes without error in the same environment with the same versions.

I have done the following investigations and can confirm:

I have seen quite a few different issues reported for enable_prefix_caching; could anyone comment on whether the feature has actually worked for them? In our use case 80-90% of each prompt is repetitive, so prefix caching provides a dramatic speed-up. I would be grateful for any suggestions!
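Since the crash is intermittent and surfaces at a sampling call that may be far from the faulting kernel, a minimal debugging sketch like the one below can help localize it. It only combines the environment switches already visible in the traceback below (`CUDA_LAUNCH_BLOCKING`, `VLLM_TRACE_FUNCTION`) with the original repro; the `TORCH_USE_CUDA_DSA` flag mentioned in the error message is a compile-time option and would require rebuilding PyTorch, so it is not set here. None of this is a confirmed fix.

```python
# Hedged debugging sketch, not a fix: force synchronous kernel launches and
# enable vLLM function tracing so the illegal memory access is reported closer
# to the kernel that actually faulted. Both variables must be set before the
# LLM (and its CUDA context) is created.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["VLLM_TRACE_FUNCTION"] = "TRACE"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.6, top_p=0.15, max_tokens=1)

formatted_prompts = ["..."]  # placeholder: build these exactly as in the snippet above
output = llm.generate(formatted_prompts, sampling_params)
```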

Full error detail

```
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

File , line 6
      4 os.environ["VLLM_TRACE_FUNCTION"]="TRACE"
      5 os.environ["CUDA_LAUNCH_BLOCKING"]="1"
----> 6 output = llm.generate(formatted_prompts[300:1000], sampling_params)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/utils.py:838, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
    831         msg += f" {additional_message}"
    833     warnings.warn(
    834         DeprecationWarning(msg),
    835         stacklevel=3,  # The inner function takes up one level
    836     )
--> 838 return fn(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/entrypoints/llm.py:316, in LLM.generate(self, prompts, sampling_params, prompt_token_ids, use_tqdm, lora_request, prompt_adapter_request)
    308     sampling_params = SamplingParams()
    310 self._validate_and_add_requests(
    311     inputs=inputs,
    312     params=sampling_params,
    313     lora_request=lora_request,
    314     prompt_adapter_request=prompt_adapter_request)
--> 316 outputs = self._run_engine(use_tqdm=use_tqdm)
    317 return LLMEngine.validate_outputs(outputs, RequestOutput)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/entrypoints/llm.py:569, in LLM._run_engine(self, use_tqdm)
    567 total_out_toks = 0
    568 while self.llm_engine.has_unfinished_requests():
--> 569     step_outputs = self.llm_engine.step()
    570     for output in step_outputs:
    571         if output.finished:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/engine/llm_engine.py:911, in LLMEngine.step(self)
    901     finished_requests_ids = self.scheduler[
    902         0].get_and_reset_finished_requests_ids()
    903     execute_model_req = ExecuteModelRequest(
    904         seq_group_metadata_list=seq_group_metadata_list,
    905         blocks_to_swap_in=scheduler_outputs.blocks_to_swap_in,
   (...)
    909         running_queue_size=scheduler_outputs.running_queue_size,
    910         finished_requests_ids=finished_requests_ids)
--> 911     output = self.model_executor.execute_model(
    912         execute_model_req=execute_model_req)
    913 else:
    914     output = []

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/executor/gpu_executor.py:110, in GPUExecutor.execute_model(self, execute_model_req)
    107 def execute_model(
    108     self, execute_model_req: ExecuteModelRequest
    109 ) -> Optional[List[Union[SamplerOutput, PoolerOutput]]]:
--> 110     output = self.driver_worker.execute_model(execute_model_req)
    111     return output

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/worker/worker_base.py:272, in LocalOrDistributedWorkerBase.execute_model(self, execute_model_req)
    268 if not get_pp_group().is_first_rank:
    269     intermediate_tensors = IntermediateTensors(
    270         get_pp_group().recv_tensor_dict())
--> 272 output = self.model_runner.execute_model(
    273     model_input, self.kv_cache[worker_input.virtual_engine]
    274     if self.kv_cache is not None else None, intermediate_tensors,
    275     num_steps)
    277 if not get_pp_group().is_last_rank:
    278     # output is IntermediateTensors
    279     get_pp_group().send_tensor_dict(output.tensors)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/worker/model_runner.py:1334, in ModelRunner.execute_model(self, model_input, kv_caches, intermediate_tensors, num_steps)
   1331     return []
   1333 # Sample the next token.
-> 1334 output: SamplerOutput = self.model.sample(
   1335     logits=logits,
   1336     sampling_metadata=model_input.sampling_metadata,
   1337 )
   1339 if self.return_hidden_states:
   1340     # we only need to pass hidden states of most recent token
   1341     assert model_input.sampling_metadata is not None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/models/llama.py:437, in LlamaForCausalLM.sample(self, logits, sampling_metadata)
    432 def sample(
    433     self,
    434     logits: torch.Tensor,
    435     sampling_metadata: SamplingMetadata,
    436 ) -> Optional[SamplerOutput]:
--> 437     next_tokens = self.sampler(logits, sampling_metadata)
    438     return next_tokens

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py:91, in Sampler.forward(self, logits, sampling_metadata)
     89 # Prepare sampling tensors with pinned memory to avoid blocking.
     90 if not sampling_metadata.reuse_sampling_tensors:
---> 91     self._init_sampling_tensors(logits, sampling_metadata)
     92 elif self._do_penalties:
     93     # In this case, the sampling tensors logic depends on
     94     # "output_tokens" of a sequence. As a result, we cannot
     95     # reuse sampling tensors, since "output_tokens" changes
     96     # between decode runs.
     97     self._init_sampling_tensors(logits, sampling_metadata)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py:68, in Sampler._init_sampling_tensors(self, logits, sampling_metadata)
     64 self._sampling_tensors = None
     66 # Initialize new sampling tensors
     67 (sampling_tensors, do_penalties, do_top_p_top_k,
---> 68  do_min_p) = SamplingTensors.from_sampling_metadata(
     69     sampling_metadata, vocab_size, logits.device, logits.dtype)
     71 self._sampling_tensors = sampling_tensors
     72 self._do_penalties = do_penalties

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/sampling_metadata.py:443, in SamplingTensors.from_sampling_metadata(cls, sampling_metadata, vocab_size, device, dtype, extra_seeds_to_generate, extra_entropy)
    440     prompt_tokens.append(list(seq_data.prompt_token_ids))
    441     output_tokens.append(list(seq_data.output_token_ids))
--> 443 sampling_tensors = SamplingTensors.from_lists(
    444     temperatures, top_ps, top_ks, min_ps, presence_penalties,
    445     frequency_penalties, repetition_penalties, sampling_seeds,
    446     sample_indices, prompt_tokens, output_tokens, vocab_size,
    447     extra_seeds_to_generate, device, dtype)
    448 return (sampling_tensors, do_penalties, do_top_p_top_k, do_min_p)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/sampling_metadata.py:487, in SamplingTensors.from_lists(cls, temperatures, top_ps, top_ks, min_ps, presence_penalties, frequency_penalties, repetition_penalties, sampling_seeds, sample_indices, prompt_tokens, output_tokens, vocab_size, extra_seeds_to_generate, device, dtype)
    484     prompt_t = empty_tensor
    485     output_t = empty_tensor
--> 487 temperatures_t = torch.tensor(
    488     temperatures,
    489     device="cpu",
    490     dtype=dtype,
    491     pin_memory=pin_memory,
    492 )
    493 top_ps_t = torch.tensor(
    494     top_ps,
    495     device="cpu",
    496     dtype=dtype,
    497     pin_memory=pin_memory,
    498 )
    499 min_ps_t = torch.tensor(
    500     min_ps,
    501     device="cpu",
    502     dtype=dtype,
    503     pin_memory=pin_memory,
    504 )
```
robertgshaw2-neuralmagic commented 3 months ago

Can you share the exact prompts you are sending? This issue occurs sporadically, so detailed reproduction instructions would be very beneficial for us.

captify-sivakhno commented 3 months ago

@robertgshaw2-neuralmagic thanks for the fast reply. Here is the link to a file with 5000 prompts:

formatted_prompts.txt.zip

generated as

with open('/Volumes/qa/tv_segmentation_bronze/misc/formatted_prompts.txt', 'w') as f:
    for item in formatted_prompts:
        f.write("%s\n" % item)

This is what went into the input

output = llm.generate(formatted_prompts, sampling_params)
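For anyone reproducing from the attached file, a minimal sketch for reading the prompts back in, assuming one prompt per line as written by the loop above (a prompt that itself contains newlines would not survive this round trip intact):

```python
# Hedged sketch: load the attached prompt file back into a list.
# Assumes one prompt per line, matching the writing loop above.
with open("formatted_prompts.txt") as f:
    formatted_prompts = [line.rstrip("\n") for line in f if line.strip()]

# llm and sampling_params constructed as in the original report.
output = llm.generate(formatted_prompts, sampling_params)
```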
captify-sivakhno commented 3 months ago

BTW @robertgshaw2-neuralmagic, if you have access to Databricks, one option to easily and fully reproduce the environment is to run the code in a notebook on the 15.4 LTS ML Beta (15.4.x-gpu-ml-scala2.12) runtime, as that is where I ran it.

captify-sivakhno commented 3 months ago

@robertgshaw2-neuralmagic - regarding your comment about the prompt contents above, do you have any suggestions as to which properties of the prompts might be causing the error? I have rerun by reusing only the first prompt as an example

# other code as before
output = llm.generate([formatted_prompts[0]] * len(formatted_prompts), sampling_params)

and it completed fine. This is encouraging, but the space of possible causes is still large (prompt length, token composition, pattern of cache reuse, etc.).
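One way to narrow this down is to bisect the prompt list for a smaller failing subset; the sketch below is only an illustration, with two strong caveats: an illegal memory access leaves the CUDA context unusable, so every probe must run in a fresh process, and if the crash requires a particular mix of cached and non-cached prefills, no single contiguous slice may reproduce it at all. The probe script, file handling, and function names here are all hypothetical.

```python
# Hedged sketch: try to shrink the failing prompt set by bisection. Each probe
# runs in a fresh process because an illegal memory access poisons the parent
# CUDA context. Caveats: every probe reloads the model (slow), and the search
# only makes sense if some contiguous slice reproduces the crash on its own.
import json
import os
import subprocess
import sys
import tempfile

# Hypothetical probe program; mirrors the repro at the top of this issue.
PROBE = """
import json, sys
from vllm import LLM, SamplingParams
prompts = json.load(open(sys.argv[1]))
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_prefix_caching=True)
llm.generate(prompts, SamplingParams(temperature=0.6, top_p=0.15, max_tokens=1))
"""

def crashes(prompts):
    """Return True if generating over `prompts` fails in a fresh process."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(prompts, f)
        path = f.name
    try:
        return subprocess.run([sys.executable, "-c", PROBE, path]).returncode != 0
    finally:
        os.unlink(path)

def bisect_failing_slice(prompts):
    """Narrow [lo, hi) down to a smaller slice that still crashes, if one exists."""
    lo, hi = 0, len(prompts)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(prompts[lo:mid]):
            hi = mid        # first half still fails; keep shrinking it
        elif crashes(prompts[mid:hi]):
            lo = mid        # only the second half fails
        else:
            break           # neither half fails alone; a combination is needed
    return lo, hi
```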

mengban commented 3 months ago

Marking. Met the same problem.

chenchunhui97 commented 3 months ago

Marking; met the same problem in v0.5.0.post1.

Playerrrrr commented 3 months ago

same

zachzzc commented 3 months ago

Also seeing the same problem. I found that the issue arises when a cached prefill request is scheduled together with a non-cached request; the problem is gone if I force the scheduler to handle only one prefill request at a time. Still debugging.
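If that scheduling interaction is indeed the trigger, one blunt way to test the hypothesis is to cap the number of sequences scheduled per engine step so a cached prefill is never batched with a non-cached one. This is only a diagnostic sketch: `max_num_seqs` is a standard vLLM engine argument, but setting it to 1 also serializes decoding and gives up most of the throughput that prefix caching buys; the simpler mitigation remains dropping `enable_prefix_caching` until the scheduler-level fix lands.

```python
# Hedged sketch for testing the hypothesis above: schedule at most one request
# per engine step so a cached prefill is never batched with a non-cached one.
# Diagnostic setting only; it also serializes decoding.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
    max_num_seqs=1,  # one sequence per scheduler step
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.15, max_tokens=1)
formatted_prompts = ["..."]  # placeholder: the same formatted prompts as in the report
output = llm.generate(formatted_prompts, sampling_params)
```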

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!