vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Illegal Memory Access issue randomly while running #1214

Closed: LOCKhart07 closed this issue 4 months ago

LOCKhart07 commented 10 months ago

Running in a Docker container. After this error, all subsequent API requests returned 'Internal server error'.
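The requests were standard OpenAI-compatible completion calls along these lines (a sketch only: the host is a placeholder and the prompt is invented, since the real prompts are confidential; the port and served model name are taken from the launch command later in this thread):

```
# Example completion request against the served model (prompt is a placeholder)
curl http://<server>:80/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "codellama-13b-instruct",
        "prompt": "def fibonacci(n):",
        "max_tokens": 256
      }'
```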


```
INFO 09-28 06:39:34 llm_engine.py:613] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 17.1%, CPU KV cache usage: 0.0%
INFO 09-28 06:39:39 llm_engine.py:613] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 17.4%, CPU KV cache usage: 0.0%
INFO 09-28 06:39:44 llm_engine.py:613] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 17.8%, CPU KV cache usage: 0.0%
INFO 09-28 06:39:49 llm_engine.py:613] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 18.0%, CPU KV cache usage: 0.0%
Exception in callback _raise_exception_on_finish(request_tracker=<vllm.engine....x7f71353d42b0>)(<Task finishe...sertions.\n')>) at /data/vllm/vllm/engine/async_llm_engine.py:21
handle: <Handle _raise_exception_on_finish(request_tracker=<vllm.engine....x7f71353d42b0>)(<Task finishe...sertions.\n')>) at /data/vllm/vllm/engine/async_llm_engine.py:21>
Traceback (most recent call last):
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 27, in _raise_exception_on_finish
    task.result()
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 316, in run_engine_loop
    await self.engine_step()
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 301, in engine_step
    request_outputs = await self.engine.step_async()
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 173, in step_async
    output = await self._run_workers_async(
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 198, in _run_workers_async
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/vllm/vllm/worker/worker.py", line 293, in execute_model
    output = self.model(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/vllm/vllm/model_executor/models/llama.py", line 262, in forward
    next_tokens = self.sampler(self.lm_head.weight, hidden_states,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/vllm/vllm/model_executor/layers/sampler.py", line 44, in forward
    hidden_states = _prune_hidden_states(hidden_states, input_metadata)
  File "/data/vllm/vllm/model_executor/layers/sampler.py", line 104, in _prune_hidden_states
    0, torch.tensor(last_token_indicies, device=hidden_states.device))
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
    raise exc
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 31, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 09-28 06:39:52 async_llm_engine.py:120] Aborted request cmpl-cdd04d1887cd476eb6baf43e83c45991.
INFO:     172.17.0.1:54394 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 27, in _raise_exception_on_finish
    task.result()
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 316, in run_engine_loop
    await self.engine_step()
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 301, in engine_step
    request_outputs = await self.engine.step_async()
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 173, in step_async
    output = await self._run_workers_async(
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 198, in _run_workers_async
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/vllm/vllm/worker/worker.py", line 293, in execute_model
    output = self.model(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/vllm/vllm/model_executor/models/llama.py", line 262, in forward
    next_tokens = self.sampler(self.lm_head.weight, hidden_states,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/vllm/vllm/model_executor/layers/sampler.py", line 44, in forward
    hidden_states = _prune_hidden_states(hidden_states, input_metadata)
  File "/data/vllm/vllm/model_executor/layers/sampler.py", line 104, in _prune_hidden_states
    0, torch.tensor(last_token_indicies, device=hidden_states.device))
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/data/vllm/vllm/entrypoints/openai/api_server.py", line 522, in create_completion
    async for res in result_generator:
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 391, in generate
    raise e
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 386, in generate
    async for request_output in stream:
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 69, in __anext__
    raise result
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
    raise exc
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 31, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
```
viktor-ferenczi commented 10 months ago

Please provide the OS, CUDA version, CPU, CPU RAM, GPU(s), GPU VRAM sizes, the command line you started vLLM with, the model used, the prompt(s), and the full vLLM log output for diagnosis.

It may also be useful to know where you ran that Docker container.

Do you have a way to reproduce the problem with a locally running Docker container?

If it is a random issue, then how frequent is it?
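
For reference, most of these details can be gathered with standard Linux and PyTorch commands, assuming `nvidia-smi` and the vLLM Python environment are available inside the container:

```
# OS and kernel
cat /etc/os-release
uname -r

# CPU model, core count, and RAM
lscpu | grep "Model name"
nproc
free -h

# GPUs, VRAM, driver, and CUDA version as reported by the driver
nvidia-smi

# CUDA version PyTorch was built against
python -c "import torch; print(torch.version.cuda)"
```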

LOCKhart07 commented 10 months ago

OS: CentOS 7
CUDA version: 12.2
CPU: AMD EPYC 7V13 64-Core Processor
CPU RAM: 432 GB
GPU: 2x NVIDIA A100 80GB

Command used:
python -m vllm.entrypoints.openai.api_server --model /data/LLAMA_CODE/codellama/CodeLlama-13b-Instruct/hf --tensor-parallel-size 1 --port 80 --host 0.0.0.0 --served-model codellama-13b-instruct --gpu-memory-utilization 0.75

Model: codellama-13b-instruct. It has also happened once with llama-2-13b-chat.

I can't provide the prompts or the full vLLM output, as they contain confidential information.

I wasn't able to reproduce the issue manually.

viktor-ferenczi commented 10 months ago

Please try limiting the vLLM process to a certain set of CPU cores using `taskset`.

Ideally you will find a set of cores that works without crashing. That would serve as a workaround and help narrow the issue down further.

You should only need to try a few "logical" combinations based on your CPU's architecture. See `cat /proc/cpuinfo` or look up documentation on that CPU. You can start with just a few cores and increase the number as long as it stays stable, or start from all cores and halve the set each time to narrow it down. Either way should work.

If no stable set of cores exists (other than a single core), then the problem is somewhere else.
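
A minimal sketch of that workaround, reusing the launch command reported above (the 0-15 core range is purely illustrative, not a recommendation for this particular EPYC layout):

```
# Inspect the CPU topology to choose a sensible core set
lscpu

# Pin the server to an example range of cores (0-15 is arbitrary)
taskset -c 0-15 python -m vllm.entrypoints.openai.api_server \
    --model /data/LLAMA_CODE/codellama/CodeLlama-13b-Instruct/hf \
    --tensor-parallel-size 1 --port 80 --host 0.0.0.0 \
    --served-model codellama-13b-instruct --gpu-memory-utilization 0.75
```

When running inside Docker, the same pinning can also be applied at the container level with `docker run --cpuset-cpus`.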

LOCKhart07 commented 10 months ago

It crashed again. Just before the crash, GPU KV cache usage was around 18% again.

michaelroyzen commented 10 months ago

We had this issue as well with vLLM version 0.1.7. Please install the latest version -- it should fix it.
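
For reference, upgrading is the usual pip command inside the serving environment (a Docker-based deployment would need its image rebuilt or updated accordingly):

```
# Upgrade vLLM to the latest release
pip install --upgrade vllm

# Confirm the installed version
pip show vllm
```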

hmellor commented 4 months ago

Closing this issue as stale, since there has been no discussion in the past 3 months.

If you are still experiencing the issue you describe, feel free to re-open this issue.