runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.

Runpod serverless vLLM with Llama 3 70B on 40GB GPU #68

Closed EdwardTheLegend closed 2 months ago

EdwardTheLegend commented 3 months ago

I'm running a RunPod serverless vLLM template with Llama 3 70B on a 40GB GPU. One of the requests failed and I'm not completely sure what happened, but the message asked me to open a GitHub issue, so I'll just leave it here in case it's of any help to anyone.

{
  "delayTime": 164,
  "error": "handler: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
traceback: Traceback (most recent call last):
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    task.result()
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 414, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 393, in engine_step
    request_outputs = await self.engine.step_async()
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 189, in step_async
    all_outputs = await self._run_workers_async(
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 276, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-installation/vllm/worker/worker.py", line 223, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-installation/vllm/worker/model_runner.py", line 582, in execute_model
    hidden_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-installation/vllm/model_executor/models/llama.py", line 337, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-installation/vllm/model_executor/models/llama.py", line 267, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-installation/vllm/model_executor/models/llama.py", line 226, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-installation/vllm/model_executor/models/llama.py", line 77, in forward
    gate_up, _ = self.gate_up_proj(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-installation/vllm/model_executor/layers/linear.py", line 215, in forward
    output_parallel = self.linear_method.apply_weights(
  File "/vllm-installation/vllm/model_executor/layers/quantization/awq.py", line 158, in apply_weights
    out = ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 493.75 MiB is free. Process 2531074 has 43.85 GiB memory in use. Of the allocated memory 39.71 GiB is allocated by PyTorch, and 1.16 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py", line 194, in run_job_generator
    async for output_partial in job_output:
  File "/src/handler.py", line 13, in handler
    async for batch in results_generator:
  File "/src/engine.py", line 132, in generate
    async for response in self._handle_chat_or_completion_request(openai_request):
  File "/src/engine.py", line 166, in _handle_chat_or_completion_request
    async for chunk_str in response_generator:
  File "/vllm-installation/vllm/entrypoints/openai/serving_chat.py", line 148, in chat_completion_stream_generator
    async for res in result_generator:
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 577, in generate
    raise e
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 571, in generate
    async for request_output in stream:
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 69, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py", line 194, in run_job_generator
    async for output_partial in job_output:
  File "/src/handler.py", line 13, in handler
    async for batch in results_generator:
  File "/src/engine.py", line 132, in generate
    async for response in self._handle_chat_or_completion_request(openai_request):
  File "/src/engine.py", line 166, in _handle_chat_or_completion_request
    async for chunk_str in response_generator:
  File "/vllm-installation/vllm/entrypoints/openai/serving_chat.py", line 148, in chat_completion_stream_generator
    async for res in result_generator:
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 577, in generate
    raise e
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 571, in generate
    async for request_output in stream:
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 69, in __anext__
    raise result
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    raise exc
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
",
  "executionTime": 70626,
  "id": "b28e9cd8-88e2-4485-aa00-7115aba3457c-e1",
  "status": "FAILED"
}
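(For reference, the root cause buried in the trace is the torch.cuda.OutOfMemoryError, whose message suggests tuning the allocator via PYTORCH_CUDA_ALLOC_CONF. Below is a minimal, purely illustrative sketch of how that hint would be applied, assuming it is set before CUDA initializes, e.g. as an endpoint environment variable; it only mitigates fragmentation and cannot help when the model's weights simply don't fit on the card.)

```python
# Illustrative only: applying the allocator hint from the OOM message above.
# The variable must be set before CUDA is initialized; the value is an example.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after setting the variable so the allocator picks it up

print(torch.cuda.is_available())
```
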
bryankruman commented 3 months ago

Just curious, what model/what environment variables do you have for your configuration of this endpoint?

EdwardTheLegend commented 3 months ago

No environment variables were set apart from the model, which was meta-llama/Meta-Llama-3-70B. I've since switched to a quantized version someone else made, casperhansen/llama-3-70b-instruct-awq, which works for me, so this can be closed if you want.
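For anyone hitting the same wall, here is a minimal local vLLM sketch (not the RunPod worker itself) that loads the same AWQ checkpoint. The model name comes from this thread; the memory and context settings are illustrative assumptions, not recommended values.

```python
# Minimal local vLLM sketch to sanity-check that the AWQ-quantized 70B
# checkpoint fits on a single ~48 GB GPU. Assumes vLLM is installed and the
# checkpoint is reachable on Hugging Face.
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",
    quantization="awq",            # matches the checkpoint's quantization method
    gpu_memory_utilization=0.90,   # leave some headroom (illustrative value)
    max_model_len=4096,            # shorter context reduces KV-cache memory (illustrative)
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```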

alpayariyak commented 2 months ago

Yes, unquantized 70B won't fit on 48GB.
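As a rough back-of-envelope illustration of why (numbers are approximate and ignore the KV cache, activations, and runtime overhead):

```python
# Rough weight-memory estimate for a 70B-parameter model.
params = 70e9

fp16_gb = params * 2 / 1e9    # 16-bit weights: ~140 GB -> far more than one 48 GB card
awq4_gb = params * 0.5 / 1e9  # ~4-bit AWQ weights: ~35 GB -> fits on one 48 GB card

print(f"fp16/bf16 weights: ~{fp16_gb:.0f} GB")
print(f"4-bit AWQ weights: ~{awq4_gb:.0f} GB")
```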

enpro-github commented 2 months ago

Hey, I'm curious, what Docker container did you use in the template? I can't get anything to work with Llama 3 so far.

alpayariyak commented 2 months ago

Which Llama 3 size and which GPUs were you trying to use, and in what quantity per worker?

enpro-github commented 2 months ago

Llama-3-8B(https://huggingface.co/meta-llama/Meta-Llama-3-8B). I was using the Nvidia 4090, 12 vCPU, 25GB RAM. 1 GPU/worker

ashleykleynhans commented 2 months ago

Llama-3-8B(https://huggingface.co/meta-llama/Meta-Llama-3-8B). I was using the Nvidia 4090, 12 vCPU, 25GB RAM. 1 GPU/worker

1 x 24GB GPU should be fine for this. What do your environment variable settings look like? Also, since you're using a different configuration than 70B, it's not related to this issue, so you shouldn't be hijacking this issue with unrelated queries.

alpayariyak commented 2 months ago

Have you tried launching it from the UI? Also, Llama 3 is a gated model, so you need to gain access by filling out the form on the model's page on Hugging Face and providing your Hugging Face token when deploying. Alternatively, you can deploy an ungated reupload, such as NousResearch/Meta-Llama-3-8B-Instruct, which requires no token.
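To illustrate the gated vs. ungated distinction with plain vLLM (a local sketch, not the serverless worker; the token handling shown is an assumption based on standard Hugging Face Hub authentication):

```python
# Local vLLM sketch illustrating gated vs. ungated Llama 3 8B checkpoints
# (not the serverless worker itself).
from vllm import LLM

# Gated original: requires approved access on the model page and a Hugging Face
# token exposed to the process, e.g.
#   import os
#   os.environ["HF_TOKEN"] = "hf_..."  # placeholder, never hard-code real tokens
#   llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Ungated reupload mentioned above: no token needed.
llm = LLM(model="NousResearch/Meta-Llama-3-8B-Instruct")
print(llm.generate(["Hello"])[0].outputs[0].text)
```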