runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.

OOM on second request #78

Closed: Permafacture closed this issue 2 months ago

Permafacture commented 2 months ago

I'm using MODEL_NAME=TheBloke/dolphin-2.2.1-mistral-7B-AWQ and QUANTIZATION=awq on a RunPod serverless instance with a network drive and an RTX 4090 (which should be plenty of VRAM for this model), with the Docker image runpod/worker-vllm:stable-cuda12.1.0.

My first request completes successfully, but the second request to the same worker (sent after the first has completed) always crashes with an OOM error. If I log into the web terminal, nvidia-smi shows all of the VRAM in use but lists no process as responsible.

Here's the code I'm using. I just run this once, wait for it to complete, and then run it again.

from openai import OpenAI
import os

api_key="****"
endpoint_id="*****"
model_name = "TheBloke/dolphin-2.2.1-mistral-7B-AWQ"

# Initialize the OpenAI Client with your RunPod API Key and Endpoint URL
client = OpenAI(
    api_key=api_key,
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
)

completion = "In the world of oncology, Pik3CA is"

print(f"Non-streaming completion of prompt: {completion}")
response = client.completions.create(
    model=model_name,
    prompt=completion,
    temperature=0,
    max_tokens=100,
)
# Print the response
print(response.choices[0].text)

Additional information

[Two screenshots attached, taken 2024-06-24.]

alpayariyak commented 2 months ago

Hi, try setting the environment variable GPU_MEMORY_UTILIZATION to 0.9.

Permafacture commented 2 months ago

That seems to have fixed it, thanks! One note: the README says the default is 0.95, but in the container there was no value set for that environment variable until I set it myself.