runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.

enforce_eager flag #40

Closed · dannysemi closed this issue 6 months ago

dannysemi commented 6 months ago
2024-02-02 21:44:03.976 [b2k2ml81pl56tl] [info] engine.py :190 2024-02-03 03:44:03,976 Error initializing vLLM engine: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 53.38 MiB is free. Process 3663367 has 44.28 GiB memory in use. Of the allocated memory 37.38 GiB is allocated by PyTorch, and 275.68 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2024-02-02 21:43:49.554 [b2k2ml81pl56tl] [info] INFO 02-03 03:43:49 model_runner.py:629] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
2024-02-02 21:43:49.554 [b2k2ml81pl56tl] [info] INFO 02-03 03:43:49 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-02-02 21:43:48.339 [b2k2ml81pl56tl] [info] INFO 02-03 03:43:48 llm_engine.py:316] # GPU blocks: 7754, # CPU blocks: 2048
2024-02-02 21:43:24.749 [b2k2ml81pl56tl] [info] INFO 02-03 03:43:17 weight_utils.py:164] Using model weights format ['*.safetensors']

I get this error when trying to run GPTQ models. I've built the worker myself with enforce_eager set to True and it works. Could this be exposed as an environment variable? Or is something else preventing this model from using the proper amount of VRAM?
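For context, this is roughly what my local change looks like: just reading a flag and passing it into the engine args. A sketch only; the `ENFORCE_EAGER` variable is something I added myself, not an existing option in this worker, and the other env var names are illustrative.

```python
# Sketch: reading a (hypothetical) ENFORCE_EAGER env var and passing it
# through to the vLLM async engine alongside the usual knobs.
import os

from vllm import AsyncEngineArgs, AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model=os.environ["MODEL_NAME"],
    quantization=os.environ.get("QUANTIZATION"),  # e.g. "gptq"
    gpu_memory_utilization=float(os.environ.get("GPU_MEMORY_UTILIZATION", 0.95)),
    # Skips CUDA graph capture, saving the extra 1~3 GiB per GPU mentioned
    # in the log above, at the cost of some throughput.
    enforce_eager=os.environ.get("ENFORCE_EAGER", "0") == "1",
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```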

alpayariyak commented 6 months ago

Try lowering GPU_MEMORY_UTILIZATION first

dannysemi commented 6 months ago

> Try lowering GPU_MEMORY_UTILIZATION first

I set it to 0.8 and it worked. Does this reduce the model's performance?

alpayariyak commented 6 months ago

It will lower the total number of requests you can handle concurrently. Try starting from a value higher than 0.8 (but below the default of 0.95) and work your way down until it no longer crashes.
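Roughly, the fraction determines how much memory is left for the paged KV cache after the weights are loaded, and the KV cache bounds how many tokens can be in flight at once. A back-of-the-envelope sketch using the block count from the log above and vLLM's default block size of 16 tokens (the 4096-token context below is just an example figure):

```python
# Rough estimate of in-flight token capacity from the KV-cache block count.
# 7754 GPU blocks comes from the log above; 16 tokens per block is vLLM's default.
num_gpu_blocks = 7754
block_size = 16  # tokens per KV-cache block

cache_tokens = num_gpu_blocks * block_size
print(f"~{cache_tokens:,} tokens of KV cache")        # ~124,064 tokens

# With e.g. 4096-token sequences, that is roughly this many concurrent requests:
print(f"~{cache_tokens // 4096} full-length sequences at once")
```

Lowering `gpu_memory_utilization` shrinks that block count, which is why concurrency drops before anything else does.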

alpayariyak commented 6 months ago

Disabling CUDA graphs by setting enforce_eager to True would lower performance, though, so lowering GPU memory utilization is the better option.
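For completeness, if you do still want eager mode (for example to check whether CUDA graph capture is what pushes the GPU over the edge), vLLM accepts the flag directly, as the log message above says. A minimal sketch with a placeholder model name:

```python
# Sketch: running vLLM in eager mode directly. Skipping CUDA graph capture
# saves the ~1-3 GiB per GPU noted in the log, but typically reduces throughput.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Example-GPTQ",  # placeholder model name
    quantization="gptq",
    enforce_eager=True,             # disable CUDA graphs
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```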

dannysemi commented 6 months ago

Thank you. I'll figure out a number that works.