Can you try setting `gpu_memory_utilization` to a higher value? It controls the proportion of GPU memory vLLM is allowed to use (the default is 0.9).
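For reference, a minimal sketch of how that option is passed when constructing `LLM` (the model name below is just a placeholder, not the one from this issue):

```python
from vllm import LLM

# Sketch: raise the fraction of GPU memory vLLM is allowed to reserve.
# "your-model-name" is a placeholder; substitute the actual model id.
llm = LLM(
    model="your-model-name",
    gpu_memory_utilization=0.95,  # default is 0.9
)
```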
Thanks for the response. I tried three values ([1, 0.9, 0.7]), but none of them worked. @DarkLight1337
Thanks to @ywang96, we have figured out the reason. The model has a 128k context length by default, so it might not fit in your GPU. Try passing `max_model_len=8192` (or some other value that lets it fit on your GPU) to `LLM` in the example.
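A quick sketch of what that looks like (again, the model name is a placeholder, not the specific model from this issue):

```python
from vllm import LLM

# Sketch: cap the context window so the KV cache fits on a smaller GPU.
# The model defaults to a 128k context length, which may be too large.
llm = LLM(
    model="your-model-name",
    max_model_len=8192,
)
```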
Thanks @DarkLight1337 and @ywang96, it worked like a charm.
I was wondering why this happens. I am new to this space and have been playing around with different machines, models, and frameworks.
I am able to run inference on a single image (on an RTX 3070) in around 70s using Hugging Face Transformers. I tried a similar thing with vLLM (current main branch), and it ran out of memory, which got me curious.