ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com
Apache License 2.0

How to use partial GPU? #117

Open rifkybujana opened 8 months ago

rifkybujana commented 8 months ago

Hi, I wonder if it's possible to use a fraction of a GPU per model instead of one full GPU for each deployed model. For example, a g5.12xlarge instance on AWS has 4 GPUs, so it can host at most 4 models at one GPU each. With half a GPU per model, it might be able to host eight quantized models. Changing num_gpus_per_worker resulted in an error.

sihanwang41 commented 8 months ago

Hi @rifkybujana, what if you change num_workers to 2 and keep num_gpus_per_worker as it is?

lizzzcai commented 8 months ago

Any update on this? I am running a similar test and want to know the best practice for deploying 8 models on a 4-GPU instance. What is the difference between 1 worker with 4 replicas and 4 workers with 1 replica each? Thanks.
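To make the GPU arithmetic behind these questions concrete, here is a minimal sketch in plain Python. It does not call Ray or RayLLM: the `Layout` class and both helper functions are hypothetical names made up for illustration. It assumes that each worker requests a fixed GPU share, that a fractional share has to fit on a single physical GPU, and that a deployment needs replicas × workers-per-replica workers in total.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Layout:
    """Hypothetical description of one model deployment (not a RayLLM class)."""
    replicas: int               # independent copies of the model
    workers_per_replica: int    # workers that together serve one copy
    gpus_per_worker: Fraction   # GPU share requested by each worker

    def workers_needed(self) -> int:
        return self.replicas * self.workers_per_replica


def max_workers(total_gpus: int, gpus_per_worker: Fraction) -> int:
    """Count the workers that fit on the node, by resource bookkeeping only."""
    if gpus_per_worker < 1:
        # A fractional share cannot span two GPUs, so pack per GPU.
        return total_gpus * int(1 // gpus_per_worker)
    return int(total_gpus // gpus_per_worker)


def max_deployments(total_gpus: int, layout: Layout) -> int:
    """Count the deployments of this layout that fit on the node."""
    return max_workers(total_gpus, layout.gpus_per_worker) // layout.workers_needed()


half_gpu = Layout(replicas=1, workers_per_replica=1, gpus_per_worker=Fraction(1, 2))
four_replicas = Layout(replicas=4, workers_per_replica=1, gpus_per_worker=Fraction(1))
four_workers = Layout(replicas=1, workers_per_replica=4, gpus_per_worker=Fraction(1))

print(max_deployments(4, half_gpu))       # models that fit at half a GPU each
print(max_deployments(4, four_replicas))  # 4 replicas x 1 worker x 1 GPU
print(max_deployments(4, four_workers))   # 1 replica x 4 workers x 1 GPU
```

Under these assumptions, the last two layouts reserve the same 4 GPUs and differ only in how the GPUs are used: four independent copies of the model versus one copy spread across four workers. Also note that a fractional GPU request in Ray is scheduling bookkeeping and does not enforce memory isolation, so two models sharing a GPU must each keep their own memory use within their share.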