Hugging Face first reads the whole model into main memory and only then copies it to GPU memory. For a 27 GB model that means we need 27 GB of RAM during startup, even though that memory is no longer needed afterwards.
I remember there is a way to stream the model directly into GPU memory instead of the above behaviour. We should implement this to save a large portion of RAM.
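A sketch of what this could look like, assuming we load via `transformers` (the model name is a placeholder for whatever checkpoint we actually use): `low_cpu_mem_usage=True` initializes the model on the meta device instead of materializing a full copy in RAM, and `device_map="auto"` lets `accelerate` place the weight shards directly onto the GPU as they are read from disk.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "our-org/our-27gb-model",  # placeholder checkpoint name
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,    # skip the full in-RAM copy; init on the meta device
    device_map="auto",         # accelerate streams shards straight to GPU memory
)
```

This requires the `accelerate` package to be installed alongside `transformers`; peak RAM should then be roughly one shard rather than the full model, though we should verify the actual savings with our model.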