Hugging Face first reads the whole model into main memory and only then copies it to GPU memory. For a 27 GB model that means we need 27 GB of RAM during startup, even though that memory is no longer needed afterwards.
I remember there is a way to stream the model directly into GPU memory instead of the above behaviour. We should implement this to save a large portion of RAM.
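A sketch of what this could look like, assuming we load via `transformers` (the model name is a placeholder for whatever checkpoint we actually use): `low_cpu_mem_usage=True` initializes the model on the meta device instead of materializing a full copy in RAM, and `device_map="auto"` lets `accelerate` place the weight shards directly onto the GPU as they are read from disk.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "our-org/our-27gb-model",  # placeholder checkpoint name
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,    # skip the full in-RAM copy; init on the meta device
    device_map="auto",         # accelerate streams shards straight to GPU memory
)
```

This requires the `accelerate` package to be installed alongside `transformers`; peak RAM should then be roughly one shard rather than the full model, though we should verify the actual savings with our model.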