fail ollama if it's not fully on gpu / ensure gpu mem freed before starting model instance #334

Closed: lukemarsden closed this issue 5 months ago

lukemarsden commented 5 months ago

Hmm, llama3:70b seems slow on this A100 even though it has the whole GPU.

I wonder if Ollama started while another process was still freeing GPU memory, and so fell back to putting some of the weights on the CPU.

Ideally we would make Ollama fail if it can't put all the weights on the GPU.

Two approaches:

1. Fail if we end up without all of our weights in GPU memory.
2. Ensure we've definitely freed the amount of memory we think we should have freed before we start a model instance.

I think the underlying problem is that unloading GPU memory isn't synchronous with sending SIGTERM to a process; we need to wait until the memory is actually free.
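A minimal sketch of what 2) could look like, assuming a single-GPU machine and shelling out to `nvidia-smi` (the function name, polling interval, and numbers are illustrative, not the actual Helix implementation):

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// waitForFreeVRAM polls nvidia-smi until at least requiredMiB of GPU memory
// is free, or until the timeout expires. Assumes a single-GPU machine, since
// nvidia-smi prints one line per GPU.
func waitForFreeVRAM(requiredMiB uint64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := exec.Command("nvidia-smi",
			"--query-gpu=memory.free", "--format=csv,noheader,nounits").Output()
		if err != nil {
			return fmt.Errorf("nvidia-smi failed: %w", err)
		}
		freeMiB, err := strconv.ParseUint(strings.TrimSpace(string(out)), 10, 64)
		if err != nil {
			return fmt.Errorf("parsing nvidia-smi output: %w", err)
		}
		if freeMiB >= requiredMiB {
			return nil // the memory we expected to be freed really is free
		}
		time.Sleep(500 * time.Millisecond) // unloading is async; keep polling
	}
	return fmt.Errorf("timed out waiting for %d MiB of free VRAM", requiredMiB)
}

func main() {
	// Illustrative: wait up to 30s for ~40 GiB to come free before starting
	// the next model instance.
	if err := waitForFreeVRAM(40*1024, 30*time.Second); err != nil {
		fmt.Println(err)
	}
}
```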

On 1), see this chat on Discord for implementation ideas: [screenshot]
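One way 1) could work: Ollama's `/api/ps` endpoint (the API behind `ollama ps`) reports both the total model size and how much of it is resident in VRAM, so the runner could refuse to proceed when the two disagree. A hedged sketch with illustrative names:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// psResponse mirrors the fields we care about from GET /api/ps.
type psResponse struct {
	Models []struct {
		Name     string `json:"name"`
		Size     uint64 `json:"size"`      // total model size in bytes
		SizeVRAM uint64 `json:"size_vram"` // bytes resident in GPU memory
	} `json:"models"`
}

// failIfNotFullyOnGPU returns an error if any loaded model has spilled
// weights to the CPU, i.e. size_vram < size.
func failIfNotFullyOnGPU(ollamaURL string) error {
	resp, err := http.Get(ollamaURL + "/api/ps")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var ps psResponse
	if err := json.NewDecoder(resp.Body).Decode(&ps); err != nil {
		return err
	}
	for _, m := range ps.Models {
		if m.SizeVRAM < m.Size {
			return fmt.Errorf("model %s is only partially on GPU (%d of %d bytes in VRAM)",
				m.Name, m.SizeVRAM, m.Size)
		}
	}
	return nil
}

func main() {
	if err := failIfNotFullyOnGPU("http://localhost:11434"); err != nil {
		log.Fatal(err)
	}
}
```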

lukemarsden commented 5 months ago

Actually, it looks like shutting down Ollama doesn't actually stop it.
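If that's the case, the supervisor probably needs to verify the process actually exited and escalate. A sketch of SIGTERM-then-SIGKILL escalation (assuming the target is not an unreaped child of this process; for a child, `os/exec`'s `Cmd.Wait` would be the right tool):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

// stopProcess sends SIGTERM and escalates to SIGKILL if the process is
// still alive after the grace period.
func stopProcess(pid int, grace time.Duration) error {
	proc, err := os.FindProcess(pid)
	if err != nil {
		return err
	}
	if err := proc.Signal(syscall.SIGTERM); err != nil {
		return err
	}
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		// Signal 0 checks liveness without delivering a signal.
		if err := proc.Signal(syscall.Signal(0)); err != nil {
			return nil // process has exited
		}
		time.Sleep(200 * time.Millisecond)
	}
	fmt.Printf("pid %d still alive after %s, sending SIGKILL\n", pid, grace)
	return proc.Signal(syscall.SIGKILL)
}

func main() {
	// Illustrative PID; in practice this would come from the runner's
	// bookkeeping of which Ollama instance it started.
	_ = stopProcess(12345, 10*time.Second)
}
```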

lukemarsden commented 5 months ago

https://mlops-community.slack.com/archives/C0675EX9V2Q/p1719230632320479

lukemarsden commented 5 months ago

Probably fixed in #340.