fail ollama if it's not fully on gpu / ensure gpu mem freed before starting model instance #334

Closed: lukemarsden closed this issue 5 months ago

lukemarsden commented 5 months ago

Hmm, llama3:70b seems slow on this A100 even though it has the whole GPU.

I wonder if Ollama started while another process was still freeing GPU memory, and so fell back to putting some of the weights on the CPU.

Ideally we would make Ollama fail if it can't put all the weights on the GPU.

Two approaches:

1. Fail if we end up without all of our weights in GPU memory.
2. Ensure we've definitely freed the amount of memory we think we should have freed before we start a model instance.

I think the underlying problem is that unloading GPU memory isn't synchronous with sending SIGTERM to a process; we need to wait until the memory is actually free.
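A minimal sketch of what 2) could look like, assuming a single-GPU machine and shelling out to `nvidia-smi` (the function name, polling interval, and numbers are illustrative, not the actual Helix implementation):

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// waitForFreeVRAM polls nvidia-smi until at least requiredMiB of GPU memory
// is free, or until the timeout expires. Assumes a single-GPU machine, since
// nvidia-smi prints one line per GPU.
func waitForFreeVRAM(requiredMiB uint64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := exec.Command("nvidia-smi",
			"--query-gpu=memory.free", "--format=csv,noheader,nounits").Output()
		if err != nil {
			return fmt.Errorf("nvidia-smi failed: %w", err)
		}
		freeMiB, err := strconv.ParseUint(strings.TrimSpace(string(out)), 10, 64)
		if err != nil {
			return fmt.Errorf("parsing nvidia-smi output: %w", err)
		}
		if freeMiB >= requiredMiB {
			return nil // the memory we expected to be freed really is free
		}
		time.Sleep(500 * time.Millisecond) // unloading is async; keep polling
	}
	return fmt.Errorf("timed out waiting for %d MiB of free VRAM", requiredMiB)
}

func main() {
	// Illustrative: wait up to 30s for ~40 GiB to come free before starting
	// the next model instance.
	if err := waitForFreeVRAM(40*1024, 30*time.Second); err != nil {
		fmt.Println(err)
	}
}
```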

On 1), see this chat on Discord for implementation ideas: [screenshot]
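One way 1) could work: Ollama's `/api/ps` endpoint (the API behind `ollama ps`) reports both the total model size and how much of it is resident in VRAM, so the runner could refuse to proceed when the two disagree. A hedged sketch with illustrative names:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// psResponse mirrors the fields we care about from GET /api/ps.
type psResponse struct {
	Models []struct {
		Name     string `json:"name"`
		Size     uint64 `json:"size"`      // total model size in bytes
		SizeVRAM uint64 `json:"size_vram"` // bytes resident in GPU memory
	} `json:"models"`
}

// failIfNotFullyOnGPU returns an error if any loaded model has spilled
// weights to the CPU, i.e. size_vram < size.
func failIfNotFullyOnGPU(ollamaURL string) error {
	resp, err := http.Get(ollamaURL + "/api/ps")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var ps psResponse
	if err := json.NewDecoder(resp.Body).Decode(&ps); err != nil {
		return err
	}
	for _, m := range ps.Models {
		if m.SizeVRAM < m.Size {
			return fmt.Errorf("model %s is only partially on GPU (%d of %d bytes in VRAM)",
				m.Name, m.SizeVRAM, m.Size)
		}
	}
	return nil
}

func main() {
	if err := failIfNotFullyOnGPU("http://localhost:11434"); err != nil {
		log.Fatal(err)
	}
}
```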

lukemarsden commented 5 months ago

Actually, it looks like shutting down Ollama doesn't actually stop it.
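If that's the case, the supervisor probably needs to verify the process actually exited and escalate. A sketch of SIGTERM-then-SIGKILL escalation (assuming the target is not an unreaped child of this process; for a child, `os/exec`'s `Cmd.Wait` would be the right tool):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

// stopProcess sends SIGTERM and escalates to SIGKILL if the process is
// still alive after the grace period.
func stopProcess(pid int, grace time.Duration) error {
	proc, err := os.FindProcess(pid)
	if err != nil {
		return err
	}
	if err := proc.Signal(syscall.SIGTERM); err != nil {
		return err
	}
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		// Signal 0 checks liveness without delivering a signal.
		if err := proc.Signal(syscall.Signal(0)); err != nil {
			return nil // process has exited
		}
		time.Sleep(200 * time.Millisecond)
	}
	fmt.Printf("pid %d still alive after %s, sending SIGKILL\n", pid, grace)
	return proc.Signal(syscall.SIGKILL)
}

func main() {
	// Illustrative PID; in practice this would come from the runner's
	// bookkeeping of which Ollama instance it started.
	_ = stopProcess(12345, 10*time.Second)
}
```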

lukemarsden commented 5 months ago

https://mlops-community.slack.com/archives/C0675EX9V2Q/p1719230632320479

lukemarsden commented 5 months ago

Probably fixed in #340.