hmm, llama3:70b seems slow on this A100 even though it has the whole GPU to itself
i wonder if ollama started while another process was still freeing GPU memory, saw less free VRAM than there really was, and so fell back to putting some of the weights on CPU
ideally we would make ollama fail fast if it can't fit all the weights on the GPU
two approaches:
1) fail if we end up with some of our weights outside GPU memory (see the first sketch below)
2) before starting a model instance, verify we've actually freed the amount of GPU memory we think we freed. i think the root cause is that unloading GPU memory isn't synchronous with sending SIGTERM to a process; we need to wait until the memory is actually free (see the second sketch below)
on 1), see this chat on Discord for implementation ideas
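here's a minimal sketch of 1) in Go, assuming ollama's /api/ps endpoint, which reports size and size_vram for each loaded model; if size_vram is less than size, some layers spilled to CPU. the base URL and the hard-fail behavior are illustrative, not a final design:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// psResponse mirrors the parts of ollama's /api/ps reply we care about.
type psResponse struct {
	Models []struct {
		Name     string `json:"name"`
		Size     int64  `json:"size"`      // total bytes for the loaded model
		SizeVRAM int64  `json:"size_vram"` // bytes resident in GPU memory
	} `json:"models"`
}

// assertFullyOnGPU fails if any loaded model has weights outside VRAM.
func assertFullyOnGPU(baseURL string) error {
	resp, err := http.Get(baseURL + "/api/ps")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var ps psResponse
	if err := json.NewDecoder(resp.Body).Decode(&ps); err != nil {
		return err
	}
	for _, m := range ps.Models {
		if m.SizeVRAM < m.Size {
			return fmt.Errorf("model %s only has %d of %d bytes in VRAM: partial CPU offload",
				m.Name, m.SizeVRAM, m.Size)
		}
	}
	return nil
}

func main() {
	if err := assertFullyOnGPU("http://localhost:11434"); err != nil {
		panic(err)
	}
	fmt.Println("all loaded models fully in GPU memory")
}
```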
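and a sketch of 2): after SIGTERMing the old runner, poll nvidia-smi until used VRAM drops below a threshold before starting the next model. waitForVRAM and the 1 GiB threshold are made-up names/values for illustration; a real implementation might use NVML bindings (e.g. github.com/NVIDIA/go-nvml) instead of shelling out:

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// gpuMemUsedMiB queries nvidia-smi for memory currently used on GPU 0.
func gpuMemUsedMiB() (int, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=memory.used",
		"--format=csv,noheader,nounits",
		"--id=0").Output()
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(out)))
}

// waitForVRAM blocks until used GPU memory drops to at most maxUsedMiB,
// or the timeout expires. Call this after SIGTERMing the old runner and
// before starting the new model instance.
func waitForVRAM(maxUsedMiB int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		used, err := gpuMemUsedMiB()
		if err != nil {
			return err
		}
		if used <= maxUsedMiB {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("GPU memory not freed within %s", timeout)
}

func main() {
	// hypothetical threshold: expect under 1 GiB used once the old model is gone
	if err := waitForVRAM(1024, 30*time.Second); err != nil {
		panic(err)
	}
	fmt.Println("GPU memory freed, safe to start the next model")
}
```

polling is crude but matches the actual failure mode here: the old process is gone, but the driver hasn't released its allocations yet, so a one-shot check right after SIGTERM isn't enough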