ollama / ollama

Get up and running with Llama 3, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

Multi-GPU and batch management #4752

Open LaetLanf opened 1 month ago

LaetLanf commented 1 month ago

Hello,

I'm confident that a feature enabling multi-GPU optimization and batch management would be beneficial.

I may have made a mistake, but I couldn't get the OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS settings to make good use of my Linux VM, which has four A100 80GB GPUs, when running llama3:70b-instruct.

I finally succeeded in using the 4 GPUs in parallel by running separate Docker containers, each assigned to its own GPU and port. I also used AsyncClient() with asyncio for the asynchronous calls.

In any case, I'm sharing my code below in case it helps someone.

Assign Docker containers to GPUs and ports

# --gpus device=<index> pins each container to one specific GPU (a bare number would only set a GPU count)
sudo docker run -d --gpus device=0 -v ollama:/root/.ollama -p 11435:11434 --name ollama0 ollama/ollama:latest
sudo docker run -d --gpus device=1 -v ollama:/root/.ollama -p 11436:11434 --name ollama1 ollama/ollama:latest
sudo docker run -d --gpus device=2 -v ollama:/root/.ollama -p 11437:11434 --name ollama2 ollama/ollama:latest
sudo docker run -d --gpus device=3 -v ollama:/root/.ollama -p 11438:11434 --name ollama3 ollama/ollama:latest
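
Optionally, a quick way to check that all four endpoints are up before pulling anything (a small sketch using requests, not part of the original setup; Ollama's root endpoint just returns a short status string):

import requests

# the four host ports mapped in the docker run commands above
for port in range(11435, 11439):
    r = requests.get(f'http://localhost:{port}/', timeout=5)
    print(port, r.status_code, r.text)  # expect 200 and "Ollama is running"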

Pull llama3:70b-instruct

# all four containers mount the same "ollama" volume, so the later pulls
# mostly just verify that the layers are already present
sudo docker exec -it ollama0 ollama pull llama3:70b-instruct
sudo docker exec -it ollama1 ollama pull llama3:70b-instruct
sudo docker exec -it ollama2 ollama pull llama3:70b-instruct
sudo docker exec -it ollama3 ollama pull llama3:70b-instruct
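
To confirm the model is visible from each endpoint without exec-ing into the containers, something like this should work (a sketch against Ollama's /api/tags listing endpoint):

import requests

for port in range(11435, 11439):
    tags = requests.get(f'http://localhost:{port}/api/tags', timeout=5).json()
    names = [m['name'] for m in tags.get('models', [])]
    print(port, 'llama3:70b-instruct' in names)  # expect True on every port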

Python imports

import asyncio
import ollama
from ollama import AsyncClient

Chat with Ollama via an asynchronous Python function

# Send one chat request to a single Ollama endpoint; keep_alive=-1 keeps the model loaded in memory
async def ollama_chat_solo(client, messages, model_name):
    response = await client.chat(model=model_name, messages=messages, keep_alive=-1)
    return response
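
A minimal way to smoke-test this against a single container (the port and question are just placeholders):

async def smoke_test():
    client = AsyncClient(host='http://localhost:11435')  # first container's host port
    messages = [{'role': 'user', 'content': 'What is the capital of France?'}]
    response = await ollama_chat_solo(client, messages, 'llama3:70b-instruct')
    print(response['message']['content'])

asyncio.run(smoke_test())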

Batch processing: Ollama client pool and queue management

async def ollama_chat_batches(df, client_pool, sys_instruction, model_name):

    nb_questions = len(df['id_question'])

    # Create an empty queue:
    task_queue = asyncio.Queue()

    # Build and add each task to the queue, spreading questions round-robin across the client pool.
    # asyncio.ensure_future() schedules each request immediately; the queue only preserves
    # the order in which results are collected.
    for i in range(0, nb_questions, len(client_pool)):
        for j in range(len(client_pool)):
            if i + j < nb_questions:
                id_question = df['id_question'][i + j]
                question = df['question'][i + j]

                messages = [
                    {'role': "system", 'content': sys_instruction},
                    {'role': "user", 'content': question}
                ]
                task = asyncio.ensure_future(ollama_chat_solo(client_pool[j], messages, model_name))

                await task_queue.put((id_question, task))

    # Process tasks in the order they were added to the queue
    responses = []
    while not task_queue.empty():
        id_question, task = await task_queue.get()
        response = await task  # Wait for task completion
        if response is not None:
            responses.append((id_question, response))  # Store response with its question ID

    return responses

Calling

model_name = 'llama3:70b-instruct'

client_pool = [AsyncClient(host='http://localhost:{}'.format(port)) for port in range(11435, 11439)]

sys_instruction = """You are an expert in geography. Answer the question."""

responses = await ollama_chat_batches(questions_df, client_pool, sys_instruction, model_name)
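
The top-level await on the last line assumes an environment that supports it, such as a Jupyter notebook. A minimal sketch for running the same thing from a plain script (the questions_df contents here are made up just so the example is self-contained):

import asyncio
import pandas as pd  # assumed: the questions live in a pandas DataFrame

from ollama import AsyncClient

# hypothetical input data with the columns ollama_chat_batches expects
questions_df = pd.DataFrame({
    'id_question': [1, 2],
    'question': ['What is the capital of France?', 'Which river flows through Cairo?'],
})

if __name__ == '__main__':
    model_name = 'llama3:70b-instruct'
    client_pool = [AsyncClient(host=f'http://localhost:{port}') for port in range(11435, 11439)]
    sys_instruction = """You are an expert in geography. Answer the question."""

    # asyncio.run() drives the coroutine to completion outside a notebook
    responses = asyncio.run(ollama_chat_batches(questions_df, client_pool, sys_instruction, model_name))
    for id_question, response in responses:
        print(id_question, response['message']['content'])
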
KingingWang commented 1 month ago

You should be able to mount /usr/local/ollama/ as a read-only volume in Docker. This way, you won't need to download the model for each container.