Hello,
I'm confident that a feature enabling multi-GPU optimization and batch management would be beneficial.
I may have made a mistake, as I couldn't effectively use the OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS settings to optimize my Linux VM, which has four A100 80GB GPUs, with llama3:70b-instruct.
I finally succeeded in using the four GPUs in parallel by running separate Docker containers assigned to different ports. I also used AsyncClient() with asyncio for effective asynchronous operations.
In any case, I'm happy to share my code if it might help someone.
Batch processing, Ollama client and queue management
async def ollama_chat_batches(df, client_pool, sys_instruction, model_name):
    nb_questions = len(df['id_question'])
    # Create an empty queue:
    task_queue = asyncio.Queue()
    # Build the tasks and add them to the queue, spreading questions across the client pool:
    for i in range(0, nb_questions, len(client_pool)):
        for j in range(len(client_pool)):
            if i + j < nb_questions:
                id_question = df['id_question'][i + j]
                question = df['question'][i + j]
                messages = [
                    {'role': "system", 'content': sys_instruction},
                    {'role': "user", 'content': question}
                ]
                # Each client in the pool targets a different Ollama container, hence a different GPU:
                task = asyncio.ensure_future(ollama_chat_solo(client_pool[j], messages, model_name))
                await task_queue.put((id_question, task))
    # Collect the results in the order the tasks were added to the queue:
    responses = []
    while not task_queue.empty():
        id_question, task = await task_queue.get()
        response = await task  # Wait for task completion
        if response is not None:
            responses.append((id_question, response))  # Store the response with its question ID
    return responses
Calling
model_name = 'llama3:70b-instruct'
client_pool = [AsyncClient(host='http://localhost:{}'.format(port)) for port in range(11435, 11439)]
sys_instruction = """You are an expert in geography. Answer the question."""
responses = await ollama_chat_batches(questions_df, client_pool, sys_instruction, model_name)
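The call above uses a top-level await, which works in a notebook or inside another coroutine. From a plain script, the same batch run could be wrapped in asyncio.run(); the main() wrapper below is only illustrative:

async def main():
    responses = await ollama_chat_batches(questions_df, client_pool, sys_instruction, model_name)
    for id_question, response in responses:
        print(id_question, response)

asyncio.run(main())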
Assign docker containers to GPUs and ports
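A minimal sketch of this step, assuming the standard ollama/ollama image and the NVIDIA Container Toolkit: one container per GPU, pinned with --gpus device=N and published on its own host port. The container and volume names and the 11435-11438 host ports are illustrative; Ollama listens on 11434 inside each container.

# One Ollama container per GPU, each published on its own host port:
docker run -d --gpus device=0 -v ollama_gpu0:/root/.ollama -p 11435:11434 --name ollama_gpu0 ollama/ollama
docker run -d --gpus device=1 -v ollama_gpu1:/root/.ollama -p 11436:11434 --name ollama_gpu1 ollama/ollama
docker run -d --gpus device=2 -v ollama_gpu2:/root/.ollama -p 11437:11434 --name ollama_gpu2 ollama/ollama
docker run -d --gpus device=3 -v ollama_gpu3:/root/.ollama -p 11438:11434 --name ollama_gpu3 ollama/ollama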
Pull llama3:70b-instruct
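Again only a sketch, reusing the container names assumed above; the model is pulled inside each container so every GPU-bound instance has its own copy:

# Pull the model once per container:
for name in ollama_gpu0 ollama_gpu1 ollama_gpu2 ollama_gpu3; do
  docker exec "$name" ollama pull llama3:70b-instruct
done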
Python import
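The snippets above need at least the following imports (pandas only if the questions are read into a DataFrame, as the df indexing suggests):

import asyncio

import pandas as pd
from ollama import AsyncClient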
Chat with Ollama from an asynchronous Python function
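The batch function above relies on ollama_chat_solo; a minimal sketch consistent with how it is called there, returning the reply text or None on failure, could look like this:

async def ollama_chat_solo(client, messages, model_name):
    # Send a single chat request to the Ollama instance behind this client.
    try:
        response = await client.chat(model=model_name, messages=messages)
        return response['message']['content']
    except Exception as e:
        print(f"Chat request failed: {e}")
        return None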