hanchingyong closed this 1 month ago
I've just tested Ollama's concurrency feature using the docker-compose.yml below. Answer generation stayed at the same speed even while serving 4 requests in parallel. Memory usage was as follows: GPU memory was around 6.3 GB and CPU memory was around 34.5 GB. Will be exploring vLLM next.
```yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama
    expose:
      - 11434/tcp
    ports:
      - 11434:11434/tcp
    healthcheck:
      test: ollama --version || exit 1
    command: serve
    volumes:
      - ollama:/root/.ollama
    environment:
      OLLAMA_NUM_PARALLEL: "4"       # requests each loaded model serves concurrently
      OLLAMA_MAX_LOADED_MODELS: "4"  # models that may be loaded at the same time
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all             # expose all GPUs ('count: all' is the documented form; 'all' is not a valid device_id)
              capabilities: [gpu]

volumes:
  ollama:
```
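For reference, here is a minimal sketch of how such a concurrency test can be driven from the client side, assuming a model (here `llama3`, as a placeholder for whatever is already pulled into the container) and using only the Python standard library against Ollama's `/api/generate` endpoint:

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"  # assumption: substitute any model already pulled into the container

def generate(prompt: str) -> float:
    """Send one non-streaming generate request and return its wall-clock time."""
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = Request(OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urlopen(req) as resp:
        json.loads(resp.read())
    return time.perf_counter() - start

prompts = [f"Explain topic {i} in one paragraph." for i in range(4)]

# 4 workers to match OLLAMA_NUM_PARALLEL=4 in the compose file above
with ThreadPoolExecutor(max_workers=4) as pool:
    for prompt, elapsed in zip(prompts, pool.map(generate, prompts)):
        print(f"{elapsed:6.2f}s  {prompt}")
```

If the per-request times printed here are close to what a single request takes on its own, the parallel slots are genuinely sharing the GPU rather than queueing behind each other.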
Closing this as we're settling on Ollama for now, to allow multi-tenant inference.