jianyangg / local-llm

DSTA Internship Project

LLM Batch Processing #13

Closed · hanchingyong closed this issue 1 month ago

hanchingyong commented 3 months ago

To allow multi-tenant inference.

jianyangg commented 3 months ago

I've just tested Ollama's concurrency feature using the following docker-compose.yml file. Answer generation speed stayed the same even when the single Ollama instance was serving 4 requests in parallel. Memory usage was as follows: GPU memory was around 6.3 GB and CPU memory around 34.5 GB. Will be exploring vLLM next.

version: '3.8'

services:
  ollama:
    image: ollama/ollama
    expose:
      - 11434/tcp
    ports:
      - 11434:11434/tcp
    healthcheck:
      test: ollama --version || exit 1
    command: serve
    volumes:
      - ollama:/root/.ollama
    environment:
      # Number of requests each loaded model serves in parallel.
      OLLAMA_NUM_PARALLEL: "4"
      # Maximum number of models kept loaded in memory at once.
      OLLAMA_MAX_LOADED_MODELS: "4"
    deploy:
      resources:
        reservations:
          devices:
            # Reserve all available NVIDIA GPUs for the container.
            - driver: nvidia
              device_ids: ['all']
              capabilities: [gpu]

volumes:
  ollama:
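
For reference, a concurrency check along these lines can be done with a small Python script that fires several requests at Ollama's /api/generate endpoint at once and reports per-request latency. This is only a sketch of the kind of test described above: the model name ("llama3"), the prompts, and the localhost URL are placeholders, not necessarily what was used in the test.

# Minimal concurrency check against the Ollama API.
# Assumptions: Ollama is reachable on localhost:11434 and a model named
# "llama3" has already been pulled; adjust MODEL and OLLAMA_URL as needed.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"      # placeholder model name
NUM_CLIENTS = 4       # matches OLLAMA_NUM_PARALLEL in the compose file

def ask(prompt: str) -> float:
    """Send one non-streaming generate request and return its wall-clock latency."""
    start = time.perf_counter()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

if __name__ == "__main__":
    prompts = [f"Summarise topic #{i} in one sentence." for i in range(NUM_CLIENTS)]
    with ThreadPoolExecutor(max_workers=NUM_CLIENTS) as pool:
        latencies = list(pool.map(ask, prompts))
    for i, t in enumerate(latencies):
        print(f"request {i}: {t:.1f}s")

With the compose stack up, roughly equal latencies across the four concurrent requests would correspond to the behaviour reported above.
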
jianyangg commented 1 month ago

Closing this issue as we're settling on Ollama for now.