jianyangg / local-llm

DSTA Internship Project

LLM Batch Processing #13

Closed · hanchingyong closed this issue 1 month ago

hanchingyong commented 3 months ago

To allow multi-tenant inference.

jianyangg commented 3 months ago

I've just tested Ollama's concurrency feature using the following docker-compose.yml file. Answer generation speed stayed the same even when the single Ollama instance was serving 4 requests in parallel. Memory usage was as follows: GPU memory was around 6.3 GB and CPU memory around 34.5 GB. Will be exploring vLLM next.

version: '3.8'

services:
  ollama:
    image: ollama/ollama
    expose:
      - 11434/tcp
    ports:
      - 11434:11434/tcp
    healthcheck:
      test: ollama --version || exit 1
    command: serve
    volumes:
      - ollama:/root/.ollama
    environment:
      # Number of requests each loaded model serves in parallel.
      OLLAMA_NUM_PARALLEL: "4"
      # Maximum number of models kept loaded in memory at once.
      OLLAMA_MAX_LOADED_MODELS: "4"
    deploy:
      resources:
        reservations:
          devices:
            # Reserve all available NVIDIA GPUs for the container.
            - driver: nvidia
              device_ids: ['all']
              capabilities: [gpu]

volumes:
  ollama:
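
For reference, a concurrency check along these lines can be done with a small Python script that fires several requests at Ollama's /api/generate endpoint at once and reports per-request latency. This is only a sketch of the kind of test described above: the model name ("llama3"), the prompts, and the localhost URL are placeholders, not necessarily what was used in the test.

# Minimal concurrency check against the Ollama API.
# Assumptions: Ollama is reachable on localhost:11434 and a model named
# "llama3" has already been pulled; adjust MODEL and OLLAMA_URL as needed.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"      # placeholder model name
NUM_CLIENTS = 4       # matches OLLAMA_NUM_PARALLEL in the compose file

def ask(prompt: str) -> float:
    """Send one non-streaming generate request and return its wall-clock latency."""
    start = time.perf_counter()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

if __name__ == "__main__":
    prompts = [f"Summarise topic #{i} in one sentence." for i in range(NUM_CLIENTS)]
    with ThreadPoolExecutor(max_workers=NUM_CLIENTS) as pool:
        latencies = list(pool.map(ask, prompts))
    for i, t in enumerate(latencies):
        print(f"request {i}: {t:.1f}s")

With the compose stack up, roughly equal latencies across the four concurrent requests would correspond to the behaviour reported above.
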
jianyangg commented 1 month ago

Closing this issue as we're settling on Ollama for now.