huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour

CPU Image: High memory usage on startup #303

Open freinold opened 3 months ago

freinold commented 3 months ago

System Info

Image: v1.2 CPU
Model used: jinaai/jina-embeddings-v2-base-de
Deployment: Docker / RH OpenShift


Reproduction

  1. Run the CPU image with the following compose.yaml:
     version: '3.8'
     name: test-tei
     services:
       tei:
         image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2
         command: ["--tokenization-workers", "1"]
         environment:
           MODEL_ID: "jinaai/jina-embeddings-v2-base-de"
           REVISION: "5078d9924a7b3bdd9556928fcfc08b8de041bfc1"
           MAX_CLIENT_BATCH_SIZE: 64
         volumes:
           - ./tei-docker-data:/data
         ports:
           - "8081:80"
  2. Monitor memory usage (e.g. via Docker Desktop).
  3. After the model has downloaded and the log line INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:121: Starting JinaBertModel model on Cpu appears, memory spikes to >8 GB for a second.
  4. After startup, memory usage drops below 4 GB and stays there (a memory-limit sketch for reproducing the resulting failure follows this list).
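
A minimal sketch of how the spike can be turned into a hard failure locally, assuming a Docker Compose version that honors the mem_limit field; the 4g value only mirrors the pod limit of the cluster described below, it is not a recommendation:

     services:
       tei:
         # ...same service definition as in step 1...
         # Hard memory cap mirroring the 4 GB pod limit; with it in place the
         # container should be OOM-killed during the JinaBertModel load spike.
         mem_limit: 4g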

Expected behavior

The container should not produce large memory spikes during model load that can cause resource errors. Otherwise, Kubernetes deployments may need to provision double the memory actually needed for inference for each container, leading to a large amount of unused memory capacity.

I tried to deploy this to an RH OpenShift cluster with a hard pod memory limit of 4 GB and failed because of this, even though after startup the container never needs more than 4 GB of memory for handling requests and inference; it only exceeds that limit during startup.
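
For illustration, a hypothetical fragment of the Kubernetes/OpenShift container spec showing the mismatch (names and values are examples, not taken from the actual deployment):

     containers:
       - name: tei                # hypothetical container name
         image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2
         resources:
           requests:
             memory: "4Gi"        # enough for steady-state inference
           limits:
             memory: "4Gi"        # hard cluster limit; the ~8 GB load spike gets the pod OOM-killed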

freinold commented 3 months ago

This is probably related to the implementation of JinaBert. When trying a model with another architecture, such as intfloat/multilingual-e5-large, I don't get this behavior.
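
For reference, a sketch of that comparison using the same compose file, assuming only the environment block is changed (no revision pin for the e5 model):

     environment:
       MODEL_ID: "intfloat/multilingual-e5-large"   # different architecture, no startup spike observed
       MAX_CLIENT_BATCH_SIZE: 64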