huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

Dockerized text-embeddings-inference:cpu-1.0 /embed endpoint issue #305

Open · liltimtim opened this issue 1 week ago

liltimtim commented 1 week ago

System Info

Sample Docker Compose File

services:
  embedding:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    platform: linux/amd64
    volumes:
      - embed_data:/data
    command: --model-id BAAI/bge-small-en-v1.5
    ports:
      - "8080:80"

volumes:
  embed_data:

When hitting the /embed endpoint repeatedly with the following payload:

{
    "inputs": "This is a test",
    "normalize": true, 
    "truncate": false
}
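
A minimal sketch of the request loop (illustrative only, not the exact client used; Python standard library, assuming the localhost:8080 mapping from the compose file above):

import json
import urllib.request

PAYLOAD = json.dumps({
    "inputs": "This is a test",
    "normalize": True,
    "truncate": False,
}).encode("utf-8")

for i in range(100_000):
    req = urllib.request.Request(
        "http://localhost:8080/embed",
        data=PAYLOAD,
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=5).read()
    except OSError as e:
        # once the container halts, requests fail or time out here
        print(f"request {i} failed: {e}")
        break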

Repeating this request sometimes surfaces the panic shown further below, but it always ends with the container halting. The log output leading up to the hang:

2024-06-25 14:23:59 2024-06-25T19:23:59.648152Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:37: Model artifacts downloaded in 727.625µs
2024-06-25 14:23:59 2024-06-25T19:23:59.677067Z  INFO text_embeddings_router: router/src/lib.rs:204: Maximum number of tokens per request: 512
2024-06-25 14:23:59 2024-06-25T19:23:59.677539Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:23: Starting 8 tokenization workers
2024-06-25 14:23:59 2024-06-25T19:23:59.693679Z  INFO text_embeddings_router: router/src/lib.rs:229: Starting model backend
2024-06-25 14:23:59 2024-06-25T19:23:59.700277Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:115: Starting Bert model on Cpu
2024-06-25 14:23:59 2024-06-25T19:23:59.825617Z  WARN text_embeddings_router: router/src/lib.rs:283: Invalid hostname, defaulting to 0.0.0.0
2024-06-25 14:23:59 2024-06-25T19:23:59.828195Z  INFO text_embeddings_router: router/src/lib.rs:301: Ready
2024-06-25 14:26:00 2024-06-25T19:26:00.573739Z  INFO embed{total_time="47.706166ms" tokenization_time="16.247542ms" queue_time="3.980584ms" inference_time="23.678708ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:01 2024-06-25T19:26:01.020603Z  INFO embed{total_time="32.472708ms" tokenization_time="1.257125ms" queue_time="700.666µs" inference_time="30.334792ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:01 2024-06-25T19:26:01.532395Z  INFO embed{total_time="40.305334ms" tokenization_time="439.709µs" queue_time="552.875µs" inference_time="39.149292ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:02 2024-06-25T19:26:02.476522Z  INFO embed{total_time="32.864458ms" tokenization_time="594.542µs" queue_time="651.625µs" inference_time="31.527375ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:02 2024-06-25T19:26:02.991896Z  INFO embed{total_time="28.791833ms" tokenization_time="991.333µs" queue_time="624.334µs" inference_time="27.054083ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:03 2024-06-25T19:26:03.455926Z  INFO embed{total_time="39.955042ms" tokenization_time="1.131834ms" queue_time="1.376333ms" inference_time="37.33325ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:03 2024-06-25T19:26:03.889828Z  INFO embed{total_time="31.271209ms" tokenization_time="442.542µs" queue_time="469.625µs" inference_time="30.199875ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:04 2024-06-25T19:26:04.356643Z  INFO embed{total_time="30.665667ms" tokenization_time="477.958µs" queue_time="631.541µs" inference_time="29.433459ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:04 2024-06-25T19:26:04.803709Z  INFO embed{total_time="29.823917ms" tokenization_time="346.958µs" queue_time="485.416µs" inference_time="28.728042ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:05 2024-06-25T19:26:05.236581Z  INFO embed{total_time="32.747833ms" tokenization_time="745.5µs" queue_time="760.126µs" inference_time="31.091416ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:05 2024-06-25T19:26:05.676689Z  INFO embed{total_time="42.911125ms" tokenization_time="955.167µs" queue_time="761µs" inference_time="41.032292ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:06 2024-06-25T19:26:06.094651Z  INFO embed{total_time="37.192875ms" tokenization_time="949.584µs" queue_time="539.5µs" inference_time="35.582292ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success
2024-06-25 14:26:06 2024-06-25T19:26:06.391581Z  INFO embed{total_time="30.105625ms" tokenization_time="479.958µs" queue_time="765.043µs" inference_time="28.770541ms"}: text_embeddings_router::http::server: router/src/http/server.rs:587: Success

Occasionally the following panic appears as well (although it is harder to trigger):

2024-06-25 14:19:46 thread '<unnamed>' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-core-1.12.1/src/job.rs:102:32:
2024-06-25 14:19:46 called `Option::unwrap()` on a `None` value
2024-06-25 14:19:46 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Why does sending many /embed requests in rapid succession consistently crash the container?

Information

Tasks

Reproduction

Step 1: Start the container with the compose file above and send a POST request to /embed.

Sample body

{
    "inputs": "This is a test",
    "normalize": true, 
    "truncate": false
}

Step 2: Repeat the request rapidly (for example with the loop sketched above) until the container hangs or crashes.

Expected behavior

The server should not crash or hang when receiving rapid requests.

ErikKaum commented 5 days ago

Hi @liltimtim 👋

Unfortunately, I wasn't able to reproduce this on my machine. When sending 5000 requests, roughly one every 1ms, I get "error":"Model is overloaded" but not the panic.
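
For anyone who wants to retry this, a sketch of that kind of load test (a hypothetical client assuming the same localhost:8080 setup, not the exact script used):

import json
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

PAYLOAD = json.dumps({
    "inputs": "This is a test",
    "normalize": True,
    "truncate": False,
}).encode("utf-8")

def one_request(_):
    req = urllib.request.Request(
        "http://localhost:8080/embed",
        data=PAYLOAD,
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=10).read()
        return "ok"
    except urllib.error.HTTPError as e:
        return f"http {e.code}"  # the "Model is overloaded" responses land here
    except OSError:
        return "connection error"  # container halted / unreachable

with ThreadPoolExecutor(max_workers=64) as pool:
    print(Counter(pool.map(one_request, range(5000))))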

Have you gotten this error on later versions as well? Like text-embeddings-inference:cpu-1.3?