huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

Potential memory leak #156

Open novak2000 opened 7 months ago

novak2000 commented 7 months ago

System Info

I'm running a Docker container with the BAAI bge-reranker-base model on a local PC with an RTX 4090, an Intel i9-13900KF, and 64 GB of RAM. [screenshot: system info]

Reproduction

After calling the '/rerank' endpoint many times (around 400,000 requests with 5,000 texts each), RAM usage increases significantly, from 6 GB to 42+ GB. Memory usage before: [screenshot] After: [screenshot]
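
For reference, a minimal repro sketch along these lines (not from the original report; the port, payload shape, and counts are assumptions based on the numbers above):

```rust
// Hammer the /rerank endpoint in a loop and watch the RSS of the
// text-embeddings-router process grow. Assumes TEI listens on localhost:8080
// and accepts a {"query", "texts"} JSON body; adjust to your deployment.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let texts: Vec<String> = (0..5_000).map(|i| format!("document number {i}")).collect();

    for i in 0..400_000u64 {
        client
            .post("http://localhost:8080/rerank")
            .json(&json!({ "query": "what causes a memory leak?", "texts": texts }))
            .send()?
            .error_for_status()?;
        if i % 10_000 == 0 {
            println!("sent {i} rerank requests");
        }
    }
    Ok(())
}
```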

Expected behavior

Is this behavior expected? Since I'm unfamiliar with Rust and its basic concepts, any feedback would be helpful. Thanks!

karan00713 commented 7 months ago

@novak2000 I'm having this issue too. I tried on my laptop on CPU with Embed4All and SentenceTransformer, and both showed a large increase in memory after each request. Kindly let me know if you find any solutions.

OlivierDehaene commented 7 months ago

It seems it's linked to an issue with Hyper: https://github.com/hyperium/hyper/issues/1790

#161 solves the issue by using another memory allocator.
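
For readers unfamiliar with how that works, here is a minimal sketch of swapping the process-wide allocator in a Rust binary (shown with the mimalloc crate purely as an illustration; the actual PR may use a different allocator):

```rust
// Replace Rust's default system allocator (glibc malloc) with mimalloc for the
// whole binary. Freed memory is then managed by mimalloc, which tends to return
// pages to the OS more readily than glibc's arenas.
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Every heap allocation in the process now goes through mimalloc.
    let buffer: Vec<u8> = vec![0; 1 << 20];
    println!("allocated {} bytes via the global allocator", buffer.len());
}
```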

A-Posthuman commented 6 months ago

I still seem to be running into this issue of steadily growing memory usage in the text-embeddings-router process. This is with TEI 1.1.0, but I also tested 1.0.0 with the same results.

Running with docker:

docker run --name tei --gpus all -e CUDA_MEMORY_FRACTION=1.0 -p 8081:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.1.0 --model-id $model --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls

model is: BAAI/bge-small-en-v1.5

OlivierDehaene commented 6 months ago

Do you have a graph of the memory increase? And if you have v1.0.0 vs v1.1.0 that would be amazing.

A-Posthuman commented 6 months ago

I don't have a pretty graph, but here are three ps outputs over the past 24 hours. The first is from just after starting the Docker image, the second from not long after, and the third from a minute ago, where you can see the memory percentage has grown to 8.2% of the server's RAM from the first output's 3.6%.

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     3341091 12.0  3.6 54668960 583220 ?     Ssl  18:48   0:01 text-embeddings-router --model-id BAAI/bge-small-en-v1.5 --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls

root     3341091  2.5  3.9 54762532 638616 ?     Ssl  18:48   0:51 text-embeddings-router --model-id BAAI/bge-small-en-v1.5 --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls

root     3341091 65.2  8.2 55811112 1338148 ?    Ssl  Mar06 594:48 text-embeddings-router --model-id BAAI/bge-small-en-v1.5 --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls
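
For longer runs, a small watcher like the sketch below (not part of TEI; it assumes a Linux /proc filesystem and takes the router PID as an argument) can log VmRSS at a regular interval instead of relying on spot checks with ps:

```rust
// Poll VmRSS for a given PID from /proc/<pid>/status once a minute and print it,
// producing a simple time series of resident memory for the router process.
use std::{env, fs, thread, time::Duration};

fn main() {
    let pid: u32 = env::args()
        .nth(1)
        .expect("usage: rss_watch <pid>")
        .parse()
        .expect("pid must be a number");

    loop {
        let status = fs::read_to_string(format!("/proc/{pid}/status"))
            .expect("process no longer exists");
        if let Some(line) = status.lines().find(|l| l.starts_with("VmRSS:")) {
            println!("{line}");
        }
        thread::sleep(Duration::from_secs(60));
    }
}
```
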
hiepxanh commented 6 months ago

I'm embedding 1 million vectors with the default config without any issue. Maybe the worker causes the leak?

A-Posthuman commented 6 months ago

BTW, I forgot to mention regarding 1.0.0 vs 1.1.0: I tried both, and they behaved similarly with regard to the growing memory use.

The worker/client program in my case is on a separate server, and the embedding throughput is in the range of 5 to 10 million requests to the TEI server per 24 hours.

OlivierDehaene commented 6 months ago

OK, I will keep this on my priority list, but it must be very deep in the stack and might take some time to find.

The worker/client program in my case is on a separate server, and the embedding throughput is in the range of 5 to 10 million requests to the TEI server per 24 hours.

That's great :) It's always nice to hear that the project is running in prod with some real throughput requirements.

A-Posthuman commented 6 months ago

OK, if you need any other details, let me know. The instance is on AWS, a g5.xlarge (1 NVIDIA A10G GPU), using the AMI:

Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04) 20231103 id: ami-0ac1f653c5b6af751

The GPU is being shared: 90% of it goes to a separate vLLM text generation server, and the other 10% is used by TEI.

novak2000 commented 6 months ago

Just to mention that I'm also running into the same issue again. I'm using version 1.0. [screenshot]

OlivierDehaene commented 6 months ago

@novak2000 can you use 1.1 and keep the memory resource limit? I'm wondering whether the container will be killed on 1.1.

novak2000 commented 6 months ago

I'm sending you docker stats before and after running a simple test with around 25k requests to the server (each request has between 100 and 1,000 texts to embed and ~1,000 texts to rerank).

Models used: reranker: BAAI/bge-reranker-base; embedding: sentence-transformers/multi-qa-MiniLM-L6-cos-v1

Before: [screenshot]

After ~10k requests (the services looked to be running stably just beneath the memory limit): [screenshot]

After ~20k requests, the embedding server got killed and restarted on failure: [screenshot]


Let me know if you need more details

novak2000 commented 6 months ago

I ran the tests again, and this time both services were killed

Graph of memory consumption: [screenshot]

OlivierDehaene commented 6 months ago

OK, thanks for this info. I'm more or less off this week, so I will keep digging when I find the time.

djanito commented 3 months ago

Any news on this? I'm running into the same issue, and it's not usable in production.

OlivierDehaene commented 3 months ago

Yes, it seems that there was a leak in one of our dependencies. This is orthogonal to the allocator problem reported above. We updated the dependency and added logic to trim OS pages in #307.

See: https://www.algolia.com/blog/engineering/when-allocators-are-hoarding-your-precious-memory/ for more info on the subject.
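
For context, here is a minimal sketch of the page-trimming idea (an illustration only, not the exact code in #307): after large deallocations, glibc malloc can keep freed memory cached in its arenas, so periodically asking it to hand pages back keeps RSS closer to actual live usage.

```rust
// Ask glibc malloc to return cached free pages to the OS. Only meaningful on
// Linux with glibc; on other targets this function is a no-op.
fn trim_heap() {
    #[cfg(all(target_os = "linux", target_env = "gnu"))]
    unsafe {
        // malloc_trim returns 1 if memory was released back to the system.
        let released = libc::malloc_trim(0);
        eprintln!("malloc_trim released memory: {}", released == 1);
    }
}

fn main() {
    // Simulate a burst of allocations followed by frees, then trim the heap.
    let burst: Vec<Vec<u8>> = (0..1_000).map(|_| vec![0u8; 64 * 1024]).collect();
    drop(burst);
    trim_heap();
}
```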

I will release 1.3 with this PR today. Will you be able to test it and report if the problem is indeed fixed?

djanito commented 3 months ago

I can try it today if you want, but I don't see the 1.3 release at the moment.

OlivierDehaene commented 3 months ago

It's released now.

OlivierDehaene commented 2 months ago

@djanito, were you able to try it out?