novak2000 opened 9 months ago
@novak2000 I'm having this issue too. I tried it on my laptop on CPU, and when I use Embed4all or SentenceTransformer, both show a large increase in memory after each request. Kindly let me know if you found any solution.
It seems it's linked to an issue with Hyper: https://github.com/hyperium/hyper/issues/1790
I seem to still be running into this issue of steadily growing memory usage by the text-embeddings-router process. This is with TEI 1.1.0, but I also tested 1.0.0 with the same results.
Running with docker:
docker run --name tei --gpus all -e CUDA_MEMORY_FRACTION=1.0 -p 8081:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.1.0 --model-id $model --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls
model is: BAAI/bge-small-en-v1.5
Do you have a graph of the memory increase? And if you have v1.0.0 vs v1.1.0 that would be amazing.
I don't have a pretty graph, but here are three ps outputs over the past 24 hrs. The first is from just after starting the Docker image, the second is from not long after, and the third is from a minute ago, where you can see the memory percentage has grown to 8.2% of the server's RAM from the first output's 3.6%.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 3341091 12.0 3.6 54668960 583220 ? Ssl 18:48 0:01 text-embeddings-router --model-id BAAI/bge-small-en-v1.5 --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls
root 3341091 2.5 3.9 54762532 638616 ? Ssl 18:48 0:51 text-embeddings-router --model-id BAAI/bge-small-en-v1.5 --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls
root 3341091 65.2 8.2 55811112 1338148 ? Ssl Mar06 594:48 text-embeddings-router --model-id BAAI/bge-small-en-v1.5 --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls
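If a proper graph would help, here is a minimal sketch of a sampler I can leave running alongside the router — it assumes psutil is installed and that the PID of text-embeddings-router is passed as the first argument (file name, interval, and output path are just placeholders):

```python
# sample_rss.py -- log the router's resident memory over time so it can be graphed.
# Sketch only: assumes `pip install psutil` and that the PID of
# text-embeddings-router is passed as the first command-line argument.
import csv
import sys
import time

import psutil

pid = int(sys.argv[1])
proc = psutil.Process(pid)

with open("tei_rss.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "rss_mb"])
    while True:
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        writer.writerow([int(time.time()), round(rss_mb, 1)])
        f.flush()
        time.sleep(60)  # one sample per minute
```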
I'm embedding 1 million vectors with the default config without any issue; maybe the workers cause the leak?
BTW, I forgot to mention: regarding 1.0.0 vs 1.1.0, I tried both, and they behaved similarly with regard to the growing memory use.
The worker/client program in my case is on a separate server, and the embedding throughput it drives is in the range of 5 to 10 million requests to the TEI server per 24 hrs.
OK, I will keep this on my priority list, but it must be very deep in the stack and might take some time to find.
The worker/client program in my case is on a separate server, and the embedding throughput it drives is in the range of 5 to 10 million requests to the TEI server per 24 hrs.
That's great :) It's always nice to hear that the project is running in prod with some real throughput requirements.
OK, if you need any other details, let me know. The instance is on AWS, a g5.xlarge (1 NVIDIA A10G GPU), using the AMI:
Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04) 20231103 id: ami-0ac1f653c5b6af751
The GPU is being shared: about 90% of it goes to a separate vLLM text-generation server, and the other 10% gets used by TEI.
Just to mention that I'm also running into the same issue again. I'm using version 1.0.
@novak2000 can you use 1.1 and keep the memory resource limit? I'm wondering whether or not the container will be killed on 1.1.
I'm sending you docker stats before and after running a simple test, with around 25k requests to the server (each request has between 100 and 1,000 texts to embed and ~1,000 texts to rerank).
Models used:
reranker: BAAI/bge-reranker-base
embedding: sentence-transformers/multi-qa-MiniLM-L6-cos-v1
before:
after ~10k requests (it looked like they were running stably just beneath the memory limit):
after ~20k requests, the embedding server got killed and restarted on failure:
Let me know if you need more details
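For reference, the test client is roughly the following — a simplified, single-threaded sketch (ports, batch sizes, and texts are placeholders; the payloads follow TEI's /embed and /rerank request format):

```python
# Simplified sketch of the test: repeatedly hit /embed and /rerank.
# Assumes the embedding container listens on localhost:8081 and the reranker
# on localhost:8082 -- adjust to your setup.  Requires `pip install requests`.
import requests

EMBED_URL = "http://localhost:8081/embed"
RERANK_URL = "http://localhost:8082/rerank"

embed_texts = ["some document text"] * 500        # 100-1,000 texts per request in the real test
rerank_texts = ["some candidate passage"] * 1000  # ~1,000 texts to rerank per request

for i in range(25_000):
    # Embedding request: TEI expects {"inputs": [...]}
    r = requests.post(EMBED_URL, json={"inputs": embed_texts}, timeout=60)
    r.raise_for_status()

    # Rerank request: TEI expects {"query": ..., "texts": [...]}
    r = requests.post(RERANK_URL, json={"query": "example query", "texts": rerank_texts}, timeout=60)
    r.raise_for_status()

    if i % 1000 == 0:
        print(f"{i} request pairs sent")
```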
I ran the tests again, and this time both services were killed
graph of memory consumption:
OK, thanks for this info. I'm more or less off this week, so I will keep digging when I find the time.
Any news on this? I'm running into the same issue, and it's not usable in production.
Yes, it seems there was a leak in one of our dependencies. This is orthogonal to the allocator problem reported above. We updated the dependency and added logic to trim OS pages in #307.
See: https://www.algolia.com/blog/engineering/when-allocators-are-hoarding-your-precious-memory/ for more info on the subject.
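For anyone unfamiliar with the subject, here is a minimal sketch of what "trimming OS pages" means on glibc — not the actual TEI implementation (that lives in #307), just the underlying idea of asking the allocator to hand freed pages back to the kernel:

```python
# Sketch of the idea only (glibc/Linux): the allocator often keeps freed memory
# around instead of returning it to the kernel, so RSS stays high even after
# objects are dropped.  malloc_trim(0) asks it to release those pages.
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
released = libc.malloc_trim(0)  # returns 1 if memory was released back to the OS
print("memory released back to the OS:", bool(released))
```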
I will release 1.3 with this PR today. Will you be able to test it and report if the problem is indeed fixed?
I can try it today if you want, but I don't see the 1.3 release at the moment.
It's released now.
@djanito, were you able to try it out?
System Info
I'm running a Docker container to serve the BAAI rerank-base model on a local PC with an RTX 4090, an Intel i9-13900KF, and 64GB of RAM.
Reproduction
After calling the '/rerank' endpoint many times (around 400,000 requests with 5,000 texts each), RAM usage increases significantly (from 6GB to 42+GB). Memory usage before and after:
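Roughly, the reproduction and the measurement look like the following — a scaled-down sketch in which the container name, port, request count, and texts are placeholders for my actual setup:

```python
# Scaled-down sketch of the reproduction: hammer /rerank and compare the
# container's memory before and after.  Assumes the TEI container is named
# "tei" and listens on localhost:8080 -- both are placeholders.
import subprocess

import requests

def container_mem(name: str = "tei") -> str:
    """Read the container's current memory usage via `docker stats`."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", name],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

print("before:", container_mem())

texts = ["some passage to score"] * 5000
for _ in range(10_000):  # the real run was around 400,000 requests
    r = requests.post(
        "http://localhost:8080/rerank",
        json={"query": "example query", "texts": texts},
        timeout=120,
    )
    r.raise_for_status()

print("after:", container_mem())
```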
Expected behavior
Is this behavior expected? Since I'm unfamiliar with Rust and its basic concepts, any feedback would be helpful. Thanks!