michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving text-embeddings, reranking models and clip
https://michaelfeil.github.io/infinity/
MIT License

Memory allocation error with Alibaba-NLP/gte-multilingual-reranker-base #370

Open John42506176Linux opened 8 hours ago

John42506176Linux commented 8 hours ago

System Info

Command used:

```shell
port=7997
gte_rerank_model=Alibaba-NLP/gte-multilingual-reranker-base
volume=$PWD/data

sudo docker run -it --gpus all \
  -v $volume:/app/.cache \
  -p $port:$port \
  michaelf34/infinity:latest \
  v2 \
  --batch-size 32 \
  --model-id $gte_rerank_model \
  --port $port
```

Device: AWS EC2 G4DN
Model: Alibaba-NLP/gte-multilingual-reranker-base
NVIDIA Information: NVIDIA-SMI 535.183.01 | Driver Version: 535.183.01 | CUDA Version: 12.2


Reproduction

1. Run the docker container using the command above.
2. Feed > 1K documents to the reranker (in batches).
3. See the following error:

```
ERROR 2024-09-19 16:13:18,115 infinity_emb ERROR: CUDA out of memory. batch_handler.py:47 Tried to allocate 1.49 GiB. GPU 0 has a total capacity of 14.58 GiB of which 597.62 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 3.94 GiB is allocated by PyTorch, and 953.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting
```

Additional info: Using the model through AnswerDotAI/rerankers does not have this issue.

Expected behavior

Smooth reranking.
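As a client-side workaround while the OOM is investigated, documents can be sent to the server in smaller chunks so it never scores thousands of pairs in one request. This is a minimal sketch, assuming Infinity exposes a `POST /rerank` endpoint taking `query`, `documents`, and `model` and returning `results` with `relevance_score` fields; field names and the endpoint shape are assumptions, not confirmed from this thread.

```python
import json
import urllib.request

def chunked(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def rerank_in_chunks(query, docs,
                     url="http://localhost:7997/rerank",
                     model="Alibaba-NLP/gte-multilingual-reranker-base",
                     chunk_size=128):
    """Score docs against query by POSTing small chunks to the server."""
    scores = []
    for chunk in chunked(docs, chunk_size):
        payload = json.dumps(
            {"query": query, "documents": chunk, "model": model}
        ).encode()
        req = urllib.request.Request(
            url, data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        # Assumed response shape: {"results": [{"relevance_score": ...}, ...]}
        scores.extend(r["relevance_score"] for r in body["results"])
    return scores
```

Smaller chunks trade a little request overhead for a bounded per-request memory footprint on the GPU.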

michaelfeil commented 5 hours ago

@John42506176Linux Can you try it on an L4 GPU, or lower the batch size?

Your instance has a T4 with 16 GB of VRAM, right?

I am not sure how well the custom modeling code runs on it. The T4 might not have enough on-chip SRAM to execute this piece of code efficiently: https://huggingface.co/Alibaba-NLP/new-impl/blob/40ced75c3017eb27626c9d4ea981bde21a2662f4/modeling.py#L579
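To test the lower-batch-size suggestion, the original container invocation can be rerun with a smaller `--batch-size`; the value 8 here is just an example, not a recommended setting:

```shell
sudo docker run -it --gpus all \
  -v $PWD/data:/app/.cache \
  -p 7997:7997 \
  michaelf34/infinity:latest \
  v2 \
  --batch-size 8 \
  --model-id Alibaba-NLP/gte-multilingual-reranker-base \
  --port 7997
```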

John42506176Linux commented 5 hours ago

I'll check it and get back to you, but I'm curious why this would occur with infinity, but not with the reranker module.

michaelfeil commented 2 hours ago

Which reranker module? What's your batch size?

John42506176Linux commented 2 hours ago

https://github.com/AnswerDotAI/rerankers?tab=readme-ov-file

They use a default of 16, but I used 32 for equivalency's sake. Here's the code:

```python
gte_ranker = Reranker(
    "Alibaba-NLP/gte-multilingual-reranker-base",
    model_type="cross-encoder",
    batch_size=32,
)
gte_results = gte_ranker.rank(query=query, docs=docs, doc_ids=ids)
```

@michaelfeil