John42506176Linux opened this issue 8 hours ago
@John42506176Linux Can you try it on an L4 GPU, or lower the batch size?
Your instance is a T4 with 16 GB of VRAM, right?
I am not sure how well it runs the custom modeling code. The T4 might not have enough on-chip SRAM to execute this piece of code efficiently: https://huggingface.co/Alibaba-NLP/new-impl/blob/40ced75c3017eb27626c9d4ea981bde21a2662f4/modeling.py#L579
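To illustrate the hardware gap being discussed: many fused / memory-efficient attention kernels require compute capability 8.0 (Ampere) or newer, and the T4 is SM 7.5 (Turing) while the suggested L4 is SM 8.9 (Ada). The sketch below is illustrative only — the capability table and the 8.0 cutoff are assumptions, not something confirmed in this thread:

```python
# Hypothetical check, NOT from the linked modeling code: if the custom
# attention path needs Ampere-or-newer kernels, a T4 would fall back to a
# slower unfused path while an L4 would not.
KNOWN_GPUS = {
    "T4": (7, 5),    # Turing
    "L4": (8, 9),    # Ada Lovelace
    "A10G": (8, 6),  # Ampere
}

def supports_fused_attention(gpu_name: str, min_cc: tuple = (8, 0)) -> bool:
    """Return True if the GPU's compute capability meets the assumed minimum."""
    return KNOWN_GPUS.get(gpu_name, (0, 0)) >= min_cc

print(supports_fused_attention("T4"))  # False
print(supports_fused_attention("L4"))  # True
```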
I'll check it and get back to you, but I'm curious why this would occur with Infinity but not with the reranker module.
Which reranker module? What's your batch size?
https://github.com/AnswerDotAI/rerankers — they use a default of 16, but I used 32 for equivalency's sake. Here's the code:

```python
gte_ranker = Reranker(
    "Alibaba-NLP/gte-multilingual-reranker-base",
    model_type="cross-encoder",
    batch_size=32,
)
gte_results = gte_ranker.rank(query=query, docs=docs, doc_ids=ids)
```

@michaelfeil
System Info
Command used:

```shell
port=7997
gte_rerank_model=Alibaba-NLP/gte-multilingual-reranker-base
volume=$PWD/data

sudo docker run -it --gpus all \
  -v $volume:/app/.cache \
  -p $port:$port \
  michaelf34/infinity:latest \
  v2 \
  --batch-size 32 \
  --model-id $gte_rerank_model \
  --port $port
```
Device: AWS EC2 G4DN
Model: Alibaba-NLP/gte-multilingual-reranker-base
Nvidia information: NVIDIA-SMI 535.183.01, Driver Version: 535.183.01, CUDA Version: 12.2
Information
Tasks
Reproduction
1. Run the Docker container using the command above.
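A second reproduction step could query the running server. This is a sketch assuming the server exposes a POST `/rerank` route accepting `model`, `query`, and `documents` fields — check the server's generated `/docs` page for the actual schema before relying on it:

```python
import json

def build_rerank_request(query, documents, model):
    """Build the JSON body for a hypothetical POST /rerank call."""
    return json.dumps({"model": model, "query": query, "documents": documents})

body = build_rerank_request(
    "what does infinity serve?",
    ["Infinity serves embeddings and rerankers.", "Docker is a container runtime."],
    "Alibaba-NLP/gte-multilingual-reranker-base",
)
print(body)
# send with e.g. requests.post(f"http://localhost:7997/rerank", data=body,
#                              headers={"Content-Type": "application/json"})
```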
Expected behavior
Smooth reranking.