michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving text embeddings, reranking models, and CLIP.
https://michaelfeil.github.io/infinity/
MIT License

release gpu memory #324

Open · Myson850 opened this issue 1 month ago

Myson850 commented 1 month ago

Feature request

Release GPU memory after a certain number of calls.

Motivation

After setting --batch-size of the embedding model to 100, I sent a request with a batch of 80 inputs, which succeeded. I then sent many requests with a batch of 10 inputs, but the GPU memory that had been allocated was never released or reduced.
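
For context on this observation: what nvidia-smi reports is PyTorch's reserved (cached) memory, which the caching allocator keeps after a large batch even when the tensors themselves have been freed, so it looks "occupied" without being leaked. A minimal, self-contained sketch of that behavior (assuming a CUDA-enabled PyTorch install; the tensor sizes are illustrative and not part of Infinity itself):

```python
import torch

assert torch.cuda.is_available()

def report(tag: str) -> None:
    # allocated = memory backing live tensors
    # reserved  = memory the caching allocator holds on to (what nvidia-smi shows)
    alloc = torch.cuda.memory_allocated() / 2**20
    resv = torch.cuda.memory_reserved() / 2**20
    print(f"{tag:<20} allocated={alloc:8.1f} MiB  reserved={resv:8.1f} MiB")

report("start")

# Simulate one large batch (e.g. 80 inputs) ...
big = torch.empty(80, 1024, 1024, device="cuda")
report("after large batch")

# ... followed by a much smaller batch (e.g. 10 inputs).
del big
small = torch.empty(10, 1024, 1024, device="cuda")
report("after small batch")  # reserved stays high: the cache is kept for reuse

# The cache can be returned to the driver explicitly, at a performance cost.
del small
torch.cuda.empty_cache()
report("after empty_cache")
```

In other words, subsequent small batches reuse the cached blocks rather than allocating new ones; the memory is held for reuse, not lost.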

Your contribution

.

michaelfeil commented 1 month ago

Set batch_size to e.g. 32 (a multiple of 8 and a power of 2 is encouraged) for better utilization. Sorry, but your request does not make much sense. You also have computation graphs from torch.compile that make the proposed feature very unattractive.
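
To illustrate why releasing the cache between calls would be unattractive: after torch.cuda.empty_cache() the next request has to go back through cudaMalloc, which is noticeably slower than reusing cached blocks, and graphs captured via torch.compile may hold on to pooled memory of their own. A rough timing sketch under those assumptions (generic PyTorch, not Infinity code; absolute numbers will vary by GPU):

```python
import time
import torch

assert torch.cuda.is_available()

def timed_alloc(n: int) -> float:
    # Time a single allocation + kernel, synchronizing so the measurement
    # is not hidden by CUDA's asynchronous execution.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    x = torch.randn(n, 1024, device="cuda")
    torch.cuda.synchronize()
    del x
    return time.perf_counter() - t0

timed_alloc(4096)                 # warm up the caching allocator

cached = timed_alloc(4096)        # served from the allocator's cache

torch.cuda.empty_cache()          # return cached blocks to the driver
uncached = timed_alloc(4096)      # forces a fresh cudaMalloc

print(f"cached alloc:   {cached * 1e3:.3f} ms")
print(f"uncached alloc: {uncached * 1e3:.3f} ms")
```

Doing this periodically in the serving loop would trade steady-state latency for GPU memory that would be re-reserved on the very next large batch anyway.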