theroyallab / tabbyAPI

An OAI compatible exllamav2 API that's both lightweight and fast
GNU Affero General Public License v3.0

[REQUEST] Automatic Model Unloading while idling #216

Open TetrisBlack opened 2 days ago

TetrisBlack commented 2 days ago

Problem

The model stays loaded in VRAM even after a long period of idling. This increases the GPU's power draw and occupies VRAM that other programs could use.

Solution

Add an option to unload the model automatically when idle. Possible options:

IDLE_UNLOAD = true (enables or disables the feature)
IDLE_TIME = 5m (sets the idle timeout)

In this example, the model is unloaded after 5 minutes without a request. When a new request arrives while the model is unloaded, the model should be loaded into VRAM again and the 5-minute timer should restart.
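To make the requested behaviour concrete, here is a minimal sketch of the idle timer logic in Python. It is not tabbyAPI code: the class `IdleUnloader`, its methods, and the `load` / `unload` callbacks are hypothetical names invented for illustration, and how the server would actually load or free a model is assumed to be supplied by the caller.

```python
import time
from typing import Callable


class IdleUnloader:
    """Hypothetical helper: unloads a model after a period without requests."""

    def __init__(
        self,
        load: Callable[[], None],      # assumed callback that loads the model
        unload: Callable[[], None],    # assumed callback that frees the model
        idle_seconds: float = 300.0,   # plays the role of IDLE_TIME = 5m
        enabled: bool = True,          # plays the role of IDLE_UNLOAD
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._unload = unload
        self._idle_seconds = idle_seconds
        self._enabled = enabled
        self._clock = clock
        self._loaded = False
        self._last_request = clock()

    def on_request(self) -> None:
        """Call at the start of every request: reload if needed, reset the timer."""
        if not self._loaded:
            self._load()
            self._loaded = True
        self._last_request = self._clock()

    def tick(self) -> bool:
        """Call periodically; unloads the model once the idle timeout has elapsed.

        Returns True if the model was unloaded by this call.
        """
        if not self._enabled or not self._loaded:
            return False
        if self._clock() - self._last_request >= self._idle_seconds:
            self._unload()
            self._loaded = False
            return True
        return False
```

In a server, `on_request()` would be called from the request path and `tick()` from a background task that wakes every few seconds. Setting `enabled=False` keeps the model resident, which covers the case where reloading a large model is too slow to be acceptable.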

Alternatives

No response

Explanation

This would reduce the electricity bill for 24/7 operation :)

Examples

The keep_alive parameter in Ollama: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion
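For reference, a sketch of how `keep_alive` is passed to Ollama's generate endpoint. The request is only built here, not sent. The helper name `build_generate_request` and the model name `llama3` are placeholders, and the host and port assume a default local Ollama install.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "llama3",       # placeholder model name
    keep_alive: str = "5m",      # how long Ollama keeps the model loaded afterwards
    base_url: str = "http://localhost:11434",  # assumes a default local install
) -> urllib.request.Request:
    """Build (but do not send) a POST request for Ollama's /api/generate."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it to a running Ollama server. The per-request timeout is the behaviour this issue asks tabbyAPI to mirror with a server-side setting.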

Additional context

No response

atisharma commented 2 days ago

I hope it would be optional if implemented. Loading Mistral Large takes a long time.

SecretiveShell commented 1 day ago

This would require inline model loading to be enabled in the config.