[REQUEST] Automatic Model Unloading while idling

TetrisBlack commented 2 days ago

Problem

The model stays loaded in VRAM even after a long period of idling. This increases the GPU's power draw and takes up VRAM that other programs could use.

Solution

Add an option to unload the model automatically when idle. Possible options:

IDLE_UNLOAD = true (enables or disables the feature)
IDLE_TIME = 5m (sets the idle timeout)

In this example, the model is unloaded after 5 minutes without a request. When a new request arrives while the model is unloaded, the model should be loaded into VRAM again and the 5-minute timer should start over.
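To make the requested behaviour concrete, here is a rough sketch of the idle logic in Python. It is only an illustration, not tabbyAPI code: `IdleUnloader`, `parse_duration`, and the `load` / `unload` callbacks are hypothetical names, and the sketch assumes something (a background task, for example) calls `check_idle()` periodically.

```python
import time
from typing import Callable


def parse_duration(text: str) -> float:
    """Convert a string such as '30s', '5m', or '1h' to seconds (hypothetical helper)."""
    units = {"s": 1, "m": 60, "h": 3600}
    return float(text[:-1]) * units[text[-1]]


class IdleUnloader:
    """Hypothetical wrapper that unloads a model after a period without requests."""

    def __init__(
        self,
        load: Callable[[], None],
        unload: Callable[[], None],
        idle_time: str = "5m",       # corresponds to the proposed IDLE_TIME
        enabled: bool = True,        # corresponds to the proposed IDLE_UNLOAD
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._unload = unload
        self._clock = clock
        self.idle_seconds = parse_duration(idle_time)
        self.enabled = enabled
        self.loaded = False
        self.last_request = clock()

    def on_request(self) -> None:
        """Call at the start of every request: reload if needed, then restart the timer."""
        if not self.loaded:
            self._load()
            self.loaded = True
        # Reset after loading so a slow load does not eat into the idle window.
        self.last_request = self._clock()

    def check_idle(self) -> bool:
        """Call periodically. Unloads the model and returns True once the timeout has passed."""
        if not self.enabled or not self.loaded:
            return False
        if self._clock() - self.last_request >= self.idle_seconds:
            self._unload()
            self.loaded = False
            return True
        return False
```

With `enabled=False` the model is never unloaded, which would keep the feature opt-in.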

Alternatives

No response

Explanation

This would reduce the electricity bill for 24/7 operation :)

Examples

The keep_alive parameter in Ollama: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion
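For reference, this is roughly what a request body using Ollama's keep_alive looks like. The sketch only builds the JSON and does not send it. The model name and prompt are placeholders, and `build_generate_request` is a hypothetical helper, not part of any library.

```python
import json


def build_generate_request(model: str, prompt: str, keep_alive: str = "5m") -> dict:
    """Build a body for Ollama's POST /api/generate endpoint (hypothetical helper)."""
    return {
        "model": model,            # placeholder model name
        "prompt": prompt,
        "keep_alive": keep_alive,  # how long the model stays loaded after this request
    }


# Serialize the body; an HTTP client would POST this to a running Ollama server.
body = json.dumps(build_generate_request("llama3", "Why is the sky blue?"))
```

A similar per-request or config-level timeout is what this request has in mind.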

Additional context

No response

Acknowledgements

[X] I have looked for similar requests before submitting this one.
[X] I understand that the developers have lives and my issue will be answered when possible.
[X] I understand the developers of this program are human, and I will make my requests politely.

atisharma commented 2 days ago

I hope it would be optional if implemented. Loading Mistral Large takes a long time.

SecretiveShell commented 1 day ago

This would require inline model loading to be enabled in the config.

theroyallab / tabbyAPI