In my project I currently 'guess' the memory that the model + context uses with some napkin math (see screenshot). But I was thinking: would it be possible to get more accurate information about memory use? Perhaps all the debug info could be made available for parsing via a callback, or even as an object that is returned/updated after loading/inference?
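For reference, my napkin math looks roughly like the sketch below (Python, with illustrative parameter names; the per-element KV size, the compute-buffer fudge factor, and reading the GQA head count from metadata are all assumptions on my side, not anything the library reports):

```python
import os

def estimate_memory_bytes(model_path: str, n_ctx: int, n_layers: int,
                          n_embd: int, n_head: int, n_head_kv: int,
                          kv_bytes_per_element: float = 2.0) -> int:
    """Rough napkin estimate: weights ~ file size, plus KV cache, plus a fudge factor."""
    weights = os.path.getsize(model_path)            # mmap'd weights ~ file size on disk
    head_dim = n_embd // n_head                      # per-head dimension
    kv_per_token = 2 * n_layers * n_head_kv * head_dim * kv_bytes_per_element  # K and V
    kv_cache = int(n_ctx * kv_per_token)             # cache for every context position
    overhead = 256 * 1024 * 1024                     # guessed compute/scratch buffers
    return weights + kv_cache + overhead
```

For a 7B model with 32 layers, n_embd = 4096 and no GQA, that works out to roughly 0.5 MiB of f16 KV cache per token, i.e. about 2 GiB for a 4096-token context. It's exactly this kind of number I'd rather read back from the library than derive myself.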
This would let me determine much more accurately whether there is enough free memory to, for example, load the speech recognition and TTS processes (and other 'small stuff' like translation and OCR) without having to unload the main LLM.
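As a concrete example of what I'd do with such a figure, here is a minimal sketch of the check I have in mind, assuming a hypothetical `get_memory_usage()` on the loaded model (psutil is only used here to read available system RAM; the side-model budgets are rough numbers of my own):

```python
import psutil

ASR_BUDGET = 2_000 * 1024 * 1024   # rough sizes I'd reserve for the side models
TTS_BUDGET = 1_500 * 1024 * 1024   # (speech recognition, TTS, translation, OCR, ...)

def can_load_side_models(llm) -> bool:
    """Decide whether ASR/TTS fit next to the loaded LLM without unloading it."""
    # Hypothetical API: exact model + context memory as reported by the library,
    # replacing my napkin estimate above.
    used_by_llm = llm.get_memory_usage()["total_bytes"]
    available = psutil.virtual_memory().available
    print(f"LLM uses {used_by_llm / 2**30:.2f} GiB, "
          f"{available / 2**30:.2f} GiB still free")
    return available > ASR_BUDGET + TTS_BUDGET
```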