Open UTSAV-44 opened 4 days ago
I suppose you could draw filters from a pool, but you'd have to explicitly reset them since they are stateful objects. Support for that is going to be up to LMFE, and I'm not sure if resetting fully clears all the internal state they might have. There's obviously something being retained but you'd have to dig into the source code of LMFE to try and figure out what it is.
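As a rough illustration of the pooling idea, here is a minimal sketch. It does not use the real ExLlamaV2 or LMFE API: `ResettingPool`, `StandInFilter`, and its `feed`/`reset` methods are hypothetical stand-ins, and whether a real filter's reset actually clears all of its internal state is exactly the open question.

```python
from collections import deque
from typing import Callable, Deque, Generic, TypeVar

T = TypeVar("T")


class ResettingPool(Generic[T]):
    """Hypothetical pool that resets every object before handing it out."""

    def __init__(self, factory: Callable[[], T], reset: Callable[[T], None],
                 max_idle: int = 4) -> None:
        self._factory = factory      # builds a new object when the pool is empty
        self._reset = reset          # must clear ALL per-generation state
        self._idle: Deque[T] = deque()
        self._max_idle = max_idle    # cap on idle objects kept alive

    def acquire(self) -> T:
        obj = self._idle.popleft() if self._idle else self._factory()
        self._reset(obj)             # explicit reset on every reuse
        return obj

    def release(self, obj: T) -> None:
        # Objects beyond the cap are dropped so they can be garbage collected.
        if len(self._idle) < self._max_idle:
            self._idle.append(obj)


class StandInFilter:
    """Stand-in for a stateful filter; NOT the real filter class."""

    def __init__(self) -> None:
        self.history: list = []      # state that accumulates during generation

    def feed(self, token: int) -> None:
        self.history.append(token)

    def reset(self) -> None:
        self.history.clear()


pool = ResettingPool(StandInFilter, StandInFilter.reset, max_idle=1)
first = pool.acquire()
first.feed(1)
pool.release(first)
second = pool.acquire()              # same object, state cleared by reset
```

If memory still grows with a pattern like this, the reset is not clearing everything, and creating fresh filters per request is the safer fallback.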
OS
Linux
GPU Library
CUDA 12.x
Python version
3.11
Pytorch version
2.4.1
Model
No response
Describe the bug
To force the model to respond in JSON format, I am using the ExLlamaV2TokenEnforcerFilter and ExLlamaV2PrefixFilter classes, appending them to a filters list, and passing that list as the filters when generating output from the model. Since my use cases are limited, I thought of caching instances of both classes in a dict and reusing them. When I do this, system RAM utilization keeps increasing, and after a few iterations the process runs out of memory. It usually takes 10-15 GB of system RAM, but over time the usage grows past 128 GB, causing an OOM.
I am sharing the code snippet