turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Changing hyper-parameters after initialization without reloading weights from disk #299

Open kmccleary3301 opened 9 months ago

kmccleary3301 commented 9 months ago

I'm writing a production server to handle requests from a large, rotating pool of clients. I have a custom manager class that handles everything, but I'm hoping to keep the models persistent in memory between requests. I'm trying to build it so that requests can specify hyper-parameters such as max_seq_len, temperature, etc. I'd prefer to do this as efficiently as possible, swapping in custom parameters for each client request rather than fully reloading the model from disk on every call that uses unique parameters.

Is there a way I can do this with the current code? If not, what would I need to refactor? I am working with jllllll's Python package fork of this repo, but the changes there are minimal, so I figured it appropriate to ask the question here.
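
For concreteness, here is roughly what I'm hoping the per-request path could look like. This is only a sketch based on my reading of the code: the paths are placeholders, and I'm assuming the sampling knobs on `generator.settings` (temperature, top_p, top_k) can be changed freely between calls, while `max_seq_len` appears to be baked into the config and cache at load time.

```python
# Sketch only: paths are placeholders, and I'm guessing at which attributes are
# safe to mutate between requests without reloading weights.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# Loaded once at server start-up and kept resident in memory.
config = ExLlamaConfig("/path/to/config.json")
config.model_path = "/path/to/model.safetensors"
config.max_seq_len = 4096          # fixed here; the part I don't know how to change later

model = ExLlama(config)
tokenizer = ExLlamaTokenizer("/path/to/tokenizer.model")
cache = ExLlamaCache(model)        # cache size presumably depends on max_seq_len
generator = ExLlamaGenerator(model, tokenizer, cache)

def handle_request(prompt: str, params: dict) -> str:
    # Per-request sampling settings: these look like plain attributes on
    # generator.settings, so I assume they can be swapped freely per call.
    generator.settings.temperature = params.get("temperature", 0.7)
    generator.settings.top_p = params.get("top_p", 0.9)
    generator.settings.top_k = params.get("top_k", 40)

    # generate_simple appears to re-tokenize the prompt and reset the cache on
    # each call, so state shouldn't leak between clients.
    return generator.generate_simple(prompt, max_new_tokens=params.get("max_new_tokens", 256))
```

If swapping the sampler settings like this is already safe, then max_seq_len is the only remaining question, since it seems to determine the cache allocation.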