Hi @krahnikblis, thanks for sharing. Unfortunately, I cannot accept change requests other than pull requests, but I can take a look at your PR(s) once submitted. Thanks!
Hey @krahnikblis, quick update: I have extended the LimitUsage plugin in the main branch. You can now also configure things like:
```yaml
max_tokens_per_minute_in_k:
  gpt-35-turbo: 50
  gpt-4-turbo: 5
```

in addition to just

```yaml
max_tokens_per_minute_in_k: 20
```
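For reference, these limits live under each client entry in `config.local.yaml`, so the two styles side by side would look something like this (client names and keys below are placeholders):

```yaml
clients:
  - name: Team A
    key: "<api-key>"
    # per-model limits, in K tokens per minute
    max_tokens_per_minute_in_k:
      gpt-35-turbo: 50
      gpt-4-turbo: 5
  - name: Team B
    key: "<api-key>"
    # flat per-client limit still works as before
    max_tokens_per_minute_in_k: 20
```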
I think that solves your issue. If not, please let me know. I will include the update in the next release.
Hello! I've been getting this thing up and running on a VM between my team's apps and our Azure OAI service, and so far it's working nicely! But my resource groups and quotas mean I have wildly different token limits per model (5K/min on GPT-4, 30K/min on GPT-3.5 and the embedding models), so I need to be able to configure users with limits per model. I made some adjustments to the `config.local.yaml` structure and the `LimitUsage.py` file, and things appear to be working as desired, so I thought I'd share and request the feature be implemented, so that the next time I git-pull your latest enhancements I won't need to re-edit the code.

I don't yet know how to use GitHub's PR features, so I'm pasting the relevant bits here. There's definitely a more elegant way to do this, but there's also a lot of nesting and subclassing and I just wanted to get things moving, so this is how I did it:
In the `config.local.yaml` file, under each client, I added a `models` key, like so:
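(I'm sketching this from memory; the surrounding client fields are abbreviated, the model names are just my deployments, and values are in K tokens per minute to match the existing key.)

```yaml
clients:
  - name: my-team
    key: "<api-key>"
    # existing flat limit, kept as the default
    max_tokens_per_minute_in_k: 30
    # new: per-model overrides, in K tokens per minute
    models:
      gpt-4: 5
      gpt-35-turbo: 30
      text-embedding-ada-002: 30
```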
Leaving the existing `max_tokens_per_minute_in_k` in place means your structure is untouched, and these changes would be backward-compatible with configs that don't have the `models` key.
In `LimitUsage.py`, inside of `on_client_identified(self, routing_slip)`, I added `routing_slip` to the call to the tokens-per-client function, and then redefined that function to take the new parameter, get the model being used in the request, and look it up against the `client_settings`, which seamlessly populated the `models` list using your existing `Configuration` class:
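The relevant bits end up looking roughly like this (I'm paraphrasing from memory: the helper name, the settings accessor, and where the model sits in the request body are approximations, not the repo's exact internals):

```python
# inside the LimitUsage plugin class -- sketch of the two touched spots;
# only the routing_slip threading and the per-model lookup are the actual
# change, surrounding names are approximate

def on_client_identified(self, routing_slip):
    """Runs once the client for a request has been identified."""
    client = routing_slip["client"]
    # changed: pass routing_slip through so the lookup can see the requested model
    max_tpm_in_k = self._get_max_tokens_per_minute_in_k(client, routing_slip)
    # ...rest of the method unchanged (budget check against max_tpm_in_k)...

def _get_max_tokens_per_minute_in_k(self, client, routing_slip):
    """Return the client's TPM limit in K, preferring a model-specific value."""
    # client_settings comes from the existing Configuration class, which
    # already picks up the new models key without any changes
    client_settings = self.configuration.get_client_settings(client)  # accessor name assumed
    requested_model = (routing_slip.get("incoming_request_body") or {}).get("model")
    client_models = client_settings.get("models") if client_settings else None
    if requested_model and client_models and requested_model in client_models:
        # the request names a model and this client has a specific limit for it
        return client_models[requested_model]
    # otherwise fall back to the class's existing per-client limit
    return self.configured_max_tpms[client]
```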
The changes are, of course, the added `routing_slip` parameter and the section beginning with `client_models`: if the request has the model param (as it should), and the client has the `models` key in its settings, and that model exists in the client's specific limits, the function returns the model-specific limit; otherwise it returns the class's existing `configured_max_tpms` for the client. **edit:** made some changes to where `client_settings` is collected and where `client_models` is referenced.

I've also kept some notes on how I set this up on Docker (it was a challenge, as I'm relatively new to it) and would be happy to share them as a write-up. I'm also working on a `LogUsageMessagesToJSON` plugin, since I want our usage histories to be searchable for analysis and for building a knowledge graph... I'd be happy to share that plugin as well, if you're interested, once I turn all the bugs into features...