timoklimmer / powerproxy-aoai

Monitors and processes traffic to and from Azure OpenAI endpoints.
MIT License

Use token consumption from streaming response (once available) #77

Open codylittle opened 1 month ago

codylittle commented 1 month ago

Currently none of the gpt-4-turbo variants are included; maybe a prefix-based approach, similar to the one used in tiktoken, would be best.
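
For illustration, a prefix-based lookup could look roughly like this (a sketch only; the dictionary names and per-message values are illustrative, loosely modeled on tiktoken's `MODEL_PREFIX_TO_ENCODING` idea):

```python
# Sketch: resolve per-message token overhead by exact name first,
# then fall back to prefix matching so new model variants still resolve.
TOKENS_PER_MESSAGE_BY_EXACT_NAME = {
    "gpt-3.5-turbo-0301": 4,
}
TOKENS_PER_MESSAGE_BY_PREFIX = {
    "gpt-4-turbo": 3,
    "gpt-4": 3,
    "gpt-3.5-turbo": 3,
}

def tokens_per_message(model: str) -> int:
    """Resolve per-message token overhead, falling back to prefix matching."""
    if model in TOKENS_PER_MESSAGE_BY_EXACT_NAME:
        return TOKENS_PER_MESSAGE_BY_EXACT_NAME[model]
    for prefix, value in TOKENS_PER_MESSAGE_BY_PREFIX.items():
        if model.startswith(prefix):
            return value
    raise NotImplementedError(f"No token counting rule for model '{model}'.")
```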

timoklimmer commented 1 month ago

Thanks, I will take a look into it.

timoklimmer commented 1 month ago

As I understand it, including the gpt-4-turbo variants would not make a difference currently, but the function will indeed need to be updated once GPT-4o is available via the API and once we finally get token counts in streaming responses. I'll leave this issue open and fix it in the future.

codylittle commented 1 month ago

I'm fairly confident that the 4-turbo variants still use the same ChatML formatting as prior models, so no change is required there. My concern is more that, since "gpt-4-turbo" is neither in the dictionary nor equal to "gpt-3.5-turbo-0301", the function will raise a NotImplementedError.
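
For context, the failing pattern looks roughly like this (simplified from the OpenAI cookbook's `num_tokens_from_messages` recipe; the exact code in powerproxy-aoai may differ):

```python
import tiktoken

def num_tokens_from_messages(messages, model):
    """Count prompt tokens for a chat request (simplified illustration)."""
    encoding = tiktoken.get_encoding("cl100k_base")
    if model in {"gpt-3.5-turbo-0613", "gpt-4-0314", "gpt-4-0613"}:
        tokens_per_message, tokens_per_name = 3, 1
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message, tokens_per_name = 4, -1
    else:
        # "gpt-4-turbo" matches neither branch, so it lands here.
        raise NotImplementedError(f"Token counting not implemented for model '{model}'.")
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
```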

timoklimmer commented 1 week ago

Token estimation needs to be updated once we have the usage info in the response. On hold for now; accepting approximate estimation as a workaround.
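
A rough version of that workaround might look like the following (illustrative only; it assumes the cl100k_base encoding for the 4-turbo family and re-encodes the concatenated streamed delta content):

```python
import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")

def estimate_streamed_completion_tokens(streamed_text: str) -> int:
    """Approximate completion tokens by re-encoding the text assembled
    from the stream's delta chunks. Ignores per-message overhead, so it
    can slightly undercount compared with server-side usage numbers."""
    return len(ENCODING.encode(streamed_text))
```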

codylittle commented 1 week ago

Hey Timo, I assume this is in reference to the OpenAI update? I'm interested to see how Azure will implement this for streams cut short by finish_reason: content_filter.
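
For reference, the OpenAI-side mechanism looks roughly like this (assuming Azure OpenAI eventually exposes the same `stream_options` flag; with the Azure SDK, an `AzureOpenAI` client would be the analogue):

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    # Request a final chunk that carries the usage info.
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    elif chunk.usage:  # final chunk: empty choices, populated usage
        print(f"\nprompt={chunk.usage.prompt_tokens}, "
              f"completion={chunk.usage.completion_tokens}")
```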

timoklimmer commented 1 week ago

@codylittle Yes, exactly.