ollama / ollama

Get up and running with Llama 3.1, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

Streaming Chat Completion via OpenAI API should support stream option to include Usage #4448

Open odrobnik opened 4 months ago

odrobnik commented 4 months ago

In streaming mode, the OpenAI chat completion API has a new option to include usage information after the chunks. You just add "stream_options": { "include_usage": true } to the request.

Then the final chunks will look like this:

...
data: {"id":"chatcmpl-9P4UJf7DEdyXVro2VOMRMT9qKR0bC","object":"chat.completion.chunk","created":1715762479,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":1,"delta":{},"logprobs":null,"finish_reason":"stop"}],"usage":null}
data: {"id":"chatcmpl-9P4UJf7DEdyXVro2VOMRMT9qKR0bC","object":"chat.completion.chunk","created":1715762479,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":2,"delta":{},"logprobs":null,"finish_reason":"stop"}],"usage":null}
data: {"id":"chatcmpl-9P4UJf7DEdyXVro2VOMRMT9qKR0bC","object":"chat.completion.chunk","created":1715762479,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[],"usage":{"prompt_tokens":24,"completion_tokens":58,"total_tokens":82}}
data: [DONE]

The final chunk contains no choices, but it does contain a usage object:

"usage":{"prompt_tokens":24,"completion_tokens":58,"total_tokens":82}

This usage covers all of the generations from the stream.
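For context, this is how a client would opt in from the OpenAI Python SDK. A minimal sketch, assuming Ollama's OpenAI-compatible endpoint on the default port and an example model name:

```python
from openai import OpenAI

# Assumption: Ollama's OpenAI-compatible endpoint on the default port.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="llama3.1",  # example model name
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
    stream_options={"include_usage": True},  # request the final usage chunk
)

for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage is not None:
        # Only the final chunk (with empty choices) carries usage.
        print(chunk.usage)  # prompt_tokens, completion_tokens, total_tokens
```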

jeremychone commented 1 month ago

I second this one; this is very much missing.

I'm not sure whether we should use the native Ollama API rather than the OpenAI compatibility layer, since the native API seems to include prompt_eval_count (input tokens) and eval_count (output tokens) in its final response.
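For illustration, an adapter could translate those native counters into OpenAI-style usage. A minimal sketch; the function name and the assumption that the final streamed response is available as a dict are hypothetical:

```python
# Map Ollama's native /api/chat final-response counters to
# OpenAI-style usage fields. Field names (prompt_eval_count, eval_count)
# come from Ollama's native API; everything else here is illustrative.
def to_openai_usage(final_response: dict) -> dict:
    prompt_tokens = final_response.get("prompt_eval_count", 0)
    completion_tokens = final_response.get("eval_count", 0)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```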

I am okay with creating a custom adapter for Ollama using its native API, but I'm not sure whether that aligns with Ollama's focus or direction.

liamwh commented 1 week ago

Can this issue now be closed since this has been merged? https://github.com/lobehub/lobe-chat/issues/3179