Closed: chigkim closed this issue 2 months ago
🤖
Hey @chigkim, great to see you back in the world of code and pirates! 🦜
The inconsistency you're observing in the return values of `complete`, `chat`, `stream_complete`, and `stream_chat` between OpenAI and Ollama is due to the different ways these methods are implemented in the respective libraries.

In OpenAI, the methods `complete` and `chat` return usage information because they are designed to interact with the OpenAI API and return the response from the API, which includes the generated text and other information about the API request. However, `stream_chat` and `stream_complete` do not return usage information because they are designed to return a stream of responses from the API, and the usage information is not included in each individual response in the stream.
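For illustration, here is a minimal sketch of how this difference looks from the caller's side. It assumes llama-index 0.10-style imports, an `OPENAI_API_KEY` in the environment, and that `raw` is a plain dict as in the outputs pasted in the original question:

```python
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")
messages = [ChatMessage(role="user", content="What is your name?")]

# Non-streaming: the raw OpenAI payload carries a `usage` entry.
resp = llm.chat(messages)
print(resp.raw.get("usage"))   # e.g. CompletionUsage(completion_tokens=..., ...)

# Streaming: each chunk is a 'chat.completion.chunk' with no `usage` entry.
last = None
for last in llm.stream_chat(messages):
    pass
print(last.raw.get("usage"))   # None
```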
In Ollama, the methods `complete`, `chat`, and `stream_complete` return additional information such as `total_duration`, `prompt_eval_count`, etc., because they are implemented in a way that processes the response from the API and extracts this additional information. However, `stream_chat` does not return this additional information because it is designed to return a stream of responses from the API, and the additional information is not included in each individual response in the stream.
This is not an inconsistency in the LlamaIndex codebase, but rather a difference in the design and implementation of these methods in the OpenAI and Ollama libraries. If you need the additional information returned by the `complete`, `chat`, and `stream_complete` methods in Ollama for `stream_chat`, you may need to modify the `stream_chat` method in the Ollama library to extract and return this information.
For more details on how these methods are implemented in LlamaIndex, you can refer to the source code in the LlamaIndex repository. Specifically, you can look at the `Cohere` class in the `llama_index/llms/cohere/base.py` file, which implements the `complete`, `chat`, `stream_complete`, and `stream_chat` methods.
I hope this helps clarify the difference in the return values of these methods between OpenAI and Ollama. If you have any further questions, feel free to ask.
Just commenting on @dosu-bot's response, which is partially correct. The OpenAI library doesn't seem to return token usage when streaming. However, Ollama does return `total_duration`, `prompt_eval_count`, etc. when using the complete and chat APIs with streaming. Yet LlamaIndex returns that info when using `stream_complete`, but not `stream_chat`, for some reason.
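For what it's worth, here is a quick sketch confirming the Ollama side of this, assuming a local Ollama server on `localhost:11434` with the `neural-chat` model pulled: when streaming from `/api/chat` directly, the stats arrive on the final `"done": true` line.

```python
import json

import httpx

payload = {
    "model": "neural-chat",
    "messages": [{"role": "user", "content": "What is your name?"}],
    "stream": True,
}
with httpx.Client(timeout=120.0) as client:
    with client.stream("POST", "http://localhost:11434/api/chat", json=payload) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                # The final line is the one carrying the stats that stream_chat currently drops.
                print(chunk.get("total_duration"), chunk.get("prompt_eval_count"))
```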
🤖
The inconsistencies you're observing between the OpenAI and Ollama libraries are likely due to the different design and implementation of these libraries. Each library is designed with different use cases in mind, and thus they may provide different methods and return different types of information.
To modify the `stream_chat` method in the Ollama library to return additional information like `total_duration` and `prompt_eval_count`, similar to the `stream_complete` method in LlamaIndex, you can add these fields to the `ChatResponse` object that is yielded by the `stream_chat` method. However, this would require that the Ollama API server returns this information in its response. If the Ollama API server does not return this information, you would need to modify the server to do so.
Here is an example of how you might modify the `stream_chat` method:
```python
@llm_chat_callback()
def stream_chat(
    self, messages: Sequence[ChatMessage], **kwargs: Any
) -> ChatResponseGen:
    payload = {
        "model": self.model,
        "messages": [
            {
                "role": message.role.value,
                "content": message.content,
                **message.additional_kwargs,
            }
            for message in messages
        ],
        "options": self._model_kwargs,
        "stream": True,
        **kwargs,
    }

    with httpx.Client(timeout=Timeout(self.request_timeout)) as client:
        with client.stream(
            method="POST",
            url=f"{self.base_url}/api/chat",
            json=payload,
        ) as response:
            response.raise_for_status()
            text = ""
            for line in response.iter_lines():
                if line:
                    chunk = json.loads(line)
                    if "done" in chunk and chunk["done"]:
                        break
                    message = chunk["message"]
                    delta = message.get("content")
                    text += delta
                    yield ChatResponse(
                        message=ChatMessage(
                            content=text,
                            role=MessageRole(message.get("role")),
                            additional_kwargs=get_addtional_kwargs(
                                message, ("content", "role")
                            ),
                        ),
                        delta=delta,
                        raw=chunk,
                        additional_kwargs=get_addtional_kwargs(chunk, ("message",)),
                        total_duration=chunk.get("total_duration"),  # new field
                        prompt_eval_count=chunk.get("prompt_eval_count"),  # new field
                    )
```
In this modification, the `ChatResponse` object that is yielded by the `stream_chat` method includes two new fields: `total_duration` and `prompt_eval_count`. These fields are populated with the corresponding values from the `chunk` dictionary, which is created by parsing each line of the response from the Ollama API server. If the server does not return these values, `chunk.get("total_duration")` and `chunk.get("prompt_eval_count")` will return `None`.

Please note that this modification assumes that the Ollama API server returns `total_duration` and `prompt_eval_count` in its response. If it does not, you would need to modify the server to do so.
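A hypothetical usage sketch for the modified method, assuming `ChatResponse` tolerates the extra keyword arguments. Note that because the loop above breaks on the `"done": true` chunk, and Ollama only attaches `total_duration` and `prompt_eval_count` to that final chunk, these values would in practice still come back as `None` unless the final chunk is also yielded:

```python
from llama_index.core.llms import ChatMessage
from llama_index.llms.ollama import Ollama

llm = Ollama(model="neural-chat", request_timeout=120.0)
messages = [ChatMessage(role="user", content="What is your name?")]

last = None
for last in llm.stream_chat(messages):
    print(last.delta or "", end="", flush=True)
print()
# getattr keeps this safe whether or not the extra fields actually land on ChatResponse.
print(getattr(last, "total_duration", None), getattr(last, "prompt_eval_count", None))
```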
You can find the current structure of the stream_chat
method in the LlamaIndex codebase in the llama-index-core/llama_index/core/chat_engine/context.py file.
I hope this helps! If you have any further questions, feel free to ask.
It worked when I commented out the following (lines 150 and 151 in `def stream_chat` in `llama_index/llms/ollama/base.py`). I was able to get all the stats in `response.raw`. I wonder why this was intentionally left out? If there's no harm, could you fix it?
if "done" in chunk and chunk["done"]:
break
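For reference, one way that part of the loop could yield the final chunk instead of dropping it, so the stats land in `response.raw` and `additional_kwargs` (effectively what removing the `break` achieves). This is just a sketch, not what `base.py` currently does; names like `text` and `get_addtional_kwargs` come from the code quoted above:

```python
if chunk.get("done"):
    # Yield the final chunk as well: it is the one carrying total_duration,
    # prompt_eval_count, eval_count, etc., so they surface to the caller.
    yield ChatResponse(
        message=ChatMessage(content=text, role=MessageRole.ASSISTANT),
        delta="",
        raw=chunk,
        additional_kwargs=get_addtional_kwargs(chunk, ("message",)),
    )
    break
```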
Hi, @chigkim,
I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, the issue you raised highlighted inconsistencies in the responses returned by different methods for the OpenAI and Ollama APIs. Dosubot provided insights into the differences and suggested modifying the `stream_chat` method in the Ollama integration to return additional information. You confirmed that commenting out certain lines in the `stream_chat` method allowed you to retrieve all the stats in the response and questioned why this was intentionally opted out.
Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!
As far as I know, it's still an issue.
Question Validation
Question
With OpenAI, `complete` and `chat` return usage. However, `stream_chat` and `stream_complete` do not. With Ollama, `complete`, `chat`, and `stream_complete` return things like `total_duration` and `prompt_eval_count`. However, `stream_chat` does not.
OpenAI chat (no streaming):

```
ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content="Ahoy matey! The name's Captain Rainbowbeard, the most colorful pirate on the seven seas! What can I do for ye today?", additional_kwargs={}), raw={'id': 'chatcmpl-8u54QByCw9wg2ZnG3S9zUJO40Y0T8', 'choices': [Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Ahoy matey! The name's Captain Rainbowbeard, the most colorful pirate on the seven seas! What can I do for ye today?", role='assistant', function_call=None, tool_calls=None))], 'created': 1708376550, 'model': 'gpt-3.5-turbo-0125', 'object': 'chat.completion', 'system_fingerprint': 'fp_6dd124df95', 'usage': CompletionUsage(completion_tokens=30, prompt_tokens=23, total_tokens=53)}, delta=None, additional_kwargs={})
```

OpenAI stream_chat:

```
ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content="Ahoy matey! The name's Captain Rainbowbeard! Aye, I be a pirate with a love for all things colorful and bright. Me beard be as vibrant as a rainbow, and me ship be the most colorful vessel on the seven seas! What can I do for ye today, me hearty?", additional_kwargs={}), raw={'id': 'chatcmpl-8u559v4ifr04tuzbTJvNqlFdef6fe', 'choices': [Choice(delta=ChoiceDelta(content=None, function_call=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)], 'created': 1708376595, 'model': 'gpt-3.5-turbo-0125', 'object': 'chat.completion.chunk', 'system_fingerprint': 'fp_69829325d0'}, delta='', additional_kwargs={})
```

Ollama chat (no streaming):

```
ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=" My name is Captain Parrot Beak.\n\nI'm an eccentric and jolly pirate who loves to live life on the seas, singing sea shanties, telling tall tales, and swabbing my wooden leg. You could say I'm a bit of a buccaneer with flair! Join me in adventures filled with gold, laughter, and a touch of the paranormal. Just remember, we're here to have fun. After all, life is better with pirates in it! Arrrr, matey!", additional_kwargs={}), raw={'model': 'neural-chat', 'created_at': '2024-02-19T20:57:22.330826Z', 'message': {'role': 'assistant', 'content': " My name is Captain Parrot Beak.\n\nI'm an eccentric and jolly pirate who loves to live life on the seas, singing sea shanties, telling tall tales, and swabbing my wooden leg. You could say I'm a bit of a buccaneer with flair! Join me in adventures filled with gold, laughter, and a touch of the paranormal. Just remember, we're here to have fun. After all, life is better with pirates in it! Arrrr, matey!"}, 'done': True, 'total_duration': 20255928348, 'load_duration': 2862670561, 'prompt_eval_count': 31, 'prompt_eval_duration': 1548238000, 'eval_count': 117, 'eval_duration': 15844499000}, delta=None, additional_kwargs={'model': 'neural-chat', 'created_at': '2024-02-19T20:57:22.330826Z', 'done': True, 'total_duration': 20255928348, 'load_duration': 2862670561, 'prompt_eval_count': 31, 'prompt_eval_duration': 1548238000, 'eval_count': 117, 'eval_duration': 15844499000})
```

Ollama stream_chat:

```
ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=" My name is Captain Parrot Beard. I'm known for my love of rum, parrots and walking the plank. But don't let my light-hearted appearance fool you; I'm a fierce pirate leader with great strategic skills. Arrrr!", additional_kwargs={}), raw={'model': 'neural-chat', 'created_at': '2024-02-19T15:30:20.822078Z', 'message': {'role': 'assistant', 'content': '!'}, 'done': False}, delta='!', additional_kwargs={'model': 'neural-chat', 'created_at': '2024-02-19T15:30:20.822078Z', 'done': False})
```
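For anyone wanting to reproduce the comparison above, here is a rough sketch. The model name comes from the pasted outputs; the system prompt is a guess based on the replies, and the exact original script isn't shown here:

```python
from llama_index.core.llms import ChatMessage
from llama_index.llms.ollama import Ollama

llm = Ollama(model="neural-chat", request_timeout=120.0)
messages = [
    ChatMessage(role="system", content="You are a pirate with a colorful personality."),
    ChatMessage(role="user", content="What is your name?"),
]

# Non-streaming: raw includes total_duration, prompt_eval_count, etc.
print("Ollama chat (no streaming):", llm.chat(messages))

# Streaming: the last yielded ChatResponse only holds the final content chunk.
last = None
for last in llm.stream_chat(messages):
    pass
print("Ollama stream_chat:", last)
```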