run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Interrupted OpenAIAgent Streaming using K8s and NGINX #10478

Closed thanh-cnguyen closed 1 month ago

thanh-cnguyen commented 5 months ago

Bug Description

I need help with a response-streaming issue. Streaming worked fine when the app ran in Docker containers, but when it runs in K8s pods behind NGINX, the response comes back only partially: it outputs a word or two and then stops. I would like to know whether LlamaIndex has any known conflicts with NGINX or K8s, and how I can debug this.

Version

0.9.39

Steps to Reproduce

The steps to reproduce:

  1. Call stream_chat on the agent
  2. Pass the StreamingAgentChatResponse.response_gen to a StreamingHttpResponse
  3. Return it to the front end

How the agent is initialized

""" OpenAIAgent.from_tools( tools=query_engine_tools, llm=llm, memory=memory, callback_manager=callback_manager, system_prompt=updated_default_system_prompt, verbose=False, ) """

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 5 months ago

🤖

Hey @thanh-cnguyen, great to see you back here! Hope you've been doing well. Let's dive into this issue and see how we can help.

Based on the information provided, it's difficult to pinpoint the exact cause of the issue. However, I can suggest a few general debugging steps that might help you identify the problem.

  1. Check the NGINX configuration: NGINX might be closing the connection prematurely or buffering the proxied response instead of streaming it through. Make sure NGINX is configured to handle streaming HTTP responses without buffering (see the sketch after this list).

  2. Check the K8s pod logs: The logs of the K8s pods might provide more information about why the streaming is returning partially. You can use kubectl logs <pod-name> to check the logs.

  3. Check the inference backend logs: if your LLM calls go through a separate model-serving layer (for example, a Triton Inference Server), its logs might contain errors or warnings that help identify the issue.

  4. Check the network between the K8s pods and NGINX: There might be network issues between the K8s pods and NGINX that are causing the streaming to return partially. You can use network debugging tools to check for network issues.

  5. Check the LlamaIndex code: There might be a bug in the LlamaIndex code that's causing the streaming to return partially. You can use a Python debugger to step through the code and identify any issues.
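
On point 1, a frequent cause of "a word or two, then nothing" behind NGINX is proxy response buffering (the proxy_buffering directive), which holds upstream chunks back instead of flushing them to the client. One way to rule it out from the application side, assuming the Django StreamingHttpResponse from the reproduction steps, is to set the X-Accel-Buffering header, which NGINX honours per response. The helper name below is hypothetical; this is a sketch, not a confirmed fix:

```python
# Sketch: ask NGINX not to buffer this streamed response.
from django.http import StreamingHttpResponse


def as_unbuffered_stream(streaming_response):
    """Wrap a StreamingAgentChatResponse so the proxy flushes tokens as they arrive."""
    response = StreamingHttpResponse(
        streaming_response.response_gen,
        content_type="text/plain",
    )
    # With proxy_buffering enabled, NGINX may hold the body until it is large or
    # complete; this header tells it to pass each chunk straight through.
    response["X-Accel-Buffering"] = "no"
    return response
```

The equivalent server-side change is proxy_buffering off; (and, if the stream is slow, a longer proxy_read_timeout) in the NGINX location block for this route; if NGINX runs as the Kubernetes ingress controller, it exposes a matching proxy-buffering setting through its ingress configuration.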

As for your question about known issues in LlamaIndex version 0.9.39 that could cause streaming to return partially when used with K8s pods and NGINX, I wasn't able to find any such issues in the repository. It's still possible that an issue exists that I missed or that isn't documented.

If these steps don't help you identify the issue, could you please provide more information about the error or behavior you're experiencing? Specifically, any error messages or logs would be very helpful. Also, could you please provide more details about your NGINX configuration and how you're using K8s pods and NGINX with LlamaIndex? This information could help me provide a more specific solution to your problem.

For more information, you can refer to the llama_index/llms/nvidia_triton_utils.py file in the LlamaIndex repository.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

logan-markewich commented 5 months ago

Yea I've never seen this before? You might have to debug this somehow? I wonder if it's related to how the response is being consumed on the frontend?
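
One quick way to rule the frontend out is to consume the streaming endpoint directly with an HTTP client and check whether every token arrives. A minimal sketch (the URL and query parameter are placeholders for the actual endpoint):

```python
# Sketch: bypass the frontend and read the stream straight from the backend.
import requests

with requests.get(
    "http://<your-host>/chat",  # placeholder for the real streaming endpoint
    params={"message": "hello"},
    stream=True,
    timeout=60,
) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```

If the full answer shows up here but not in the browser, the problem is in the frontend or the proxy path rather than in LlamaIndex.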

dosubot[bot] commented 2 months ago

Hi, @thanh-cnguyen,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you are experiencing interrupted response streaming on LlamaIndex version 0.9.39 when using K8s pods and NGINX, which did not occur with Docker containers, and you asked whether there are conflicts with NGINX or K8s and how to debug the issue. Initial responses from "dosubot" and "logan-markewich" offered general debugging steps and suggested areas to investigate, such as the NGINX configuration, K8s pod logs, Triton Inference Server logs, network issues, and the LlamaIndex code. You acknowledged these responses with a "THUMBS_UP" reaction.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and cooperation. If you have any further questions or updates, feel free to reach out.