KaniaruIvy opened 3 months ago
The endless loop issue you're encountering with streaming=True
in LlamaIndex on SageMaker might be due to the absence of a clear termination condition in the streaming logic. Specifically, the generator function that reads from the response stream lacks a mechanism to determine when to stop yielding new responses. This can happen if the endpoint continuously sends data without an explicit end-of-text token or a specific message indicating the end of the stream.
To address this, the generator needs a condition that breaks out of the loop once a termination criterion is met, for example a specific pattern in the received data that signals the end of the stream. Without such a condition, the generator waits for new items indefinitely, which produces the endless loop you've observed.
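For illustration, here is a minimal sketch of that kind of termination check, assuming the stream yields events shaped like {"PayloadPart": {"Bytes": ...}} and that the model emits an end-of-text marker such as "</s>"; both are assumptions about your endpoint rather than LlamaIndex's actual implementation:

# A minimal sketch of a streaming generator with an explicit stop condition.
# The event shape ({"PayloadPart": {"Bytes": ...}}) and the "</s>" stop token
# are assumptions about the endpoint's output, not LlamaIndex internals.
def stream_with_stop(response_stream, stop_token="</s>"):
    for event in response_stream:
        payload = event.get("PayloadPart", {}).get("Bytes", b"")
        if not payload:
            continue
        chunk = payload.decode("utf-8")
        if stop_token in chunk:
            # Yield any text before the stop token, then end the generator.
            yield chunk.split(stop_token, 1)[0]
            return
        yield chunk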
It's also worth making sure your installation is up to date. No specific changes between version 0.10.17 and the latest release point directly at streaming fixes, but staying current brings general bug fixes and enhancements.
Your setup with SagemakerEndpoint and ContentHandlerForTextGeneration looks correct, so the cause is more likely in the specifics of those implementations. Reviewing the streaming logic for a proper termination signal based on the content of the data received is a good starting point for troubleshooting.
Bug Description
I am using LlamaIndex and Llama 2-Chat on Sagemaker. I am able to make inferences successfully when streaming=False, but when streaming=True, the invocation enters an endless loop. I have been following the documentation below: https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/streaming/
Version
0.10.17
Steps to Reproduce
Here is where I define the content handler:
class ContentHandlerForTextGeneration(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"
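For reference, a handler of this kind typically also implements transform_input and transform_output. A minimal sketch, assuming a {"inputs": ..., "parameters": ...} request payload and a [0]["generated_text"] response field (these keys depend on the container serving the endpoint and may differ):

import json
from typing import Dict

from langchain_community.llms.sagemaker_endpoint import LLMContentHandler  # import path may vary by LangChain version

# Sketch of a complete handler; the payload keys ("inputs", "parameters",
# "generated_text") are assumptions and depend on the container serving
# the Llama 2 endpoint.
class ContentHandlerForTextGeneration(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        # Serialize the prompt and generation parameters into the JSON body
        # expected by the endpoint.
        payload = {"inputs": prompt, "parameters": model_kwargs}
        return json.dumps(payload).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Read the endpoint response and extract the generated text.
        response = json.loads(output.read().decode("utf-8"))
        return response[0]["generated_text"]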
Defining the LLM
def define_llm():
    parameters = {
        "max_new_tokens": 1024,
        "temperature": 0.1,
    }
    content_handler = ContentHandlerForTextGeneration()
    llm = SagemakerEndpoint(
        endpoint_name=endpoint_name,
        region_name=region,
        model_kwargs=parameters,
        endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
        content_handler=content_handler,
    )
    return llm
llm = define_llm()
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
Defining the index
index = VectorStoreIndex.from_documents(docs, show_progress=True)
Define query engine and query
(This cell enters an endless loop.)
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
streaming_response = query_engine.query("Give me a summary of the document")
streaming_response.print_response_stream()
Relevant Logs/Tracebacks
No response