run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Response Streaming Using Llama Index enters an endless loop #12886

Open KaniaruIvy opened 3 months ago

KaniaruIvy commented 3 months ago

Bug Description

I am using LlamaIndex and Llama 2-Chat on Sagemaker. I am able to make inferences successfully when streaming=False, but when streaming=True, the invocation enters an endless loop. I have been following the documentation below: https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/streaming/

Version

0.10.17

Steps to Reproduce

Here is where I define the Content handler

import json  # used by transform_input and transform_output below

class ContentHandlerForTextGeneration(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        system_content = """You are a helpful assistant. Always answer to questions as helpfully as possible."""
        instructions = [
            {"role": "system", "content": f"{system_content} "},
        ]
        instructions.append({"role": "user", "content": f"{prompt}"})

        # Build a Llama 2-Chat style prompt: <s>[INST] <<SYS>>system<</SYS>> user [/INST]
        stop_token = "</s>"
        start_token = "<s>"
        startPrompt = f"{start_token}[INST] "
        endPrompt = " [/INST]"
        conversation = []
        for index, instruction in enumerate(instructions):
            if instruction["role"] == "system" and index == 0:
                conversation.append(f"<<SYS>>\n{instruction['content']}\n<</SYS>>\n\n")
            elif instruction["role"] == "user":
                conversation.append(instruction["content"].strip())
            else:
                conversation.append(f"{endPrompt} {instruction['content'].strip()} {stop_token}{startPrompt}")

        input_str = startPrompt + "".join(conversation) + endPrompt
        inference_params = {
            "stop": ["</s>"],
            "return_full_text": False,
        }
        payload = {
            "inputs": input_str,  # send the formatted Llama 2 prompt built above
            "parameters": inference_params,
        }
        return json.dumps(payload).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # The endpoint returns a single JSON document; extract the generated text
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generated_text"]
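
As a quick sanity check, the handler can be exercised outside the endpoint by calling transform_input directly and inspecting the JSON payload it produces; a minimal sketch (the example prompt is arbitrary):

handler = ContentHandlerForTextGeneration()
body = handler.transform_input("What does the document cover?", model_kwargs={})
# Expect a JSON payload whose "inputs" field is the <s>[INST]-formatted Llama 2 prompt
print(body.decode("utf-8"))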

Defining the LLM

def define_llm():
    parameters = {
        "max_new_tokens": 1024,
        "temperature": 0.1,
    }
    content_handler = ContentHandlerForTextGeneration()
    llm = SagemakerEndpoint(
        endpoint_name=endpoint_name,
        region_name=region,
        model_kwargs=parameters,
        endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
        content_handler=content_handler,
    )
    return llm

llm = define_llm()

Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

Defining the index

index = VectorStoreIndex.from_documents(docs, show_progress=True)

Define query engine and query

(This cell enters an endless loop)

query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
streaming_response = query_engine.query("Give me a summary of the document")
streaming_response.print_response_stream()
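
A way to check whether any tokens arrive at all, rather than the cell hanging silently, is to iterate the streaming response's token generator with a hard cap. This is a minimal debugging sketch that assumes the LlamaIndex streaming response exposes response_gen (as in the streaming guide); the 200-chunk cap is an arbitrary safeguard:

streaming_response = query_engine.query("Give me a summary of the document")
for i, text in enumerate(streaming_response.response_gen):
    print(text, end="", flush=True)
    if i > 200:  # safety cap so the cell cannot run forever while debugging
        print("\n[stopped after 200 chunks without the stream terminating]")
        break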

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 3 months ago

The endless loop issue you're encountering with streaming=True in LlamaIndex on SageMaker might be due to the absence of a clear termination condition in the streaming logic. Specifically, the generator function that reads from the response stream lacks a mechanism to determine when to stop yielding new responses. This can happen if the endpoint continuously sends data without an explicit end-of-text token or a specific message indicating the end of the stream.

To address this, you'll need to implement a condition within the generator to break out of the loop when a certain criterion is met. This could involve checking for a specific pattern in the received data that signifies the end of the stream. Without such a condition, the generator will continue to wait for new items indefinitely, leading to the endless loop behavior you've observed.
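
As a rough illustration of such a termination check (not LlamaIndex's actual streaming implementation; the event/chunk layout and the stop token are assumptions about a SageMaker-style response stream), the generator can break out as soon as an end-of-text marker or an empty payload is seen:

def stream_with_termination(event_stream, stop_token="</s>"):
    """Yield text chunks from a raw endpoint event stream, stopping once the
    stop token (or an empty payload) appears instead of waiting forever."""
    for event in event_stream:
        payload = event.get("PayloadPart", {}).get("Bytes", b"")
        if not payload:
            break  # nothing left to read, treat as end of stream
        chunk = payload.decode("utf-8")
        if stop_token in chunk:
            # yield whatever precedes the end-of-text marker, then stop
            yield chunk.split(stop_token, 1)[0]
            return
        yield chunk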

Also make sure your installation is up to date. No specific changes between version 0.10.17 and the latest release point directly to streaming improvements, but it is always beneficial to run the latest version for bug fixes and enhancements.

Your setup with SagemakerEndpoint and ContentHandlerForTextGeneration seems correct, but the exact cause of the issue might be tied to the specifics of these implementations. Reviewing the streaming logic for a proper termination signal based on the content of the data received is a good starting point for troubleshooting this issue.

Sources

#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.