run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Retrieving Log Probabilities in Streaming Response Using SimpleChatEngine #16264

Open Sanjit0910 opened 1 month ago

Sanjit0910 commented 1 month ago


Question

I am using SimpleChatEngine to query an LLM hosted on a vLLM server, and I want to retrieve the log probabilities (logprobs) of the tokens the model generates. To achieve this, I enabled logprobs when initializing the OpenAILike class, as shown below:

OpenAILike(
    api_base="http://url:port/v1",
    api_key="abc456",
    model="xyz",
    context_window=8192,
    max_tokens=1024,
    temperature=0.3,
    is_chat_model=True,
    api_version="",
    timeout=240,
    logprobs=True,    # Enable logprobs
    top_logprobs=2,   # Retrieve the top 2 logprobs per token
)

I generate a streaming response with the chat engine's stream_chat method, which returns a StreamingAgentChatResponse object. I know that this object's chat_stream attribute holds the generator used to stream the response, and that this generator yields ChatResponse objects.
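For context, the streaming setup looks roughly like this (a minimal sketch; llm stands for the OpenAILike instance configured above, and the prompt is just a placeholder):

from llama_index.core.chat_engine import SimpleChatEngine

# Build a chat engine around the logprobs-enabled LLM configured above
chat_engine = SimpleChatEngine.from_defaults(llm=llm)

# stream_chat returns a StreamingAgentChatResponse; its chat_stream
# attribute holds the generator of ChatResponse objects
streaming_response = chat_engine.stream_chat("Hello, how are you?")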

Each ChatResponse looks like this:

ChatResponse(
    message=ChatMessage(
        role=role,
        content=content,
        additional_kwargs=additional_kwargs,
    ),
    delta=content_delta,
    raw=response,  # Contains the full response, including the logprobs
    additional_kwargs=self._get_response_token_counts(response),
)

I understand that the raw attribute of each ChatResponse contains the full response from the LLM, which includes the log probabilities. However, I am unsure how to access this raw attribute from the StreamingAgentChatResponse object.

How can I access the logprobs in the raw response for each token of a streaming response? Is there a specific way to retrieve this information from the StreamingAgentChatResponse object or from the ChatResponse objects yielded by the generator?

dosubot[bot] commented 1 month ago

To access the log probabilities (logprobs) in the raw response for each token in a streaming response using the StreamingAgentChatResponse object, you can follow these steps:

  1. Initialize the Streaming Response: Make sure the StreamingAgentChatResponse object is set up correctly and is actively streaming (for example, returned by the chat engine's stream_chat method).

  2. Access the Generator: The chat_stream attribute of the StreamingAgentChatResponse object contains the generator that yields ChatResponse objects.

  3. Extract Log Probabilities: Use the extract_logprobs or extract_logprobs_stream utility functions (defined under Utility Functions below) to extract the log probabilities from the raw attribute of each ChatResponse object.

Here is an example of how you can achieve this:

# Assuming you have an instance of StreamingAgentChatResponse,
# e.g. returned by chat_engine.stream_chat(...)
streaming_response = StreamingAgentChatResponse(chat_stream=your_chat_stream_generator)

# Extract log probabilities from a single ChatResponse object.
# extract_logprobs is defined in the Utility Functions section below.
def extract_logprobs_from_chat_response(chat_response):
    raw_response = chat_response.raw
    if raw_response:
        return extract_logprobs(raw_response)
    return None

# Iterate over the generator to get ChatResponse objects
for chat_response in streaming_response.chat_stream:
    logprobs = extract_logprobs_from_chat_response(chat_response)
    if logprobs:
        print("Log probabilities:", logprobs)

Utility Functions

Here are the utility functions to extract log probabilities:

from typing import List, Optional

from llama_index.core.base.llms.types import LogProb

def extract_logprobs_choice(choice) -> Optional[List[LogProb]]:
    if choice.logprobs is not None:
        logprobs = []
        for logprob in choice.logprobs.content:
            logprobs.append(
                LogProb(
                    logprob=logprob.logprob,
                    bytes=logprob.bytes,
                    token=logprob.token,
                )
            )
    else:
        logprobs = None
    return logprobs

def extract_logprobs(response) -> Optional[List[List[LogProb]]]:
    if response.choices[0].logprobs is not None:
        choice_logprobs = []
        for choice in response.choices:
            choice_logprobs.append(extract_logprobs_choice(choice))
    else:
        choice_logprobs = None
    return choice_logprobs

def extract_logprobs_stream(response) -> Optional[List[List[LogProb]]]:
    if response.choices[0].logprobs is not None:
        logprobs = [extract_logprobs_choice(response.choices[0])]
    else:
        logprobs = None
    return logprobs
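
If you only need the logprobs that arrive with each streamed chunk, extract_logprobs_stream can be applied to the raw attribute of every ChatResponse as it is yielded. A minimal sketch of that usage, assuming the raw chunks follow the OpenAI-style shape with a choices list (the same shape the utility functions above already assume):

# Collect the logprobs of each streamed chunk as it arrives
all_logprobs = []  # one List[LogProb] entry per chunk that carried logprobs

for chat_response in streaming_response.chat_stream:
    raw_chunk = chat_response.raw
    # Guard against empty/keep-alive chunks that carry no choices
    if raw_chunk is not None and raw_chunk.choices:
        chunk_logprobs = extract_logprobs_stream(raw_chunk)
        if chunk_logprobs:
            all_logprobs.extend(chunk_logprobs)

print("Streamed logprobs:", all_logprobs)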

By following these steps, you can access the log probabilities for each token in a streaming response from the StreamingAgentChatResponse object [1].

To continue talking to Dosu, mention @dosu.

Sanjit0910 commented 1 month ago

But I am just not able to iterate over the generator to get the ChatResponse objects. I checked, and chat_stream is not None.

dosubot[bot] commented 1 month ago

To iterate over the generator in the chat_stream attribute of the StreamingAgentChatResponse object, you can use a simple for loop. Here is an example:

# Assuming `streaming_response` is an instance of StreamingAgentChatResponse
for chat_response in streaming_response.chat_stream:
    # Process each ChatResponse object
    print(chat_response)

If the chat_stream is an asynchronous generator, you would use an asynchronous for loop:

# Assuming `streaming_response` is an instance of StreamingAgentChatResponse
async for chat_response in streaming_response.chat_stream:
    # Process each ChatResponse object
    print(chat_response)
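
A related caveat (an observation about how these chat engines usually manage the stream, not something shown in the snippets above): StreamingAgentChatResponse also exposes a response_gen generator, but it yields plain text deltas rather than ChatResponse objects, so the raw payload with the logprobs is only reachable through chat_stream. The engine may also start a background thread that drains chat_stream in order to write the response to memory, so if your own loop over chat_stream yields nothing even though the attribute is not None, the generator may already have been consumed.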

These loops will allow you to iterate over each ChatResponse object yielded by the chat_stream generator. If you need to add back an initial chunk to the stream, you can use the _add_back_chunk_to_stream or _async_add_back_chunk_to_stream methods provided in the ReActAgent class:

# Synchronous example
updated_stream = agent._add_back_chunk_to_stream(chunk, streaming_response.chat_stream)
for chat_response in updated_stream:
    print(chat_response)

# Asynchronous example
updated_stream = await agent._async_add_back_chunk_to_stream(chunk, streaming_response.chat_stream)
async for chat_response in updated_stream:
    print(chat_response)

These methods help in managing the stream by adding back an initial chunk to the beginning of the chat_stream [1][2][3][4][5].

To continue talking to Dosu, mention @dosu.