Closed: taoari closed this issue 4 months ago.
Yeah this is popular enough that we could consider adding this. @yvrjsharma would you like to take this on?
Can I do this?
Go for it @Saigenix! We'd welcome a contribution
Hello, I don't have a paid OpenAI account, which is why I can't check whether it's working or not. Can you check out this code?
LangChain example with streaming support
This will be the same as the above example, but with extra streaming support. Some chat models provide a streaming response. This means that instead of waiting for the entire response to be returned, you can start processing it as soon as it's available. This is useful if you want to display the response to the user as it's being generated, or if you want to process the response as it's being generated.
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, AIMessage
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
import gradio as gr

# os.environ["OPENAI_API_KEY"] = ""  # Replace with your key

def predict(message, history):
    history_langchain_format = []
    for human, ai in history:
        history_langchain_format.append(HumanMessage(content=human))
        history_langchain_format.append(AIMessage(content=ai))
    history_langchain_format.append(HumanMessage(content=message))
    gpt_response = ChatOpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], temperature=0, openai_api_key="...")
    resp = gpt_response(history_langchain_format)
    return resp.content

gr.ChatInterface(predict).launch()
@Saigenix This does not work. For Gradio streaming, the predict function should be a generator function. The complex part is that LangChain does not return a generator even with streaming=True.
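For illustration, here is a minimal sketch of the generator shape gr.ChatInterface expects for streaming (no LLM involved; the hard-coded reply is just a stand-in):

import time
import gradio as gr

def predict(message, history):
    # A streaming fn is a generator; each yield replaces the displayed
    # bot message, so yield the accumulated text, not just the new token.
    reply = "This is a stand-in for a streamed model reply."
    partial = ""
    for token in reply.split():
        partial += token + " "
        time.sleep(0.05)  # simulate per-token latency
        yield partial

gr.ChatInterface(predict).launch()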
Wait, I will try another way.
Hey, I tried this:
# Callbacks support token-wise streaming
class StreamingStdOutCallbackHandler(BaseCallbackHandler):
    def __init__(self, initial_text=""):
        self.text = initial_text

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # "/" is a marker to show the difference
        # you don't need it
        self.text += token + "/"
Do you know of any way to update the chatbot content when the on_llm_new_token() function gets called?
@Saigenix I do not know a simple way to achieve this. I have seen examples using subprocess or websockets, but the code is quite difficult to understand, so I am wondering whether this can be implemented at all. LangChain has built-in stdout and Streamlit callback handlers, and streaming works for both, so I do not know why LangChain does not have a built-in Gradio callback. Is this really hard to implement?
@taoari Yes, this would be much easier if they provided a callback function for Gradio.
I couldn't find a simple way to do it, but I found a pragmatic solution. I'm not saying this is the recommended way; I just needed to run a demo session with a client and wanted to use streaming with my architecture. This is how I did it, adapting this code to use a Queue and a generator function (thanks to this guy). Define the callback:
from langchain.callbacks.base import BaseCallbackHandler

class QueueCallback(BaseCallbackHandler):
    """Callback handler for streaming LLM responses to a queue."""

    def __init__(self, q):
        self.q = q

    def on_llm_new_token(self, token: str, **kwargs: any) -> None:
        self.q.put(token)

    def on_llm_end(self, *args, **kwargs: any) -> None:
        return self.q.empty()
The stream function:
from queue import Queue, Empty
from threading import Thread
from typing import Generator

def stream(input_text) -> Generator:
    # Create a Queue
    q = Queue()
    job_done = object()

    # Logic for loading the chain you want to use should go here.
    llm = ChatOpenAI(
        streaming=True,
        model='gpt-3.5-turbo-0613',
        callbacks=[QueueCallback(q)],
        temperature=0
    )
    conversation = ConversationChain(
        prompt=PROMPT,
        llm=llm,
        verbose=True
    )

    # Create a function to call - this will run in a thread
    def task():
        resp = conversation.run(input_text)
        q.put(job_done)

    # Create a thread and start the function
    t = Thread(target=task)
    t.start()

    content = ""

    # Get each new token from the queue and yield it from our generator
    while True:
        try:
            next_token = q.get(True, timeout=1)
            if next_token is job_done:
                break
            content += next_token
            yield next_token, content
        except Empty:
            continue
and finally, calling it from ChatInterface:
def ask_llm(message, history):
    for next_token, content in stream(message):
        yield content

chatInterface = gr.ChatInterface(
    fn=ask_llm,
    ...
Hope it helps, good luck!
It's very helpful, thank you.
This should now be possible using LangChain's Chain.stream() function. E.g. for a chatbot:
from operator import itemgetter
import os

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.memory import ConversationBufferMemory
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

# Initialize chat model
llm = ChatOpenAI(openai_api_key=os.environ["OPENAI_API_KEY"])

# Define a prompt template
template = """You are a helpful AI assistant. You give specialized advice on travel.
"""
chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", template),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{input}"),
    ]
)

# Create conversation history store
memory = ConversationBufferMemory(memory_key="history", return_messages=True)

# Initialize chain
# chain = LLMChain(
#     llm=llm,
#     prompt=chat_prompt,
#     # verbose=True,
#     memory=memory,
# )
chain = (
    RunnablePassthrough.assign(
        history=RunnableLambda(memory.load_memory_variables) | itemgetter("history")
    )
    | chat_prompt
    | llm
)

def stream_response(input, history):
    if input is not None:
        # ChatInterface struggles with rendering stream
        for response in chain.stream({"input": input}):
            print(response.content)
            yield response.content

# UI
import gradio as gr

gr.ChatInterface(stream_response).queue().launch()
From the print statement, you can see that the response is being generated correctly. Unfortunately, ChatInterface fails to display the results. Does anyone know why this might be?
I can debug this more deeply at my end, but have you already tried setting debug=True in launch() to see what error the log is showing?
Hi @yvrjsharma - thanks for the response. Just tried debug=True. I see no error.
I also tried stepping through each generation. I see now that each word is replacing the previous word, instead of adding to it. This seems to happen in both Chrome and Safari. Thoughts?
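(In other words, ChatInterface treats each yielded value as the full message to display, so yielding only the latest chunk replaces what was shown before. A minimal sketch of the difference, reusing the chain defined above:)

def stream_response_word_by_word(input, history):
    # Yielding only the new chunk: each token replaces the previous one in the UI
    for response in chain.stream({"input": input}):
        yield response.content

def stream_response_accumulated(input, history):
    # Accumulate and yield the running total: the UI streams as expected
    partial_message = ""
    for response in chain.stream({"input": input}):
        partial_message += response.content
        yield partial_message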
Ah, sweet, thanks for sharing this. I think we just need to return the full message from the stream_response function instead of a single word. I used your above repro to come up with a working solution below. Do you want to try it out and see if that works for you too?
from operator import itemgetter
import os

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.memory import ConversationBufferMemory
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

# Initialize chat model
llm = ChatOpenAI(openai_api_key="sk-your-key")

# Define a prompt template
template = """You are a helpful AI assistant. You give specialized advice on travel.
"""
chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", template),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{input}"),
    ]
)

# Create conversation history store
memory = ConversationBufferMemory(memory_key="history", return_messages=True)

chain = (
    RunnablePassthrough.assign(
        history=RunnableLambda(memory.load_memory_variables) | itemgetter("history")
    )
    | chat_prompt
    | llm
)

def stream_response(input, history):
    if input is not None:
        partial_message = ""
        # ChatInterface struggles with rendering stream
        for response in chain.stream({"input": input}):
            partial_message += response.content
            print(partial_message)
            yield partial_message

# UI
import gradio as gr

gr.ChatInterface(stream_response).queue().launch(debug=True)
Ah I misunderstood the implementation. That works for me! Thank you 😁
For others who come across this thread: others may have better solutions, but one way to keep the chain's memory in sync with the chat history is to update the stream_response function as follows:
def stream_response(message, history):
    print(f"Input: {message}. History: {history}\n")

    if history:
        human, ai = history[-1]
        memory.chat_memory.add_user_message(HumanMessage(content=human))
        memory.chat_memory.add_ai_message(AIMessage(content=ai))
        print(f"Memory in chain: \n{memory.chat_memory} \n")

    if message is not None:
        partial_message = ""
        # ChatInterface struggles with rendering stream
        for response in chain.stream({"input": message}):
            partial_message += response.content
            # print(partial_message)
            yield partial_message
Thanks @bent-verbiage, I finished it without a memory store.
import os

from langchain_openai import ChatOpenAI
from langchain.schema import AIMessage, HumanMessage
import gradio as gr

os.environ["OPENAI_API_KEY"] = "sk-xxx"

# Initialize chat model
llm = ChatOpenAI(temperature=0.7, model='gpt-4', streaming=True)

def stream_response(message, history):
    print(f"Input: {message}. History: {history}\n")

    history_langchain_format = []
    for human, ai in history:
        history_langchain_format.append(HumanMessage(content=human))
        history_langchain_format.append(AIMessage(content=ai))

    if message is not None:
        history_langchain_format.append(HumanMessage(content=message))
        partial_message = ""
        for response in llm.stream(history_langchain_format):
            partial_message += response.content
            yield partial_message

iface = gr.ChatInterface(
    stream_response,
    textbox=gr.Textbox(placeholder="Message ChatGPT...", container=False, scale=7),
)

iface.launch(share=True)
How can I simulate a streaming output response in my code?
import logging
import sys

import torch
import requests
import subprocess

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms.huggingface import HuggingFaceLLM
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.legacy.embeddings.langchain import LangchainEmbedding
from llama_index.core.prompts.prompts import SimpleInputPrompt

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Open the left folder icon menu, create a folder named sample, and upload documents (PDFs)
documents = SimpleDirectoryReader("/content/G2").load_data()

system_prompt = "You are a Q&A assistant. Your goal is to answer questions as accurately as possible based on the instructions and context provided."
query_wrapper_prompt = SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True}
)

embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=embed_model
)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

import gradio as gr

# Define your query_index function here
def query_index(query, history):
    query_engine = index.as_query_engine()
    response = query_engine.query(query)
    return str(response)

demo = gr.ChatInterface(
    fn=query_index,
    title="G2Bot"
)

# Launch the Gradio Chat Interface
demo.launch(debug=True)
Would love help asap, thank you in advance.
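One possible direction (an untested sketch, assuming llama_index's streaming query engine and that the underlying LLM supports streaming) is to build the query engine with streaming=True and drain its response_gen token generator from a ChatInterface generator function:

import gradio as gr

def query_index(query, history):
    # Build a streaming query engine from the index created above
    query_engine = index.as_query_engine(streaming=True)
    streaming_response = query_engine.query(query)
    partial = ""
    for token in streaming_response.response_gen:
        partial += token
        yield partial  # yield the accumulated message each time

demo = gr.ChatInterface(fn=query_index, title="G2Bot")
demo.launch(debug=True)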
I looked into this, but the LangChain docs offer so many different ways to stream LLMs that I'm not sure which example would be best to add to our docs. I'd recommend just using the OpenAI streaming example and modifying it as necessary: https://www.gradio.app/guides/creating-a-chatbot-fast#a-streaming-example-using-openai
If someone has concrete issues getting this to work, it's best to ask in our Discord server. (I'll close this issue.)
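For reference, a minimal sketch of that guide's approach (assuming the openai v1 Python client with OPENAI_API_KEY set in the environment; the model name is just an example):

from openai import OpenAI
import gradio as gr

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict(message, history):
    # Rebuild the conversation in the OpenAI message format
    messages = []
    for human, assistant in history:
        messages.append({"role": "user", "content": human})
        messages.append({"role": "assistant", "content": assistant})
    messages.append({"role": "user", "content": message})

    stream = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, stream=True
    )
    partial = ""
    for chunk in stream:
        partial += chunk.choices[0].delta.content or ""
        yield partial  # yield the accumulated message so ChatInterface streams it

gr.ChatInterface(predict).launch()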
Is your feature request related to a problem? Please describe.
It would be great if there could be a Gradio LangChain example with streaming support.
There is a LangChain example in the Guide: https://www.gradio.app/guides/creating-a-chatbot-fast#a-langchain-example, but it has no streaming support. LangChain supports streaming via callbacks (https://python.langchain.com/docs/modules/model_io/models/chat/streaming), but the official example only streams to stdout. How can we stream a LangChain LLM to Gradio Chatbot messages?
Describe the solution you'd like
A Gradio LangChain example with streaming support is provided at https://www.gradio.app/guides/creating-a-chatbot-fast#a-langchain-example.