Closed AvisP closed 6 months ago
🤖
Hello @AvisP! Nice to meet you, I'm Dosu, a friendly bot maintaining the LangChain repository. I'm here to assist you with bugs, answer questions, and help you contribute to our project, while we wait for a human maintainer. I'm currently reviewing your issue and will provide you with a detailed response shortly.
🤖
Based on the context provided, it seems like the Llama2 model is entering a lengthy question-answer sequence because of the structure defined in the `FORMAT_INSTRUCTIONS` string in the `prompt.py` file. This structure requires the model to go through a sequence of Thought/Action/Observation steps before providing a final answer.
If you want to prevent this lengthy sequence, you can modify the `FORMAT_INSTRUCTIONS` string to simplify the structure. For example, you could remove the requirement for multiple Thought/Action/Observation steps and allow the model to provide a final answer immediately after the first action. Here's how you could modify `FORMAT_INSTRUCTIONS`:
```python
FORMAT_INSTRUCTIONS = """Use a json blob to specify a tool by providing an action key (tool name) and an action_input key (tool input).

Valid "action" values: "Final Answer" or {tool_names}

Provide only ONE action per $JSON_BLOB, as shown:

{{{{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}}}}

Follow this format:

Question: input question to answer
Action:
$JSON_BLOB
Observation: action result
Thought: I know what to respond
Action:
{{{{
  "action": "Final Answer",
  "action_input": "Final response to human"
}}}}"""
```
This modification will allow the model to provide a final answer immediately after the first action, thus preventing the lengthy question-answer sequence. You can find more details about this in the `prompt.py` file.
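For context, the agent consumes a reply in this format by extracting the first JSON blob from the model output and dispatching on its `"action"` key. A minimal sketch of that parsing step (the `extract_action` helper is hypothetical, for illustration only; LangChain's actual output parser is more robust):

```python
import json
import re

def extract_action(text):
    """Pull the first JSON blob out of a model response.

    If no blob is found, fall back to treating the raw text
    as the final answer.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return {"action": "Final Answer", "action_input": text.strip()}
    return json.loads(match.group(0))

reply = 'Action:\n{"action": "Final Answer", "action_input": "Hello!"}'
print(extract_action(reply)["action_input"])  # Hello!
```

This is why malformed or repeated blobs from a model that keeps "thinking" lead to extra Thought/Action/Observation rounds.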
Additionally, there are similar issues that have been solved in the LangChain repository. For instance, "Weird: LlamaCpp prints questions and asnwers that I did not ask!" and "I just said hi. model is in multiple rounds of conversations with himself. Why?". The suggested solution in these issues is to change the API call from `/v1/completions` to `/v1/chat/completions` to avoid the "self-answering" situation. This can be done by using `ChatOpenAI` instead of `OpenAI` in the `LLMChain` or `ConversationChain`. Here is an example:
```python
ConversationChain(
    llm=ChatOpenAI(
        streaming=True,
        temperature=0,
        callback_manager=stream_manager,
        model_kwargs={"stop": "Human:"},
    ),
    memory=ConversationBufferWindowMemory(k=2),
)
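If your backend does not honor a `stop` kwarg like the one above, you can achieve a similar effect by truncating the decoded output yourself at the first stop sequence, dropping any self-generated follow-up turns. A minimal sketch (the `truncate_at_stop` helper name is hypothetical):

```python
def truncate_at_stop(text, stop_sequences=("Human:", "\nQuestion:")):
    """Cut generated text at the earliest stop sequence, discarding
    any self-answered turns the model appended after its real reply."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()

raw = "Hello! How can I help you?\nHuman: What is 2+2?\nAI: 4"
print(truncate_at_stop(raw))  # Hello! How can I help you?
```

This is a post-hoc workaround, not a fix for the prompt itself; native stop sequences are preferable when the API supports them.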
I hope this helps! If you have any other questions, feel free to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Hi, @AvisP,
I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, you are experiencing a Llama-2-13B model entering a lengthy question-answer sequence instead of responding to the initial greeting. I provided a detailed response suggesting modifications to the `FORMAT_INSTRUCTIONS` string in the `prompt.py` file to simplify the structure and prevent the lengthy sequence. Additionally, I referenced similar issues that have been solved in the LangChain repository and provided suggestions for API call changes to avoid the "self-answering" situation. The response is meant to be a starting point for further research.
Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.
Thank you!
This is still happening. Did you find a solution to this?
You can try the following: a solution without using langchain or llama_index.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Pipeline

model_id = "google/gemma-2b-it"  # or mistralai/Mistral-7B-v0.1

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# chat buffer holding the running conversation
buffer = []

# embed a message in the chat-template format
def embed_message(message, role):
    return {"role": role, "content": message}

# custom pipeline that keeps a chat history with optional lookback
class ChatBufferPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        # how many past messages to include from the buffer
        if "lookback" in kwargs:
            preprocess_kwargs["lookback"] = kwargs["lookback"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, prompt, lookback=None):
        # initial system-style priming messages
        messages = [
            {
                "role": "user",
                "content": "You are a friendly chatbot who answers user questions. You can use the previous examples if this helps you.",
            },
            {
                "role": "assistant",
                "content": "Sounds great! I'm happy to be your friendly chatbot assistant. I'm here to answer your questions and provide you with helpful information. So, what would you like to know today?",
            },
        ]
        # prepend chat history from the buffer
        if lookback:
            messages += buffer[-lookback:]
        # embed the new user message and record it in the buffer
        user_message = embed_message(prompt, "user")
        messages.append(user_message)
        buffer.append(user_message)
        messages = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        return self.tokenizer(messages, return_tensors="pt").input_ids.cuda()

    def _forward(self, model_inputs):
        outputs = self.model.generate(model_inputs, max_new_tokens=250, min_new_tokens=20)
        return {"outputs": outputs, "inputs": model_inputs}

    def postprocess(self, model_outputs):
        outputs = model_outputs["outputs"]
        inputs = model_outputs["inputs"]
        # decode only the newly generated tokens
        assistant_output = self.tokenizer.decode(
            outputs[0][len(inputs[0]):], skip_special_tokens=True
        )
        buffer.append(embed_message(assistant_output, "assistant"))
        full_dialog = self.tokenizer.decode(outputs[0])
        return assistant_output, full_dialog

chatpipe = ChatBufferPipeline(model=model, tokenizer=tokenizer)

_, dialog = chatpipe("My favorite color is blue, what is yours?")
_, dialog = chatpipe("What did I tell you my favorite color was?", lookback=10)
```
I was interested in making it work while sticking to Llama2 (and not changing the model) 😅 I also tried calling Llama2 using the Bedrock API (without langchain) but still observed the same behavior.
Try it with the Llama2 `model_id`; it may work. Also look into Ollama, which runs locally, where you should not hit this issue.
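Ollama exposes a local HTTP API on port 11434, so you can query a local Llama 2 without langchain at all. A minimal sketch (assumes Ollama is running and `llama2` has been pulled; the `build_chat_payload` helper is hypothetical):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default chat endpoint

def build_chat_payload(messages, model="llama2"):
    """Assemble the request body for Ollama's /api/chat endpoint."""
    return {"model": model, "messages": messages, "stream": False}

payload = build_chat_payload([{"role": "user", "content": "Hi there!"}])

# Uncomment to send the request against a running Ollama server:
# req = urllib.request.Request(
#     OLLAMA_URL,
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["message"]["content"])
```

Because `/api/chat` takes role-tagged messages, the server applies the model's own chat template, which avoids the raw-completion "self-answering" behavior.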
Issue you'd like to raise.
I was following the tutorial here and, instead of OpenAI, I was trying to use a Llama2 model. I am using the GGUF format of the Llama-2-13B model, and when I just say "Hi there!" it goes into the following question-answer sequence. Why is that happening, and how can I prevent it?
I am new to this and any help or suggestion would be appreciated!
Note that the model is generating the subsequent questions and answering itself after the first response of "Hello! How can I help you?". The code snippet I am using is provided below.
Suggestion:
No response