langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com

Tutorial is not working with local model (e.g. Llama 3) due to chat template #26656

Closed: NourOM02 closed this issue 1 week ago

NourOM02 commented 3 weeks ago

URL

https://python.langchain.com/docs/tutorials/sql_qa/

Checklist

Issue with current documentation:

Goal

Create a SQL agent that interacts with a SQL database using a local model.

My implementation

I am trying to use a local model from Hugging Face and create a chat model instance using the ChatHuggingFace class. I implemented the same agent code as explained in the tutorial above, with the changes necessary to work with a Hugging Face model.

Configuring LangSmith and HuggingFace tokens

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = ""
os.environ["HF_TOKEN"] = ""

Needed packages

# Packages required to load the model
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

# Packages required to set SQL agent with langchain
from langchain_community.utilities import SQLDatabase
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline
from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langchain_core.messages import SystemMessage, HumanMessage
from langgraph.prebuilt import create_react_agent

Set up the connection to the database

db = SQLDatabase.from_uri("sqlite:///Chinook.db")
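
To confirm the connection works, here is a quick sanity check (a sketch following the tutorial's pattern; it assumes Chinook.db sits next to the script):

# Sanity check (sketch): confirm the dialect and that the Chinook tables are visible
print(db.dialect)                   # expected: "sqlite"
print(db.get_usable_table_names())  # expected: Album, Artist, Customer, ...
print(db.run("SELECT * FROM Artist LIMIT 3;"))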

Set up the LLM

# Model to use
model_id = "google/gemma-2-2b-it"

# Quantization Configuration:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype="auto"
)

Build a Hugging Face pipeline to use the model with the LangChain package

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=100, top_k=50)
llm = HuggingFacePipeline(pipeline=pipe)
llm = ChatHuggingFace(llm=llm)

Create Agent


toolkit = SQLDatabaseToolkit(db=db, llm=llm)
tools = toolkit.get_tools()

SQL_PREFIX = """You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct SQLite query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most 5 results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the below tools. Only use the information returned by the below tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

To start you should ALWAYS look at the tables in the database to see what you can query.
Do NOT skip this step.
Then you should query the schema of the most relevant tables."""

system_message = SystemMessage(content=SQL_PREFIX)

agent_executor = create_react_agent(llm, tools, state_modifier=system_message)

Run the agent

for s in agent_executor.stream(
    {"messages": [HumanMessage(content="Which country's customers spent the most?")]}
):
    print(s)
    print("----")

Expected behaviour

As demonstrated in the tutorial, the steps taken by the LLM should be streamed before the final answer is returned.

Actual behaviour

ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating

Additional experiments

I tried to inspect the chat_template of the loaded tokenizer using:

tokenizer.chat_template

I get the following template (which means the tokenizer itself is fine): "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"

However, when I try to access the chat template after initializing ChatHuggingFace, I notice that there is no chat_template, using:

llm.tokenizer.chat_template

My conclusion is that there is a problem with ChatHuggingFace that causes the chat_template to go missing!

Idea or request for content:

No response

tibor-reiss commented 2 weeks ago

Could you please try llm = ChatHuggingFace(llm=llm, tokenizer=llm.pipeline.tokenizer)?
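
In full, a minimal sketch of that fix, assuming the pipeline built in the original report (passing the tokenizer explicitly means ChatHuggingFace does not have to resolve one itself, so the chat template already attached to it is preserved):

# Sketch: hand the pipeline's own tokenizer (with its chat_template) to ChatHuggingFace
llm = HuggingFacePipeline(pipeline=pipe)
chat_model = ChatHuggingFace(llm=llm, tokenizer=llm.pipeline.tokenizer)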

tibor-reiss commented 2 weeks ago

FYI: updated the docs for easier debugging in https://github.com/huggingface/transformers/pull/33652

NourOM02 commented 1 week ago

I tested your solution, which causes no errors, but the model isn't able to make any tool calls (it finishes the run within one completion, even though the agent should be more robust). I tried Ollama to see whether the problem was the model's abilities, which wasn't the case (a sketch of that cross-check is below).
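
A minimal sketch of that Ollama cross-check, assuming the langchain-ollama package is installed and a llama3 model has been pulled locally:

from langchain_ollama import ChatOllama

# Sketch: swap in a local Ollama chat model and rerun the same agent
ollama_llm = ChatOllama(model="llama3", temperature=0)
agent_executor = create_react_agent(ollama_llm, tools, state_modifier=system_message)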

tibor-reiss commented 1 week ago

google/gemma-2-2b-it does not support system messages: "Template error: syntax error: System role not supported"

I also tried other models (e.g. meta-llama) and other methods (e.g. HuggingFaceEndpoint), and they work with ChatHuggingFace (a sketch below).
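
A minimal sketch of the HuggingFaceEndpoint variant; the repo_id is an assumption for illustration, and any chat-tuned model you have access to should do:

from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

# Sketch: call a hosted model instead of a local pipeline (repo_id is an assumption)
endpoint = HuggingFaceEndpoint(repo_id="meta-llama/Meta-Llama-3-8B-Instruct", max_new_tokens=512)
chat_model = ChatHuggingFace(llm=endpoint)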

At this point it seems to me like a model problem, and I would recommend that you provide an MRE (minimal reproducible example), preferably with a model which works / used to work with some earlier version.

IMHO, the original issue raised (the missing chat_template) seems to be resolved.