langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

ChatHuggingFace always returns only 100 tokens as response without considering the `max_new_tokens` parameter #25219

Open npn-zakipoint opened 1 month ago

npn-zakipoint commented 1 month ago


Example Code

from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_community.callbacks import get_openai_callback

llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    temperature=0.01,
    max_new_tokens=2048,
    huggingfacehub_api_token=HUGGINGFACE_API_KEY,
)
llm = ChatHuggingFace(llm=llm)

messages = [
    (
        "system",
        "You are a smart AI that understand the tabular data structure.",
    ),
    ("user", f"{prompt}"),
]

with get_openai_callback() as cb:
    response = llm.invoke(messages)
    print(cb)

if not isinstance(response, str):
    response = response.content

print(response)

Error Message and Stack Trace (if applicable)

Tokens Used: 1668
    Prompt Tokens: 1568
    Completion Tokens: 100
Successful Requests: 1

Description

I am trying to use the Mistral-7B model from Hugging Face. When I invoke HuggingFaceEndpoint directly, I get the expected answer, but when I use ChatHuggingFace the response is always cut off at 100 completion tokens, regardless of the `max_new_tokens` parameter. I have gone through the existing issues but couldn't find a solution yet.

System Info

System Information

OS: Linux
OS Version: #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jun 18 14:36:16 UTC 2
Python Version: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0]

Package Information

langchain_core: 0.2.29
langchain: 0.2.12
langchain_community: 0.2.11
langsmith: 0.1.98
langchain_huggingface: 0.0.3
langchain_text_splitters: 0.2.2

AnandUgale commented 1 month ago

I am facing a similar issue with the Llama-3.1-8B-Instruct model. Is there any way we can increase the response token limit to more than 100? @npn-zakipoint

npn-zakipoint commented 1 month ago

@AnandUgale, I couldn't find any method to get more than 100 tokens. You can work around it by invoking the LLM directly through HuggingFaceEndpoint instead of ChatHuggingFace, but I found that it hallucinates more often, because ChatHuggingFace might use the instruction-tuned model instead of the base model.
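
For reference, a minimal sketch of that direct-endpoint workaround, reusing the repo_id, prompt, and HUGGINGFACE_API_KEY values from the example at the top of this issue (illustrative only, not a verified fix):

from langchain_huggingface import HuggingFaceEndpoint

# Invoking the endpoint directly honours max_new_tokens, but no chat
# template is applied, so the prompt goes in as a plain string.
llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    temperature=0.01,
    max_new_tokens=2048,
    huggingfacehub_api_token=HUGGINGFACE_API_KEY,
)
response = llm.invoke(prompt)  # returns a plain string, not an AIMessage
print(response)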

Soumil32 commented 1 month ago

I have worked on a similar issue, #25136. Try passing max_tokens into ChatHuggingFace as well. From my experience, ChatHuggingFace might be overriding the max_tokens you passed into HuggingFaceEndpoint. I only have experience with HuggingFacePipeline, though. Would love to hear if it works!
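
A minimal sketch of that suggestion, reusing llm and messages from the original example; whether ChatHuggingFace accepts and forwards this keyword depends on the installed langchain_huggingface version, so treat it as an experiment rather than a confirmed fix:

# Attempt to re-specify the limit at the chat-model level; later comments
# in this thread report that it may still be ignored.
chat = ChatHuggingFace(llm=llm, max_tokens=2048)
response = chat.invoke(messages)
print(response.content)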

AnandUgale commented 1 month ago

Hi @Soumil32,

Thank you for your suggestion. For the locally downloaded model, it asks for a token or Hugging Face API key. I tried modifying huggingface.py, but now I am getting the following error:

ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating.
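
That error usually means the locally loaded tokenizer has no chat template attached. A hedged sketch of assigning one manually with transformers (the path and the Jinja template below are generic placeholders, not the official Llama 3.1 template; prefer the template from the model card if one exists):

from transformers import AutoTokenizer

# "path/to/local/model" stands in for the locally downloaded model directory.
tokenizer = AutoTokenizer.from_pretrained("path/to/local/model")

# Minimal Jinja chat template; real instruct models expect their own special tokens.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}assistant:{% endif %}"
)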

npn-zakipoint commented 1 month ago

Hi @Soumil32,

Thank you for your suggestion. I already tried setting max_tokens on ChatHuggingFace as well, but it didn't help; I am still getting a 100-token response. The issue persists.

phucdev commented 1 month ago

I encountered this issue as well. The max_tokens or max_new_tokens parameter is never passed to the InferenceClient.

I was using ChatHuggingFace with tool calling and eventually solved this by setting max_tokens via the bind_tools method because those keyword arguments actually get passed to the InferenceClient. E.g.

chat = ChatHuggingFace(llm=llm).bind_tools(
    tools=tools, 
    max_tokens=2048
)

But this only works if you are actually binding tool-like objects to the model, so I don't know if this is applicable to your case. I haven't found another solution for this.
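
For completeness, a short usage sketch of that approach, assuming llm, messages, and a non-empty tools list are defined as in the comments above (this mirrors the workaround described here, not documented behaviour):

chat = ChatHuggingFace(llm=llm).bind_tools(tools=tools, max_tokens=2048)
response = chat.invoke(messages)
print(response.content)
# If the limit is really being forwarded, finish_reason should no longer be
# "length" at exactly 100 completion tokens.
print(response.response_metadata.get("finish_reason"))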

michael-newsrx commented 1 month ago

> I encountered this issue as well. The max_tokens or max_new_tokens parameter is never passed to the InferenceClient.
>
> I was using ChatHuggingFace with tool calling and eventually solved this by setting max_tokens via the bind_tools method because those keyword arguments actually get passed to the InferenceClient. E.g.
>
> chat = ChatHuggingFace(llm=llm).bind_tools(
>     tools=tools,
>     max_tokens=2048
> )
>
> But this only works if you are actually binding tool-like objects to the model, so I don't know if this is applicable to your case. I haven't found another solution for this.

It looks like ChatHuggingFace completely discards the parameters provided when the LLM was configured and only uses what is supplied via the bind_tools method.

This seems to work OK when using dedicated endpoints:

import os

from huggingface_hub import InferenceEndpoint
from IPython.display import display_markdown
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

# Set the api token (hf_bearer_token() and local_utils below are this commenter's own helpers)

if not os.getenv("HUGGINGFACE_API_TOKEN"):
    os.environ["HUGGINGFACE_API_TOKEN"] = hf_bearer_token()
bearer_token = os.environ["HUGGINGFACE_API_TOKEN"]
# Endpoint

ep: InferenceEndpoint = local_utils.hf.inference.llama_31_8B_Instruct(wait=True)
llm = HuggingFaceEndpoint(
    endpoint_url=ep.url,  # + chat_completions
    huggingfacehub_api_token=bearer_token,
    max_new_tokens=8192,
    repetition_penalty=1.2,
    streaming=True,
    task="text-generation",
)

# Chat

chat_completions = "/v1/chat/completions"
llm.client.model += chat_completions
llm.async_client.model += chat_completions
chat_model = ChatHuggingFace(llm=llm, verbose=True).bind_tools(tools=[], max_tokens=8192)

# Test chat data

system_msg = SystemMessage("You are a senior level Visual Foxpro programmer assisting interns.")
user_msg = HumanMessage("Write the game snake using Visual Foxpro 9 for a DOS terminal.")
input_message = [system_msg, user_msg]

# Invoke

output_msg = chat_model.invoke(input_message)

# Display result

md_text = output_msg.content
display_markdown(md_text, raw=True)
display_markdown("**Finish Reason**: " + output_msg.response_metadata["finish_reason"], raw=True)

The issue title and description need updating to reflect that ChatHuggingFace ignores the parameters set during LLM configuration.

Why isn't there a way to set these dynamic parameters on a per-call basis?

Or are we supposed to call bind_tools for each invocation, setting the values each time and then discarding the chat object?

-Mike

magallardo commented 1 month ago

Following this, as I am experiencing the same problem.