npn-zakipoint opened 1 month ago
I am facing a similar issue with the Llama-3.1-8B-Instruct model. Is there any way we can increase the response token limit to more than 100? @npn-zakipoint
@AnandUgale, I couldn't find any method to get more than 100 tokens. But you can do that by invoking the LLM directly using `HuggingFaceEndpoint` instead of `ChatHuggingFace`. However, I found it hallucinates often, because `ChatHuggingFace` might be using the instruction-tuned model instead of the base model.
I have worked on a similar issue, #25136. Try passing `max_tokens` again into `ChatHuggingFace`. From my experience, it might be overriding the `max_tokens` you passed into `HuggingFaceEndpoint`. I only have experience with `HuggingFacePipeline`, though. Would love to hear if it works!
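For reference, this is roughly the pattern that worked for me with `HuggingFacePipeline` (a minimal sketch; the model id and the 2048 limit are placeholders, not values from this thread):

```python
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

# Set the generation limit on the underlying pipeline itself, so a
# downstream default (like 100 tokens) has nothing to fall back to.
llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 2048},
)

chat = ChatHuggingFace(llm=llm)
```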
Hi @Soumil32,
Thank you for your suggestion. For the locally downloaded model, it is asking for a token or Hugging Face API key. I tried modifying `huggingface.py`, but now I am getting the following error:

```
ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
```
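(For anyone hitting the same `ValueError`: transformers raises it when the tokenizer ships without a chat template, which is common for base, non-chat models, so one has to be assigned manually. A minimal sketch; the local path and the ChatML-style template are only illustrations, not necessarily what this model expects:)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/local-model")  # hypothetical local path

# Minimal ChatML-style Jinja template; adjust to whatever format the
# model was actually trained on.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)
```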
Hi @Soumil32,
Thank you for your suggestion. I had already run those experiments, setting `max_tokens` on `ChatHuggingFace` as well, but it didn't get rid of the above issue. I am still getting 100 tokens in the response; the issue persists.
I encountered this issue as well. The `max_tokens` or `max_new_tokens` parameter is never passed to the `InferenceClient`.
I was using `ChatHuggingFace` with tool calling and eventually solved this by setting `max_tokens` via the `bind_tools` method, because those keyword arguments actually get passed to the `InferenceClient`. E.g.
```python
chat = ChatHuggingFace(llm=llm).bind_tools(
    tools=tools,
    max_tokens=2048
)
```
But this only works if you are actually binding tool-like objects to the model, so I don't know if this is applicable to your case. I haven't found another solution for this.
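Invoking the bound model then works as usual, e.g. (placeholder prompt; `response_metadata` is the same field used further down in this thread):

```python
response = chat.invoke([("user", "Explain the dataset in detail.")])  # placeholder prompt
print(response.response_metadata.get("finish_reason"))  # no longer cut off at 100 tokens
```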
It looks like `ChatHuggingFace` completely discards the LLM-provided parameters for invocations and only seems to use what is supplied via the `bind_tools` method.
This seems to work OK when using dedicated endpoints:
```python
import os

from huggingface_hub import InferenceEndpoint
from IPython.display import display_markdown
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

# Set the API token (hf_bearer_token() and local_utils are local helpers)
if not os.getenv("HUGGINGFACE_API_TOKEN"):
    os.environ["HUGGINGFACE_API_TOKEN"] = hf_bearer_token()
bearer_token = os.environ["HUGGINGFACE_API_TOKEN"]

# Endpoint
ep: InferenceEndpoint = local_utils.hf.inference.llama_31_8B_Instruct(wait=True)
llm = HuggingFaceEndpoint(
    endpoint_url=ep.url,  # + chat_completions
    huggingfacehub_api_token=bearer_token,
    max_new_tokens=8192,
    repetition_penalty=1.2,
    streaming=True,
    task="text-generation",
)

# Point both clients at the chat completions route
chat_completions = "/v1/chat/completions"
llm.client.model += chat_completions
llm.async_client.model += chat_completions

# bind_tools is the only path through which max_tokens reaches the client
chat_model = ChatHuggingFace(llm=llm, verbose=True).bind_tools(tools=[], max_tokens=8192)

# Test chat data
system_msg = SystemMessage("You are a senior level Visual Foxpro programmer assisting interns.")
user_msg = HumanMessage("Write the game snake using Visual Foxpro 9 for a DOS terminal.")
input_message = [system_msg, user_msg]

# Invoke
output_msg = chat_model.invoke(input_message)

# Display result
md_text = output_msg.content
display_markdown(md_text, raw=True)
display_markdown("**Finish Reason**: " + output_msg.response_metadata["finish_reason"], raw=True)
```
The issue title and description need updating to reflect that `ChatHuggingFace` ignores the parameters set during LLM configuration.
Why isn't there a way to set these dynamic parameters on a per-call basis? Or are we supposed to use `bind_tools` for each invocation, setting the values each time and then discarding the chat object (as sketched below)?
-Mike
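For illustration, the per-call pattern I mean would look something like this (an untested sketch reusing `llm` and `input_message` from the snippet above):

```python
def invoke_with_limit(llm, messages, max_tokens):
    # Rebind max_tokens for every invocation and discard the bound
    # runnable afterwards, since bind_tools is the only place where
    # the value actually reaches the InferenceClient.
    chat = ChatHuggingFace(llm=llm).bind_tools(tools=[], max_tokens=max_tokens)
    return chat.invoke(messages)

short_reply = invoke_with_limit(llm, input_message, max_tokens=256)
long_reply = invoke_with_limit(llm, input_message, max_tokens=8192)
```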
Following this, as I am experiencing the same problem.
### Checked other resources

### Example Code
```python
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_community.callbacks import get_openai_callback

# repo_id, HUGGINGFACE_API_KEY and prompt are defined elsewhere
llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    temperature=0.01,
    max_new_tokens=2048,
    huggingfacehub_api_token=HUGGINGFACE_API_KEY,
)
llm = ChatHuggingFace(llm=llm)

messages = [
    (
        "system",
        "You are a smart AI that understand the tabular data structure.",
    ),
    ("user", f"{prompt}"),
]

with get_openai_callback() as cb:
    response = llm.invoke(messages)
    print(cb)
    if not isinstance(response, str):
        response = response.content

print(response)
```
### Error Message and Stack Trace (if applicable)
```
Tokens Used: 1668
    Prompt Tokens: 1568
    Completion Tokens: 100
Successful Requests: 1
```
### Description
I am trying to use the `Mistral-7B` model from Hugging Face. While I am using `HuggingFaceEndpoint` I am getting the expected answer, but while using `ChatHuggingFace` I am always getting only 100 tokens. I have gone through the existing issues but couldn't find a solution yet.

### System Info
System Information
Package Information