langchain-ai / langchain

šŸ¦œšŸ”— Build context-aware reasoning applications
https://python.langchain.com

`HuggingFaceEndpoint` does not raise exceptions when API call fails due to token counts. #26525

Open michael-newsrx opened 1 month ago

michael-newsrx commented 1 month ago

Checked other resources

Example Code

```python
from dataclasses import dataclass

from huggingface_hub import InferenceEndpoint
from langchain_core.language_models.llms import LLM
from langchain_huggingface import HuggingFaceEndpoint

import local_utils  # reporter's own helper module, as is hf_bearer_token()

bearer_token = hf_bearer_token()

# Get the HuggingFace API URL
ep: InferenceEndpoint = local_utils.hf.inference.llama_31_8B_Instruct(wait=True)

# Create the langchain endpoint
llm = HuggingFaceEndpoint(
    # repo_id=ep.repository,
    endpoint_url=ep.url,  # + chat_completions
    task="text-generation",
    huggingfacehub_api_token=bearer_token,
)

# Bind parameters
llm = llm.bind(max_tokens=8192, temperature=None)  # .with_retry(stop_after_attempt=99)


# This is a utility class for conversing with the Llama 3 model.
# Constrained output via regex or json is broken, so this approach is used instead.
@dataclass
class LangChainRawChat:
    llm: LLM
    text: str

    def __init__(self, llm: LLM):
        self.llm = llm
        self.text = "<|begin_of_text|>"

    def system(self, content: str, *, role="system") -> None:
        self.text += f"<|start_header_id|>{role}<|end_header_id|>\n"
        self.text += content
        self.text += "<|eot_id|>"

    def user(self, content: str, *, role="user") -> None:
        self.text += f"<|start_header_id|>{role}<|end_header_id|>\n"
        self.text += content
        self.text += "<|eot_id|>"

    def assistant(self, content: str | None = None, *, role="assistant", temperature=0.0) -> str:
        self.text += f"<|start_header_id|>{role}<|end_header_id|>\n"
        if content is not None:
            self.text += content
        _temperature = temperature if temperature > 0.0 else None
        output = self.llm.invoke(self.text, max_tokens=8192, temperature=_temperature,
                                 stop_sequence="<|eot_id|>")
        self.text += output
        self.text += "<|eot_id|>"
        return (content or "") + output  # content may be None; avoid a TypeError


# Sample code to run one "chat" session. system_prompt, prompt, and format
# are defined elsewhere in the reporter's code.
src_article = """An article whose length along with prompt and output format instructions exceed the token limit"""

job_chat: LangChainRawChat = LangChainRawChat(llm)
job_chat.system(system_prompt)
job_chat.user(prompt.format(format=format, text=src_article))
response = job_chat.assistant("# Analysis\n\n## Entities and concepts\n\n")
```

Error Message and Stack Trace (if applicable)

The `HuggingFaceEndpoint` call fails silently; no exception is raised.

The API endpoint's logs show:

`inputs` tokens + `max_new_tokens` must be <= 16384. Given: 20016 `inputs` tokens and 512 `max_new_tokens`

The error message also implies that the token limit bound on the LLM is being ignored in the actual API requests: the call went out with the default of 512 `max_new_tokens` rather than the bound value of 8192.
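
For comparison, calling the same endpoint directly through `huggingface_hub` does surface the failure. A minimal sketch, reusing `ep`, `bearer_token`, and `job_chat` from the repro above; catching `HfHubHTTPError` is my assumption about how the 422 arrives:

```python
from huggingface_hub import InferenceClient
from huggingface_hub.utils import HfHubHTTPError

# Same endpoint and token as the LangChain wrapper above.
client = InferenceClient(model=ep.url, token=bearer_token)

try:
    # Send the same oversized prompt the wrapper swallows silently.
    client.text_generation(job_chat.text, max_new_tokens=8192)
except HfHubHTTPError as err:
    # The TGI validation error ("`inputs` tokens + `max_new_tokens` must be
    # <= 16384 ...") surfaces here instead of being dropped.
    print(f"Endpoint rejected the request: {err}")
```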

Description

I'm trying to use the langchain library to interface with Hugging Face Dedicated Endpoints.
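
Until the wrapper raises on its own, a pre-flight token count can fail fast on oversized prompts. A sketch under two assumptions: the endpoint serves Llama 3.1 8B Instruct (so its tokenizer gives an accurate count), and the 16384 limit from the log above applies:

```python
from transformers import AutoTokenizer

# Assumed model id; swap in whatever the endpoint actually serves.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

MAX_TOTAL_TOKENS = 16384  # limit reported in the endpoint's log
MAX_NEW_TOKENS = 8192     # generation budget used in the repro


def check_token_budget(prompt: str) -> None:
    """Raise before the API call instead of failing silently after it."""
    n_input = len(tokenizer.encode(prompt))
    if n_input + MAX_NEW_TOKENS > MAX_TOTAL_TOKENS:
        raise ValueError(
            f"{n_input} input tokens + {MAX_NEW_TOKENS} max_new_tokens "
            f"exceeds the endpoint limit of {MAX_TOTAL_TOKENS}"
        )


check_token_budget(job_chat.text)
```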

System Info

System Information

OS: Linux
OS Version: #44-Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 13 13:35:26 UTC 2024
Python Version: 3.12.5 | packaged by conda-forge | (main, Aug 8 2024, 18:36:51) [GCC 12.4.0]

Package Information

langchain_core: 0.2.38
langchain: 0.2.16
langchain_community: 0.2.16
langsmith: 0.1.117
langchain_cli: 0.0.30
langchain_huggingface: 0.0.3
langchain_llm: 0.4.15
langchain_openai: 0.1.23
langchain_text_splitters: 0.2.4
langgraph: 0.2.19
langserve: 0.2.3

Other Dependencies

accelerate: 0.34.2
aiohttp: 3.10.5
async-timeout: Installed. No version info available.
cpm_kernels: 1.0.11
dataclasses-json: 0.6.7
einops: 0.8.0
fastapi: 0.114.0
gitpython: 3.1.43
httpx: 0.27.2
huggingface-hub: 0.24.6
jsonpatch: 1.33
langgraph-checkpoint: 1.0.9
langserve[all]: Installed. No version info available.
libcst: 1.4.0
loguru: 0.7.2
numpy: 1.26.4

efriis commented 1 month ago

Hey there! I think the param you want to populate is `max_new_tokens`, not `max_tokens`.

The warning also suggests you're passing different input text than what's declared in your repro code.

@Jofthomas thoughts on the error handling here? Is the integration catching an error it shouldn't be?
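
If that diagnosis is right, the caller-side fix would be a sketch like the one below. Hedged: whether `bind(max_tokens=...)` is silently dropped is exactly what this issue asks; the 512 in the endpoint log does match `HuggingFaceEndpoint`'s default `max_new_tokens`:

```python
# Pass the HF-native parameter name so the limit reaches the request payload.
llm = HuggingFaceEndpoint(
    endpoint_url=ep.url,
    task="text-generation",
    huggingfacehub_api_token=bearer_token,
    max_new_tokens=8192,  # HuggingFaceEndpoint field; defaults to 512
)

# Or bind it at call time, in place of the OpenAI-style max_tokens:
llm = llm.bind(max_new_tokens=8192, temperature=None)
```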