langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Load LLM (Mixtral 8x22B) from Azure AI endpoint as Langchain Model - BaseMessage instead of AIMessage #23899

Open weissenbacherpwc opened 1 month ago

weissenbacherpwc commented 1 month ago


Example Code

from langchain_community.chat_models.azureml_endpoint import AzureMLChatOnlineEndpoint
from langchain_community.llms.azureml_endpoint import ContentFormatterBase
from langchain_community.chat_models.azureml_endpoint import (
    AzureMLEndpointApiType,
    CustomOpenAIChatContentFormatter,
)
from langchain_core.messages import HumanMessage

chat = AzureMLChatOnlineEndpoint(
    endpoint_url="https://llm-host-westeurope-mx8x22bi.westeurope.inference.ml.azure.com/score",
    endpoint_api_type=AzureMLEndpointApiType.dedicated,
    endpoint_api_key="xY1BWYshxYJhQGZE6P7Uc1of34BW9b5t",
    content_formatter=CustomOpenAIChatContentFormatter(),
)
response = chat.invoke(
    [HumanMessage(content="Hallo")],max_tokens=512
)
response

Error Message and Stack Trace (if applicable)

I think I have set up the right deployment type. See here the full trace:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/lib/python3.11/site-packages/langchain_community/chat_models/azureml_endpoint.py:140, in CustomOpenAIChatContentFormatter.format_response_payload(self, output, api_type)
    139 try:
--> 140     choice = json.loads(output)["output"]
    141 except (KeyError, IndexError, TypeError) as e:

KeyError: 'output'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/Users/mweissenba001/Documents/GitHub/fastapi_rag_demo/test.ipynb Cell 4 line 8
      5 prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])
      7 chain = prompt | chat
----> 8 chain.invoke({"text": "Explain the importance of low latency for LLMs."})

File ~/anaconda3/lib/python3.11/site-packages/langchain_core/runnables/base.py:2507, in RunnableSequence.invoke(self, input, config, **kwargs)
   2505             input = step.invoke(input, config, **kwargs)
   2506         else:
-> 2507             input = step.invoke(input, config)
   2508 # finish the root run
   2509 except BaseException as e:

File ~/anaconda3/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py:248, in BaseChatModel.invoke(self, input, config, stop, **kwargs)
    237 def invoke(
    238     self,
    239     input: LanguageModelInput,
   (...)
    243     **kwargs: Any,
    244 ) -> BaseMessage:
    245     config = ensure_config(config)
    246     return cast(
    247         ChatGeneration,
--> 248         self.generate_prompt(
    249             [self._convert_input(input)],
    250             stop=stop,
    251             callbacks=config.get("callbacks"),
    252             tags=config.get("tags"),
    253             metadata=config.get("metadata"),
    254             run_name=config.get("run_name"),
    255             run_id=config.pop("run_id", None),
    256             **kwargs,
    257         ).generations[0][0],
    258     ).message

File ~/anaconda3/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py:677, in BaseChatModel.generate_prompt(self, prompts, stop, callbacks, **kwargs)
    669 def generate_prompt(
    670     self,
    671     prompts: List[PromptValue],
   (...)
    674     **kwargs: Any,
    675 ) -> LLMResult:
    676     prompt_messages = [p.to_messages() for p in prompts]
--> 677     return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)

File ~/anaconda3/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py:534, in BaseChatModel.generate(self, messages, stop, callbacks, tags, metadata, run_name, run_id, **kwargs)
    532         if run_managers:
    533             run_managers[i].on_llm_error(e, response=LLMResult(generations=[]))
--> 534         raise e
    535 flattened_outputs = [
    536     LLMResult(generations=[res.generations], llm_output=res.llm_output)  # type: ignore[list-item]
    537     for res in results
    538 ]
    539 llm_output = self._combine_llm_outputs([res.llm_output for res in results])

File ~/anaconda3/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py:524, in BaseChatModel.generate(self, messages, stop, callbacks, tags, metadata, run_name, run_id, **kwargs)
    521 for i, m in enumerate(messages):
    522     try:
    523         results.append(
--> 524             self._generate_with_cache(
    525                 m,
    526                 stop=stop,
    527                 run_manager=run_managers[i] if run_managers else None,
    528                 **kwargs,
    529             )
    530         )
    531     except BaseException as e:
    532         if run_managers:

File ~/anaconda3/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py:749, in BaseChatModel._generate_with_cache(self, messages, stop, run_manager, **kwargs)
    747 else:
    748     if inspect.signature(self._generate).parameters.get("run_manager"):
--> 749         result = self._generate(
    750             messages, stop=stop, run_manager=run_manager, **kwargs
    751         )
    752     else:
    753         result = self._generate(messages, stop=stop, **kwargs)

File ~/anaconda3/lib/python3.11/site-packages/langchain_community/chat_models/azureml_endpoint.py:279, in AzureMLChatOnlineEndpoint._generate(self, messages, stop, run_manager, **kwargs)
    273 request_payload = self.content_formatter.format_messages_request_payload(
    274     messages, _model_kwargs, self.endpoint_api_type
    275 )
    276 response_payload = self.http_client.call(
    277     body=request_payload, run_manager=run_manager
    278 )
--> 279 generations = self.content_formatter.format_response_payload(
    280     response_payload, self.endpoint_api_type
    281 )
    282 return ChatResult(generations=[generations])

File ~/anaconda3/lib/python3.11/site-packages/langchain_community/chat_models/azureml_endpoint.py:142, in CustomOpenAIChatContentFormatter.format_response_payload(self, output, api_type)
    140         choice = json.loads(output)["output"]
    141     except (KeyError, IndexError, TypeError) as e:
--> 142         raise ValueError(self.format_error_msg.format(api_type=api_type)) from e
    143     return ChatGeneration(
    144         message=BaseMessage(
    145             content=choice.strip(),
   (...)
    148         generation_info=None,
    149     )
    150 if api_type == AzureMLEndpointApiType.serverless:

ValueError: Error while formatting response payload for chat model of type  `AzureMLEndpointApiType.dedicated`. Are you using the right formatter for the deployed  model and endpoint type?

Description

Hi,

I set up Mixtral 8x22B on Azure AI/Machine Learning and now want to use it with Langchain. I am having difficulties with the response format I am getting. For example, a ChatOpenAI response looks like this:

from langchain_openai import ChatOpenAI
llmm = ChatOpenAI()
llmm.invoke("Hallo")

AIMessage(content='Hallo! Wie kann ich Ihnen helfen?', response_metadata={'token_usage': {'completion_tokens': 8, 'prompt_tokens': 8, 'total_tokens': 16}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='r')

This is how it looks when I am loading Mixtral 8x22B with AzureMLChatOnlineEndpoint:

from langchain_community.chat_models.azureml_endpoint import AzureMLChatOnlineEndpoint

from langchain_community.chat_models.azureml_endpoint import (
    AzureMLEndpointApiType,
    CustomOpenAIChatContentFormatter,
)
from langchain_core.messages import HumanMessage

chat = AzureMLChatOnlineEndpoint(
    endpoint_url="...",
    endpoint_api_type=AzureMLEndpointApiType.dedicated,
    endpoint_api_key="...",
    content_formatter=CustomOpenAIChatContentFormatter(),
)

chat.invoke("Hallo")

BaseMessage(content='Hallo, ich bin ein deutscher Sprachassistent. Was kann ich für', type='assistant', id='run-23')

So with the Mixtral model the output has a different format (BaseMessage vs. AIMessage). How can I change this so it works just like a ChatOpenAI model?
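As a stopgap I considered simply coercing the returned message into an AIMessage with a small runnable (untested sketch, the to_ai_message helper is my own), so downstream code sees the same type as with ChatOpenAI:

from langchain_core.messages import AIMessage
from langchain_core.runnables import RunnableLambda

# Untested sketch: wrap whatever BaseMessage the endpoint returns in an AIMessage
to_ai_message = RunnableLambda(lambda msg: AIMessage(content=msg.content))

# e.g.
# wrapped_chat = chat | to_ai_message
# wrapped_chat.invoke([HumanMessage(content="Hallo")])

But that only papers over the message type; ideally the endpoint class itself should return an AIMessage.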

I also tried using it in a chain with a ChatPromptTemplate, without success:

from langchain_core.prompts import ChatPromptTemplate

system = "You are a helpful assistant called Bot."
human = "{text}"
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])

chain = prompt | chat
chain.invoke({"text": "Who are you?"})

This results in KeyError: 'output' and ValueError: Error while formatting response payload for chat model of type `AzureMLEndpointApiType.dedicated`. Are you using the right formatter for the deployed model and endpoint type? See the full trace above.

In my application I want to easily switch between these two models.
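For context, this is roughly how I would like to switch between the two backends (sketch only; the endpoint values below are placeholders):

from langchain_openai import ChatOpenAI
from langchain_community.chat_models.azureml_endpoint import (
    AzureMLChatOnlineEndpoint,
    AzureMLEndpointApiType,
    CustomOpenAIChatContentFormatter,
)

def get_chat_model(provider: str):
    # Placeholder factory: pick the backend by name
    if provider == "openai":
        return ChatOpenAI()
    return AzureMLChatOnlineEndpoint(
        endpoint_url="<azure-ml-endpoint-url>",  # placeholder
        endpoint_api_type=AzureMLEndpointApiType.dedicated,
        endpoint_api_key="<api-key>",  # placeholder
        content_formatter=CustomOpenAIChatContentFormatter(),
    )

That only works cleanly if both backends return the same message type.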

Thanks in advance!

System Info

langchain                   0.2.6    pypi_0    pypi
langchain-chroma            0.1.0    pypi_0    pypi
langchain-community         0.2.6    pypi_0    pypi
langchain-core              0.2.10   pypi_0    pypi
langchain-experimental      0.0.49   pypi_0    pypi
langchain-groq              0.1.5    pypi_0    pypi
langchain-openai            0.1.7    pypi_0    pypi
langchain-postgres          0.0.3    pypi_0    pypi
langchain-text-splitters    0.2.1

jacoblee93 commented 1 month ago

Hey @weissenbacherpwc, I've opened a PR.

Could you try my branch out and let me know if it fixes the issue?

pip install "git+https://github.com/langchain-ai/langchain.git@jacob/azure#subdirectory=libs/community"

weissenbacherpwc commented 1 month ago

Hi @jacoblee93, I tried installing your branch. It works now in that the response is returned as an AIMessage instead of a BaseMessage. However, when using it in an LCEL chain or an LLMChain, the same error as described above occurs.

I tried it with AzureMLOnlineEndpoint and AzureMLChatOnlineEndpoint without success.

jacoblee93 commented 1 month ago

It looks like there's an exported MistralChatContentFormatter - could you try instantiating and passing in that one?

https://api.python.langchain.com/en/latest/chat_models/langchain_community.chat_models.azureml_endpoint.MistralChatContentFormatter.html#

https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/chat_models/azureml_endpoint.py#L187

weissenbacherpwc commented 1 month ago

Tried it out, thanks! However, it is still not working. Here is the code:

from langchain_community.chat_models.azureml_endpoint import AzureMLChatOnlineEndpoint
from langchain_community.llms.azureml_endpoint import ContentFormatterBase
from langchain_community.chat_models.azureml_endpoint import (
    AzureMLEndpointApiType,
    CustomOpenAIChatContentFormatter,
    MistralChatContentFormatter
)
from langchain_core.messages import HumanMessage

chat = AzureMLChatOnlineEndpoint(
    endpoint_url="https://llm-host-westeurope-oqelx.westeurope.inference.ml.azure.com/score",
    endpoint_api_type=AzureMLEndpointApiType.dedicated,
    endpoint_api_key="",
    content_formatter=MistralChatContentFormatter(),
    #content_formatter=CustomOpenAIChatContentFormatter()
)
# prints UserWarning: `LlamaChatContentFormatter` will be deprecated in the future.
# Please use `CustomOpenAIChatContentFormatter` instead.
response = chat.invoke(
    [HumanMessage(content="Hallo, whats your name?")],max_tokens=3000
)
response

Here it already fails when invoking the LLM, which worked before with the CustomOpenAIChatContentFormatter:

ValueError: `api_type` AzureMLEndpointApiType.dedicated is not supported by this formatter

weissenbacherpwc commented 1 month ago

@jacoblee93 I might have found a solution to this. I added this code to the class MistralChatContentFormatter(LlamaChatContentFormatter) (from line 187 of azureml_endpoint.py):

        elif api_type == AzureMLEndpointApiType.dedicated:
            request_payload = json.dumps(
                {
                    "input_data": {
                        "input_string": chat_messages,
                        "parameters": model_kwargs,
                    }
                }
            )

Here is the full class:

class MistralChatContentFormatter(LlamaChatContentFormatter):
    """Content formatter for `Mistral`."""

    def format_messages_request_payload(
        self,
        messages: List[BaseMessage],
        model_kwargs: Dict,
        api_type: AzureMLEndpointApiType,
    ) -> bytes:
        """Formats the request according to the chosen api"""
        chat_messages = [self._convert_message_to_dict(message) for message in messages]

        if chat_messages and chat_messages[0]["role"] == "system":
            # Mistral OSS models do not explicitly support system prompts, so we have to
            # stash in the first user prompt
            chat_messages[1]["content"] = (
                chat_messages[0]["content"] + "\n\n" + chat_messages[1]["content"]
            )
            del chat_messages[0]

        if api_type == AzureMLEndpointApiType.realtime:
            request_payload = json.dumps(
                {
                    "input_data": {
                        "input_string": chat_messages,
                        "parameters": model_kwargs,
                    }
                }
            )
        elif api_type == AzureMLEndpointApiType.serverless:
            request_payload = json.dumps({"messages": chat_messages, **model_kwargs})
        elif api_type == AzureMLEndpointApiType.dedicated:
            request_payload = json.dumps(
                {
                    "input_data": {
                        "input_string": chat_messages,
                        "parameters": model_kwargs,
                    }
                }
            )
        else:
            raise ValueError(
                f"`api_type` {api_type} is not supported by this formatter"
            )
        return str.encode(request_payload)

With this, I can use the LLM in a chain and give the LLM a system prompt.
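For example, the chain from above now runs with a system prompt (same code as before, just with the patched formatter in place):

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [("system", "You are a helpful assistant called Bot."), ("human", "{text}")]
)
chain = prompt | chat  # chat uses the patched MistralChatContentFormatter
chain.invoke({"text": "Who are you?"})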

Edit: but with this, streaming the LLM output in Langchain is not working:

chunks=[]
for chunk in llm.stream("hello. tell me something about yourself"):
    chunks.append(chunk)
    print(chunk.content, end="|", flush=True)

Results in:

APIStatusError                            Traceback (most recent call last)
/Users/mweissenba001/Documents/GitHub/fastapi_rag_demo/test.ipynb Cell 16 line 2
      1 chunks=[]
----> 2 for chunk in llm.stream("hello. tell me something about yourself"):
      3     chunks.append(chunk)
      4     print(chunk.content, end="|", flush=True)

File ~/anaconda3/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py:375, in BaseChatModel.stream(self, input, config, stop, **kwargs)
    368 except BaseException as e:
    369     run_manager.on_llm_error(
    370         e,
    371         response=LLMResult(
    372             generations=[[generation]] if generation else []
    373         ),
    374     )
--> 375     raise e
    376 else:
    377     run_manager.on_llm_end(LLMResult(generations=[[generation]]))

File ~/anaconda3/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py:355, in BaseChatModel.stream(self, input, config, stop, **kwargs)
    353 generation: Optional[ChatGenerationChunk] = None
    354 try:
--> 355     for chunk in self._stream(messages, stop=stop, **kwargs):
    356         if chunk.message.id is None:
...
   (...)
   1027     stream_cls=stream_cls,
   1028 )

APIStatusError: Error code: 424 - {'detail': 'Not Found'}