Llama 3.2 with system prompt and image payload doesn't generated structured output

p0deje commented 4 days ago

I'm not sure if it's an issue with langchain-aws or boto3, but when llama-3.2-vision is used with a system prompt, a structured output and an image payload, the structured output fails to work and the JSON is returned in message contents. When at least one of the pieces is omitted (e.g. no system prompt or no image) - the structured output works perfectly fine.

To reproduce, here is the sample script - https://gist.github.com/p0deje/23231cd28ed61f1acf30fce07cbf16cd

$ python test.py
=========================
Works fine when used without image!
[{'type': 'tool_use', 'name': 'Response', 'input': {'result': 'true'}, 'id': 'tooluse_JuG13WrfQ8qEopRpFiY9GA'}]
result=True
=========================

=========================
Does not work with image:
Sure, here is a JSON for a function call with its proper arguments that best answers the given prompt:

{
    "name": "Response",
    "parameters": {
        "result": true
    }
}
None
=========================

3coins commented 3 days ago

@p0deje Thanks for raising this issue, it is great that you included the code sample. I can reproduce your issue, but need more data to decide if this is a problem with the ChatBedrock implementation, Bedrock service, or meta model itself. Can you try the code with the image directly with Bedrock converse API and share your results?

p0deje commented 17 hours ago

@3coins I've tested using boto3 and it seems to work fine with system prompt + image + tools - https://gist.github.com/p0deje/aaae813ceaf2bf506c75f1cf551a921e

$ python boto3.py
[{'toolUse': {'toolUseId': 'tooluse_3YBWfQl9ROyoppF0tB0drA', 'name': 'response', 'input': {'result': 'True'}}}]

However, depending on the image, it was sometimes getting different results and sometimes there were not tools called. For example, if you replace image URL (line 13) to "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQmqSrv025igdJmWN7lK3h3fFPwuW76FO_nlA&s" and image format to jpeg (line 27), the same code suddenly starts to produce a response w/ tools:

$ python boto3.py
[{'text': 'The prompt is asking whether the statement "2+2=4" is true or false. To answer this, we need to evaluate the expression "2+2" and compare it with 4.\n\nThe correct function call for this prompt would be:\n\n{"name": "response", "parameters": {"result": true}}\n\nThis function call indicates that the result of the expression "2+2" is indeed equal to 4, which is a true statement.'}]

I played more with different images, both png and jpeg and some of them work fine while others are consistently failing to produce tool output. I don't see a clear pattern there.

Given the issue reproduces in boto3, would you advise I raise the issue there or some other place? Also, would it be possible to work around this issue in langchain-aws considering the tool JSON is still present in the text response?

langchain-ai / langchain-aws

Llama 3.2 with system prompt and image payload doesn't generated structured output #285