Open p0deje opened 4 days ago
@p0deje Thanks for raising this issue, it is great that you included the code sample. I can reproduce your issue, but need more data to decide if this is a problem with the ChatBedrock implementation, Bedrock service, or meta model itself. Can you try the code with the image directly with Bedrock converse API and share your results?
@3coins I've tested using boto3 and it seems to work fine with system prompt + image + tools - https://gist.github.com/p0deje/aaae813ceaf2bf506c75f1cf551a921e
$ python boto3.py
[{'toolUse': {'toolUseId': 'tooluse_3YBWfQl9ROyoppF0tB0drA', 'name': 'response', 'input': {'result': 'True'}}}]
However, depending on the image, it was sometimes getting different results and sometimes there were not tools called. For example, if you replace image URL (line 13) to "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQmqSrv025igdJmWN7lK3h3fFPwuW76FO_nlA&s" and image format to jpeg
(line 27), the same code suddenly starts to produce a response w/ tools:
$ python boto3.py
[{'text': 'The prompt is asking whether the statement "2+2=4" is true or false. To answer this, we need to evaluate the expression "2+2" and compare it with 4.\n\nThe correct function call for this prompt would be:\n\n{"name": "response", "parameters": {"result": true}}\n\nThis function call indicates that the result of the expression "2+2" is indeed equal to 4, which is a true statement.'}]
I played more with different images, both png and jpeg and some of them work fine while others are consistently failing to produce tool output. I don't see a clear pattern there.
Given the issue reproduces in boto3, would you advise I raise the issue there or some other place? Also, would it be possible to work around this issue in langchain-aws considering the tool JSON is still present in the text response?
I'm not sure if it's an issue with langchain-aws or boto3, but when llama-3.2-vision is used with a system prompt, a structured output and an image payload, the structured output fails to work and the JSON is returned in message contents. When at least one of the pieces is omitted (e.g. no system prompt or no image) - the structured output works perfectly fine.
To reproduce, here is the sample script - https://gist.github.com/p0deje/23231cd28ed61f1acf30fce07cbf16cd