Support autonomous vision input for Claude✹

Implement functionality for Claude to autonomously determine when to capture images (e.g. from a camera) based on user requests. Enhanced the agent's ability to handle multimodal inputs for improved user interaction.

Also support Function Calling. But currently, only the first content in the response is processed. Ensure that the prompt controls for content to include only tool_use.

uezo / ChatdollKit

Support autonomous vision input for Claude✹ #304