uezo / ChatdollKit

ChatdollKit enables you to make your 3D model into a chatbot
Apache License 2.0

Support autonomous vision input for Gemini✨ #302

Closed uezo closed 3 weeks ago

uezo commented 3 weeks ago

Implement functionality that lets Gemini decide autonomously when to capture an image (e.g. from a camera) based on the user's request. This improves the agent's handling of multimodal input.

Also improve the handling of streaming chunks.

GoogleForJapan

uezo commented 3 weeks ago

Add the SimpleCamera prefab to the scene and assign it to a field of your script (simpleCamera in the example code below).

Include a system instruction like the one below:

## Using Vision

If you need an image to process a user's request, you can obtain it using the following methods:

- camera
- screenshot

If an image is needed to process the request, add an instruction like [vision:camera] to your response to request an image from the user.

By adding this instruction, the user will provide an image in their next utterance. No comments about the image itself are necessary.

Example:

user: Look! This is the picture I painted.
assistant: [vision:camera] Let me take a look.
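
To illustrate how a tag like [vision:camera] in the assistant's response could be detected, here is a minimal sketch. It is not ChatdollKit's actual implementation: VisionTagParser and TryExtractSource are hypothetical names introduced only for this example, and the tag format is assumed from the system instruction above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of ChatdollKit.
public static class VisionTagParser
{
    // Matches tags such as [vision:camera] or [vision:screenshot].
    private static readonly Regex VisionTag =
        new Regex(@"\[vision:(\w+)\]", RegexOptions.Compiled);

    // Returns true and sets `source` (e.g. "camera") when the response
    // contains a vision tag; otherwise returns false and sets `source` to null.
    public static bool TryExtractSource(string responseText, out string source)
    {
        var match = VisionTag.Match(responseText ?? string.Empty);
        source = match.Success ? match.Groups[1].Value : null;
        return match.Success;
    }
}
```

With the example response above, TryExtractSource("[vision:camera] Let me take a look.", out var source) returns true and sets source to "camera". A value extracted this way is the kind of string that could be passed as the source argument of a capture handler.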

Then implement a CaptureImage handler and assign it to GeminiService:

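// Returns the captured image as bytes, or null if no camera is set or capture fails.
// Note: this example ignores `source` and always captures from simpleCamera.
private async UniTask<byte[]> CaptureImageAsync(string source)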
{
    if (simpleCamera != null)
    {
        try
        {
            return await simpleCamera.CaptureImageAsync();
        }
        catch (Exception ex)
        {
            Debug.LogError($"Error at CaptureImageAsync: {ex.Message}\n{ex.StackTrace}");
        }
    }

    return null;
}

// Register the handler once during initialization (for example, in Start()).
gameObject.GetComponent<GeminiService>().CaptureImage = CaptureImageAsync;