reconsumeralization / AutoGem

Agent Framework for Gemini Pro
Apache License 2.0
2 stars 0 forks source link

Sweep: Research the google Gemini pro models api and google gemini vision pro models api and create a client to work with and interprit our api clls to the model correctly. #4

Closed reconsumeralization closed 10 months ago

reconsumeralization commented 10 months ago

Details

Sweep: Research the google Gemini pro models api and google gemini vision pro models api and create a client to work with and interprit our api clls to the model correctly. Google Gemini Pro Models API

Overview

The Google Gemini Pro Models API is a cloud-based API that provides access to a variety of pre-trained Gemini models, including models for tasks such as image classification, object detection, and text classification. The API is designed to be easy to use, even for developers who don't have experience with deep learning.

Getting Started

To get started with the Google Gemini Pro Models API, you'll need to:

  1. Create a Google Cloud Platform project.
  2. Enable the Gemini Pro Models API.
  3. Install the Gemini Pro Models API client library.
  4. Create an instance of the GeminiProModelsServiceClient.

Code Sample

The following code sample shows you how to use the Google Gemini Pro Models API to classify an image:

from google.cloud import gemini_pro_models

# Create a client.
client = gemini_pro_models.GeminiProModelsServiceClient()

# Set the name of the model you want to use.
model_name = "projects/[PROJECT_ID]/locations/[LOCATION]/models/[MODEL_ID]"

# Set the path to the image file you want to classify.
image_file_path = "path/to/image.jpg"

# Read the image file.
with open(image_file_path, "rb") as image_file:
    image_bytes = image_file.read()

# Create the request.
request = gemini_pro_models.PredictRequest(
    name=model_name, instances=[{"b64": image_bytes}]
)

# Make the request.
response = client.predict(request=request)

# Get the results.
results = response.predictions

# Print the results.
for result in results:
    print(result)

Google Gemini Vision Pro Models API

Overview

The Google Gemini Vision Pro Models API is a cloud-based API that provides access to a variety of pre-trained Gemini models for computer vision tasks, such as image classification, object detection, and image segmentation. The API is designed to be easy to use, even for developers who don't have experience with deep learning.

Getting Started

To get started with the Google Gemini Vision Pro Models API, you'll need to:

  1. Create a Google Cloud Platform project.
  2. Enable the Gemini Vision Pro Models API.
  3. Install the Gemini Vision Pro Models API client library.
  4. Create an instance of the GeminiVisionProModelsServiceClient.

Code Sample

The following code sample shows you how to use the Google Gemini Vision Pro Models API to classify an image:

from google.cloud import gemini_vision_pro_models

# Create a client.
client = gemini_vision_pro_models.GeminiVisionProModelsServiceClient()

# Set the name of the model you want to use.
model_name = "projects/[PROJECT_ID]/locations/[LOCATION]/models/[MODEL_ID]"

# Set the path to the image file you want to classify.
image_file_path = "path/to/image.jpg"

# Read the image file.
with open(image_file_path, "rb") as image_file:
    image_bytes = image_file.read()

# Create the request.
request = gemini_vision_pro_models.PredictRequest(
    name=model_name, instances=[{"b64": image_bytes}]
)

# Make the request.
response = client.predict(request=request)

# Get the results.
results = response.predictions

# Print the results.
for result in results:
    print(result)

Interpreting API Calls

The output of the Google Gemini Pro Models API and Google Gemini Vision Pro Models API is a list of Prediction objects. Each Prediction object contains a list of Label objects. Each Label object contains the following information:

You can use this information to interpret the results of your API call. For example, if you are using the API to classify an image, you can use the category field to determine the class of the image. You can use the confidence field to determine how confident the API is in its prediction. And you can use the bounding_box field to locate the object in the image.

Conclusion

The Google Gemini Pro Models API and Google Gemini Vision Pro Models API are powerful tools for developers who want to use deep learning models in their applications. The APIs are easy to use, even for developers who don't have experience with deep learning. from future import annotations

import base64 import os import pdb import random import re import time from io import BytesIO from typing import Any, Dict, List, Mapping, Union

import google.generativeai as genai import httpx import requests from google.ai.generativelanguage import Content, Part from google.api_core.exceptions import InternalServerError from google.generativeai import ChatSession from openai import OpenAI, _exceptions, resources from openai._qs import Querystring from openai._types import NOT_GIVEN, NotGiven, Omit, ProxiesTypes, RequestOptions, Timeout, Transport from openai.types.chat import ChatCompletion from openai.types.chat.chat_completion import ChatCompletionMessage, Choice from openai.types.completionusage import CompletionUsage from PIL import Image from proto.marshal.collections.repeated import RepeatedComposite from pydash import max from typing_extensions import Self, override

from autogen.agentchat.contrib.img_utils import _to_pil, get_image_data

from autogen.token_count_utils import count_token

class GeminiClient: """ summary _extendedsummary TODO: this Gemini implementation does not support the following features yet:

def oai_content_to_gemini_content(content: Union[str, List]) -> List: """Convert content from OAI format to Gemini format""" rst = [] if isinstance(content, str): rst.append(Part(text=content)) return rst

assert isinstance(content, list)

for msg in content:
    if isinstance(msg, dict):
        assert "type" in msg, f"Missing 'type' field in message: {msg}"
        if msg["type"] == "text":
            rst.append(Part(text=msg["text"]))
        elif msg["type"] == "image_url":
            b64_img = get_image_data(msg["image_url"]["url"])
            img = _to_pil(b64_img)
            rst.append(img)
        else:
            raise ValueError(f"Unsupported message type: {msg['type']}")
    else:
        raise ValueError(f"Unsupported message type: {type(msg)}")
return rst

def concat_parts(parts: List[Part]) -> List: """Concatenate parts with the same type. If two adjacent parts both have the "text" attribute, then it will be joined into one part. """ if not parts: return []

concatenated_parts = []
previous_part = parts[0]

for current_part in parts[1:]:
    if previous_part.text != "":
        previous_part.text += current_part.text
    else:
        concatenated_parts.append(previous_part)
        previous_part = current_part

if previous_part.text == "":
    previous_part.text = "empty"  # Empty content is not allowed.
concatenated_parts.append(previous_part)

# TODO: handle inline_data, function_call, function_response

return concatenated_parts

def oai_messages_to_gemini_messages(messages: list[Dict[str, Any]]) -> list[dict[str, Any]]: """Convert messages from OAI format to Gemini format. Make sure the "user" role and "model" role are interleaved. Also, make sure the last item is from the "user" role. """ prev_role = None rst = [] curr_parts = [] for i, message in enumerate(messages): parts = oai_content_to_gemini_content(message["content"]) role = "user" if message["role"] in ["user", "system"] else "model"

    if prev_role is None or role == prev_role:
        curr_parts += parts
    elif role != prev_role:
        rst.append(Content(parts=concat_parts(curr_parts), role=prev_role))
        curr_parts = parts
    prev_role = role

# handle the last message
rst.append(Content(parts=concat_parts(curr_parts), role=role))

# The Gemini is restrict on order of roles, such that
# 1. The messages should be interleaved between user and model.
# 2. The last message must be from the user role.
# We add a dummy message "continue" if the last role is not the user.
if rst[-1].role != "user":
    rst.append(Content(parts=oai_content_to_gemini_content("continue"), role="user"))

# TODO: as many LLM/LMM models are not as smart as OpenAI models, we need
# to discuss how to design GroupChat and API protocol to make sure different
# models can be used together with consistent behaviors.

return rst

def count_gemini_tokens(ans: Union[str, Dict, List], model_name: str) -> int:

ans is OAI format in oai_messages

raise NotImplementedError(
    "Gemini's count_tokens function is not implemented yet in Google's genai. Please revisit!"
)

if isinstance(ans, str):
    model = genai.GenerativeModel(model_name)
    return model.count_tokens(ans)  # Error occurs here!
elif isinstance(ans, dict):
    if "content" in ans:
        # Content dict
        return count_gemini_tokens(ans["content"], model_name)
    if "text" in ans:
        # Part dict
        return count_gemini_tokens(ans["text"], model_name)
    else:
        raise ValueError(f"Unsupported message type: {type(ans)}")
elif isinstance(ans, list):
    return sum([count_gemini_tokens(msg, model_name) for msg in ans])
else:
    raise ValueError(f"Unsupported message type: {type(ans)}")

def _to_pil(data: str) -> Image.Image: """ Converts a base64 encoded image data string to a PIL Image object. This function first decodes the base64 encoded string to bytes, then creates a BytesIO object from the bytes, and finally creates and returns a PIL Image object from the BytesIO object. Parameters: data (str): The base64 encoded image data string. Returns: Image.Image: The PIL Image object created from the input data. """ return Image.open(BytesIO(base64.b64decode(data)))

def get_image_data(image_file: str, use_b64=True) -> bytes: if image_file.startswith("http://") or image_file.startswith("https://"): response = requests.get(image_file) content = response.content elif re.match(r"data:image/(?:png|jpeg);base64,", image_file): return re.sub(r"data:image/(?:png|jpeg);base64,", "", image_file) else: image = Image.open(image_file).convert("RGB") buffered = BytesIO() image.save(buffered, format="PNG") content = buffered.getvalue()

if use_b64:
    return base64.b64encode(content).decode("utf-8")
else:
    return content
Checklist - [X] Create `src/gemini_client.py` ✓ https://github.com/reconsumeralization/AutoGem/commit/a983c3aa4c0849f7a1bb9ee52b08364c1f5c4a03 [Edit](https://github.com/reconsumeralization/AutoGem/edit/sweep/research_the_google_gemini_pro_models_ap/src/gemini_client.py) - [X] Running GitHub Actions for `src/gemini_client.py` ✓ [Edit](https://github.com/reconsumeralization/AutoGem/edit/sweep/research_the_google_gemini_pro_models_ap/src/gemini_client.py) - [X] Create `src/utils.py` ✓ https://github.com/reconsumeralization/AutoGem/commit/3f5bcc12d54c60b071439c37ca4f509d24477c41 [Edit](https://github.com/reconsumeralization/AutoGem/edit/sweep/research_the_google_gemini_pro_models_ap/src/utils.py) - [X] Running GitHub Actions for `src/utils.py` ✓ [Edit](https://github.com/reconsumeralization/AutoGem/edit/sweep/research_the_google_gemini_pro_models_ap/src/utils.py) - [X] Modify `README.md` ✓ https://github.com/reconsumeralization/AutoGem/commit/177d73e830d9f859ad278870ac20d3fd160ac923 [Edit](https://github.com/reconsumeralization/AutoGem/edit/sweep/research_the_google_gemini_pro_models_ap/README.md) - [X] Running GitHub Actions for `README.md` ✓ [Edit](https://github.com/reconsumeralization/AutoGem/edit/sweep/research_the_google_gemini_pro_models_ap/README.md)
sweep-ai[bot] commented 10 months ago

🚀 Here's the PR! #5

See Sweep's progress at the progress dashboard!
Sweep Basic Tier: I'm using GPT-4. You have 5 GPT-4 tickets left for the month and 3 for the day. (tracking ID: 53a52f465b)

For more GPT-4 tickets, visit our payment portal. For a one week free trial, try Sweep Pro (unlimited GPT-4 tickets).

[!TIP] I'll email you at reconsumeralization@gmail.com when I complete this pull request!


Actions (click)

GitHub Actions✓

Here are the GitHub Actions logs prior to making any changes:

Sandbox logs for fb2064a
Checking README.md for syntax errors... ✅ README.md has no syntax errors! 1/1 ✓
Checking README.md for syntax errors...
✅ README.md has no syntax errors!

Sandbox passed on the latest main, so sandbox checks will be enabled for this issue.


Step 1: 🔎 Searching

I found the following snippets in your repository. I will now analyze these snippets and come up with a plan.

Some code snippets I think are relevant in decreasing order of relevance (click to expand). If some file is missing from here, you can mention the path in the ticket description. https://github.com/reconsumeralization/AutoGem/blob/223b35693570c56b37ae0335af5f983cff270aa3/README.md#L1-L1

Step 2: ⌨️ Coding

Ran GitHub Actions for a983c3aa4c0849f7a1bb9ee52b08364c1f5c4a03:

Ran GitHub Actions for 3f5bcc12d54c60b071439c37ca4f509d24477c41:

--- 
+++ 
@@ -10,4 +10,59 @@
 python -m unittest discover tests
 ```

+### Using the `GeminiClient` Class
+
+The `GeminiClient` class provides an interface to the Google Gemini Pro Models API and Google Gemini Vision Pro Models API. Here's how you can use it:
+
+#### Setup
+
+Before using the `GeminiClient`, ensure you have installed the necessary dependencies:
+
+```shell
+pip install google-cloud-gemini-pro-models google-cloud-gemini-vision-pro-models
+```
+
+You must also configure your Google Cloud authentication by setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your service account key file:
+
+```shell
+export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account-file.json"
+```
+
+#### Initializing the Client
+
+To initialize the `GeminiClient`, provide your Google Cloud API key:
+
+```python
+from src.gemini_client import GeminiClient
+
+client = GeminiClient(api_key='YOUR_API_KEY')
+```
+
+#### Making Prediction Requests
+
+To make prediction requests, use the `predict_with_gemini_pro_models` or `predict_with_gemini_vision_pro_models` methods. Provide the model name and the path to the image you wish to classify:
+
+```python
+# Predict with Gemini Pro Models
+results = client.predict_with_gemini_pro_models('model_name', 'path/to/image.jpg')
+
+# Predict with Gemini Vision Pro Models
+results = client.predict_with_gemini_vision_pro_models('model_name', 'path/to/image.jpg')
+```
+
+#### Interpreting the Results
+
+The methods return a list of dictionaries, each representing a prediction result. Here's how to interpret these results:
+
+```python
+for result in results:
+    print(f"Category: {result['category']}, Confidence: {result['confidence']}, Bounding Box: {result['bounding_box']}")
+```
+
+### Known Issues and Limitations
+
+- The current implementation does not support streaming predictions.
+- Only prediction requests with single instances (images) are supported; batch predictions are not yet implemented.
+- The API may impose limits on the number of requests per minute or other usage restrictions.
+
 Adjust the command based on the project's programming language and chosen testing framework.

Ran GitHub Actions for 177d73e830d9f859ad278870ac20d3fd160ac923:


Step 3: 🔁 Code Review

I have finished reviewing the code for completeness. I did not find errors for sweep/research_the_google_gemini_pro_models_ap.


🎉 Latest improvements to Sweep:
  • New dashboard launched for real-time tracking of Sweep issues, covering all stages from search to coding.
  • Integration of OpenAI's latest Assistant API for more efficient and reliable code planning and editing, improving speed by 3x.
  • Use the GitHub issues extension for creating Sweep issues directly from your editor.

💡 To recreate the pull request edit the issue title or description. To tweak the pull request, leave a comment on the pull request.Something wrong? Let us know.

This is an automated message generated by Sweep AI.