Calling Gemini Pro Vision with video

jsalfity / task_decomposition

Task Decomposition project


Calling Gemini Pro Vision with video #14

Closed jsalfity closed 6 months ago

jsalfity commented 6 months ago

Gemini Pro Vision states it can support video and text input through the Python API. However, when calling it with a video, the server responds with:

400 * GenerateContentRequest.contents[0].parts[0].inline_data.mime_type: MIME type must be image/png, image/jpeg, image/webp, image/heic, or image/heif.

So... that means the python API can't accept videos?
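For reference, here is a minimal sketch of a client-side check that mirrors the list in the 400 error above, so an unsupported file can be caught before the request is sent. The helper names (`guess_mime_type`, `can_send_inline`) are hypothetical and not part of any Google SDK, and the set of types is copied only from the error message. Extension-based guessing may not recognise every type on every system.

```python
import mimetypes

# MIME types listed in the 400 error message quoted above.
INLINE_IMAGE_MIME_TYPES = {
    "image/png",
    "image/jpeg",
    "image/webp",
    "image/heic",
    "image/heif",
}


def guess_mime_type(path: str) -> str:
    """Guess a MIME type from the file extension (hypothetical helper)."""
    mime_type, _ = mimetypes.guess_type(path)
    # Fall back to a generic binary type when the extension is unknown.
    return mime_type or "application/octet-stream"


def can_send_inline(path: str) -> bool:
    """Return True if the guessed MIME type is in the error's allowed list."""
    return guess_mime_type(path) in INLINE_IMAGE_MIME_TYPES
```

With this check, a `.png` or `.jpg` frame passes, while an `.mp4` file is flagged before the API call is made.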

How is this different from the HTTP request? https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini#gemini-pro-vision

To reproduce, set llm_model: 'gemini-pro-vision' and use_video: True in the config. Then uncomment

https://github.com/jsalfity/task_decomposition/blob/1991fc4034ee06a20ac65110d79e724a876cf7e8/task_decomposition/utils/querying.py#L130-L144

and comment out https://github.com/jsalfity/task_decomposition/blob/1991fc4034ee06a20ac65110d79e724a876cf7e8/task_decomposition/utils/querying.py#L147-L154.

Then run python task_decomposition/analysis/query_LLM.py

bchen32 commented 6 months ago

The issue here is that the Google AI Studio API is not as feature-rich as the Vertex AI API, so we'll have to use the Vertex AI SDK, which means doing things in Google Cloud land. The basic rundown: make a Google Cloud project, enable the Vertex AI API in that project, and upload the relevant video file to GCS. Then open the project's Cloud Shell and install the Python SDK with pip install "google-cloud-aiplatform>=1.38". From there you can write a Python script that sends videos to Gemini through Vertex AI. Here's a sample:

import vertexai
from vertexai.generative_models import GenerativeModel, Part

def prompt_video(project_id: str, location: str) -> str:
    # Initialize Vertex AI
    vertexai.init(project=project_id, location=location)
    # Load the model
    multimodal_model = GenerativeModel("gemini-1.0-pro-vision")
    # Query the model
    response = multimodal_model.generate_content(
        [
            Part.from_uri("gs://cloud-samples-data/video/animals.mp4", mime_type="video/mp4"), # Video, needs to be uploaded to GCS
            "Explain what's happening in this video", # Prompt
        ]
    )
    print(response)
    # Return the generated text so the function matches its declared return type
    return response.text

prompt_video("august-apricot-415506", "us-central1") # probably different project_id
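The "upload the relevant video file to GCS" step can also be done from Python instead of the console. This is a rough sketch, assuming the google-cloud-storage package is installed and Application Default Credentials are available (as they are in Cloud Shell). The function names `gcs_uri` and `upload_video` are hypothetical, and the bucket and file names you pass in are your own.

```python
def gcs_uri(bucket_name: str, blob_name: str) -> str:
    """Build the gs:// URI that Part.from_uri takes for a GCS object."""
    return f"gs://{bucket_name}/{blob_name}"


def upload_video(local_path: str, bucket_name: str, blob_name: str) -> str:
    """Upload a local video to an existing bucket and return its gs:// URI."""
    # Imported here so gcs_uri works even without the package installed.
    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()  # picks up Application Default Credentials
    blob = client.bucket(bucket_name).blob(blob_name)
    # Assumes an MP4 file; change content_type for other video formats.
    blob.upload_from_filename(local_path, content_type="video/mp4")
    return gcs_uri(bucket_name, blob_name)
```

The returned string can be passed to Part.from_uri in place of the sample gs://cloud-samples-data/video/animals.mp4 URI.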

jsalfity commented 6 months ago

Progress: Set up Google Cloud through a personal account. Cloned the code and ran it via Cloud Shell. Main addition:

https://github.com/jsalfity/task_decomposition/blob/1b4cf39d9ba495d03284ea35b3f91699f092ad30/task_decomposition/utils/querying.py#L182-L186

Stored test data in GCS, in the Cloud Storage bucket 'task_decomposition_data' under the project gen-lang-client-0368774908.

Error: Generic 500 internal server error when calling the API.

jsalfity commented 6 months ago

Forgot to close. This all worked. Thank you, @bchen32