Here is the situation: Gemini can transcribe videos on its own. However, the video has to be downloaded first and then uploaded to Google via the Gemini API.

There's currently no known method of loading a video directly into Gemini using just the YouTube URL, without first saving it to your local hard drive.

Initially, my experience with this approach was very positive.

However, in recent weeks the API has claimed that it hasn't received or found a video to transcribe, even though the upload succeeded.
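One thing that may be worth ruling out when the API reports no video is whether the uploaded file had actually finished processing before the prompt was sent. This is only a guess at a possible cause, not a confirmed explanation. The sketch below polls a file until it leaves the PROCESSING state and raises if it does not end up ACTIVE. The helper name `wait_until_active` and its parameters are hypothetical; the state names are the ones the script further down already checks, plus ACTIVE.

```python
import time


def wait_until_active(uploaded, fetch, poll_seconds=10, timeout_seconds=600,
                      sleep=time.sleep):
    """Poll an uploaded file until it leaves PROCESSING.

    `uploaded` is any object with `.name` and `.state.name`;
    `fetch` is a callable that returns a fresh copy of the file by name.
    Raises if the file never becomes ACTIVE.
    """
    waited = 0
    while uploaded.state.name == "PROCESSING":
        if waited >= timeout_seconds:
            raise TimeoutError(f"{uploaded.name} still processing after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
        uploaded = fetch(uploaded.name)
    if uploaded.state.name != "ACTIVE":
        raise RuntimeError(f"{uploaded.name} ended in state {uploaded.state.name}")
    return uploaded
```

With the SDK used in this post, the call would look roughly like `video = wait_until_active(genai.upload_file(path), genai.get_file)`.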

I've actually been using this API to generate Python scripts, as Markdown, from tutorial videos, particularly from YouTubers who have the unfortunate habit of not providing GitHub links or reserving them for members-only areas. What I needed was mostly Gemini's vision capabilities, but it also handled speech-to-text (STT) without much effort.

As long as the video was no longer than 15 minutes, it worked well for both vision and STT.
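For longer videos, one option is to cut them into segments that stay under that length and process each segment separately. The sketch below only computes the segment boundaries; the 15-minute default reflects the experience described above, not a documented API limit, and the function name `chunk_ranges` is hypothetical. The actual cutting (for example with ffmpeg) is left out.

```python
def chunk_ranges(duration_seconds, max_chunk_seconds=15 * 60):
    """Return (start, end) pairs in seconds that cover the whole duration
    in consecutive chunks of at most max_chunk_seconds each."""
    if duration_seconds <= 0 or max_chunk_seconds <= 0:
        raise ValueError("duration and chunk length must be positive")
    ranges = []
    start = 0
    while start < duration_seconds:
        end = min(start + max_chunk_seconds, duration_seconds)
        ranges.append((start, end))
        start = end
    return ranges


# A 40-minute video split into chunks of at most 15 minutes
print(chunk_ranges(40 * 60))  # [(0, 900), (900, 1800), (1800, 2400)]
```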

Regarding videos without subtitles: adding Whisper or a similar solution could indeed be a viable option. However, it would add complexity and extra resource requirements to the application.

In other words: yes, it is possible, but it is not easy to implement, and it does not currently appear to be working on Gemini's end.
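A minimal sketch of what the Whisper route could look like, assuming the open-source `openai-whisper` package (imported as `whisper`) and ffmpeg are installed. The function name `transcribe_locally` and the choice of the "base" model size are assumptions for illustration only. The model is loaded lazily so the cost is only paid when this path is actually used.

```python
def transcribe_locally(media_path, model=None, model_size="base"):
    """Transcribe an audio or video file on the local machine.

    If no model object is passed in, load a Whisper model of the given size.
    Any object with a compatible `.transcribe(path)` method also works.
    """
    if model is None:
        import whisper  # assumption: provided by the openai-whisper package
        model = whisper.load_model(model_size)
    result = model.transcribe(media_path)
    return result["text"].strip()
```

Larger model sizes are generally more accurate but need more memory and time, which is the resource trade-off mentioned above.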

I know of no scenario where this can be done without downloading the video first.
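The download step itself is not covered by the script in this post. One possible way to do it is sketched below, assuming the third-party `yt-dlp` package (imported as `yt_dlp`) is installed. The function names, the format selectors and the output template are illustrative choices, not part of the original workflow.

```python
import os


def download_options(output_dir=".", audio_only=False):
    """Build yt-dlp options: one file per video, named after the video ID."""
    return {
        # Audio alone is enough for transcription; keep the video for vision tasks
        "format": "bestaudio/best" if audio_only else "best[ext=mp4]/best",
        "outtmpl": os.path.join(output_dir, "%(id)s.%(ext)s"),
        "quiet": True,
    }


def download_video(url, output_dir=".", audio_only=False):
    """Download the video and return the path of the saved file."""
    import yt_dlp  # assumption: installed separately, e.g. via pip

    with yt_dlp.YoutubeDL(download_options(output_dir, audio_only)) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)
```

The returned path can then be passed to a transcription script as its file argument.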

Here is the script I developed for transcribing a file that has already been downloaded:

```
import os
import sys
import time

import google.generativeai as genai
from dotenv import load_dotenv

# Load GOOGLE_API_KEY from a .env file, overriding any value already set
load_dotenv(override=True)

# Initialize the Gemini API
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    print("GOOGLE_API_KEY is not set.")
    sys.exit(1)
genai.configure(api_key=api_key)

# Check if a file path was passed
if len(sys.argv) < 2:
    print("Please provide the path to the audio or video file.")
    sys.exit(1)

audio_file_path = sys.argv[1]

# Upload the audio or video file
audio_file = genai.upload_file(audio_file_path)

# Wait for the uploaded file to be processed
while audio_file.state.name == "PROCESSING":
    print(".", end="", flush=True)
    time.sleep(10)
    audio_file = genai.get_file(audio_file.name)
print()

# Stop if processing did not end in a usable state
if audio_file.state.name != "ACTIVE":
    print(f"Processing ended in state '{audio_file.state.name}'.")
    genai.delete_file(audio_file.name)
    sys.exit(1)

# Tell Gemini to transcribe the file
prompt = "Listen carefully to the following audio file. Transcribe the spoken words accurately. Adjust the spelling accordingly. Remove filler words such as 'um' or similar. There should be no pauses in the text. The text should be easy to read and contain no linguistic or grammatical errors. In any case, generate complete, meaningful and comprehensible sentences from the content."
model = genai.GenerativeModel("models/gemini-1.5-pro-latest")
response = model.generate_content(
    [prompt, audio_file], request_options={"timeout": 600}
)

# Extract the filename (without extension) from the file path
filename, _ = os.path.splitext(os.path.basename(audio_file_path))

# Save the result to a text file with the same name as the input file
output_file = f"{filename}.txt"
with open(output_file, "w", encoding="utf-8") as f:
    f.write(response.text)

print(f"The transcription has been saved in the file '{output_file}'.")

# Delete the uploaded file from Gemini
genai.delete_file(audio_file.name)
```

transcribe_with_gemini.py Documentation

Overview

The transcribe_with_gemini.py script transcribes spoken content from files using Google's Gemini model through the google-generativeai Python SDK. It takes an audio or video file as input, uploads it via the Gemini API, and prompts the model to transcribe the spoken words into text. The prompt asks for a transcription that is accurate, free of filler words, and grammatically correct. The final transcription is saved in a text file with the same base name as the input file.

Requirements

Python 3.x
The google-generativeai and python-dotenv packages installed
A Google API key with access to the Gemini model, available as GOOGLE_API_KEY (for example in a .env file)

How to Use

Prepare Your Environment: Ensure Python, the google-generativeai SDK and python-dotenv are installed on your system, and that GOOGLE_API_KEY is set.
Script Invocation: Run the script from the command line, passing the path to the audio or video file you want to transcribe as an argument.
```
python transcribe_with_gemini.py path/to/your/file.mp3
```
Replace path/to/your/file.mp3 with the actual path to your audio or video file.
Supported File Types: The script accepts various audio and video file formats. Ensure your file is in a format that the Gemini API can process.
Transcription Process:
- The script first checks if a file path was provided.
- It then uploads the file via the Gemini API.
- It waits, polling every 10 seconds, until the uploaded file has been processed.
- A detailed prompt is sent to the Gemini model, asking for an accurate transcription without filler words.
- The model processes the audio content and returns the transcription.
- The transcription is saved in a .txt file with the same base name as the input file.
Output: The transcription text file is saved in the current working directory, that is, the directory the script is run from. A message is printed to the console indicating the name of the output file.
Cleanup: After the transcription is saved, the uploaded file is deleted from the Gemini API's file storage so that it does not remain there.
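The "Supported File Types" step above could be backed by a small pre-flight check before uploading. The sketch below uses only Python's standard mimetypes module and guesses from the file extension; the function name `media_kind` is hypothetical, and passing this check does not guarantee that the Gemini API accepts the format.

```python
import mimetypes


def media_kind(path):
    """Return "audio" or "video" based on the file extension, else None."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return None
    kind = mime.split("/", 1)[0]
    return kind if kind in ("audio", "video") else None


for name in ("talk.mp3", "clip.mp4", "notes.txt"):
    print(name, media_kind(name))
```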

Notes

The script requires an active internet connection to upload files and communicate with the Gemini API.
Ensure you have read access to the input file and that your API key is allowed to upload files.
The generate_content request uses a timeout of 600 seconds, which can be adjusted based on file size and complexity.

Conclusion

The transcribe_with_gemini.py script offers a convenient way to transcribe audio and video files with Gemini. Automating the process saves time; the accuracy and readability of the output text depend on the model and the prompt.

shihabcodes / Gemini-YT-Transcript-Summarizer