Open hakkm opened 6 months ago
The situation is as follows: Gemini is capable of transcribing videos on its own. However, this process requires the video to be downloaded once and then uploaded to Google via the Gemini API.
There's currently no known method of loading a video directly into Gemini using just the YouTube URL, without first saving it to your local hard drive.
Initially, my experience with this approach was very positive.
However, in recent weeks the API has claimed that it hasn't received or found a video to transcribe, even though the uploads themselves succeeded.
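I can't confirm the cause of that symptom. One thing worth ruling out is sending the request before the uploaded file is ready, or after processing has failed. The following is a minimal sketch of a stricter wait step. `wait_until_active` is a hypothetical helper name; the state names `PROCESSING` and `ACTIVE` are the ones the File API reports, and `get_file` is meant to be `genai.get_file`.

```python
import time


def wait_until_active(file, get_file, poll_seconds=10, max_wait_seconds=600,
                      sleep=time.sleep):
    """Poll an uploaded file until it leaves PROCESSING, then require ACTIVE."""
    waited = 0
    while file.state.name == "PROCESSING":
        if waited >= max_wait_seconds:
            raise TimeoutError(f"{file.name} still PROCESSING after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
        # Re-fetch the file metadata to see the current state.
        file = get_file(file.name)
    if file.state.name != "ACTIVE":
        # Anything other than ACTIVE (for example FAILED) cannot be used in a prompt.
        raise RuntimeError(f"{file.name} ended in state {file.state.name}")
    return file


# Intended use with the SDK (not executed here):
# audio_file = wait_until_active(genai.upload_file(path), genai.get_file)
```

If this raises, the problem is on the upload or processing side rather than in the prompt, which at least narrows things down.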
I've actually been using this API to generate Python scripts as Markdown from tutorial videos, particularly from YouTubers who have the unfortunate habit of not providing GitHub links or reserving them for members-only areas. My need was more for Gemini's vision capabilities, but it was perfectly capable of generating STT without much effort.
As long as the video was not longer than 15 minutes, it was useful for Vision-Capabilities and STT.
Regarding the issue of handling videos without subtitles, implementing Whisper or a similar solution could indeed be a viable option. However, it's worth noting that this would introduce additional complexity and potential resource constraints to the application.
In other words: yes, it is possible, but it is not easy to implement and does not currently appear to be working on Gemini's end.
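To make the Whisper option concrete, here is a rough sketch of a subtitles-first flow with a local speech-to-text fallback. It assumes the `openai-whisper` package and ffmpeg are installed. `get_transcript` and the `fetch_subtitles` callable are hypothetical names, standing in for whatever the application already uses to look up subtitles.

```python
def whisper_transcribe(media_path, model_name="base"):
    """Transcribe a local file with openai-whisper (needs ffmpeg on the PATH)."""
    import whisper  # imported lazily so the fallback logic below works without it

    model = whisper.load_model(model_name)
    return model.transcribe(media_path)["text"].strip()


def get_transcript(media_path, fetch_subtitles, transcribe=whisper_transcribe):
    """Prefer existing subtitles; fall back to speech-to-text when there are none."""
    subtitles = fetch_subtitles(media_path)  # hypothetical: returns text or None
    if subtitles:
        return subtitles, "subtitles"
    return transcribe(media_path), "stt"
```

Larger Whisper models are slower and need more memory, which is where the resource constraints mentioned above come in.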
I know of no scenario where this can be done without downloading the video first.
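For the download step itself, one option is `yt-dlp`. This is my assumption for illustration, not something the script below depends on. The helper names are hypothetical; the options used (`format`, `outtmpl`, `quiet`, `noplaylist`) are standard yt-dlp options.

```python
import os


def build_ydl_opts(out_dir, audio_only=True):
    """Build yt-dlp options; audio-only keeps the upload small when only STT is needed."""
    return {
        "format": "bestaudio/best" if audio_only else "best",
        "outtmpl": os.path.join(out_dir, "%(id)s.%(ext)s"),
        "quiet": True,
        "noplaylist": True,
    }


def download_media(url, out_dir=".", audio_only=True):
    """Download one video (or just its audio track) and return the local file path."""
    import yt_dlp  # third-party: pip install yt-dlp

    with yt_dlp.YoutubeDL(build_ydl_opts(out_dir, audio_only)) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)
```

Pass `audio_only=False` when the vision capabilities are needed, since the audio track alone carries no frames.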
Here is a script I developed for this task:
import os
from dotenv import load_dotenv
import google.generativeai as genai
import sys
import time
load_dotenv(override=True)
# Initialize the Gemini API
api_key = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=api_key)
# Check if a file path was passed
if len(sys.argv) < 2:
    print("Please provide the path to the audio file.")
    sys.exit(1)
audio_file_path = sys.argv[1]
# Upload the audio file
audio_file = genai.upload_file(audio_file_path)
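# Wait for the audio file to be processed
while audio_file.state.name == "PROCESSING":
    print(".", end="", flush=True)
    time.sleep(10)
    audio_file = genai.get_file(audio_file.name)
# Stop early if processing failed, instead of sending an unusable file
if audio_file.state.name == "FAILED":
    sys.exit(f"Gemini could not process '{audio_file_path}'.")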
# Tell Gemini to transcribe the audio file
prompt = "Listen carefully to the following audio file. Transcribe the spoken words accurately. Adjust the spelling accordingly. Remove filler words such as 'um' or similar. There should be no pauses in the text. The text should be easy to read and contain no linguistic or grammatical errors. In any case, generate complete, meaningful and comprehensible sentences from the content."
model = genai.GenerativeModel("models/gemini-1.5-pro-latest")
response = model.generate_content(
    [prompt, audio_file], request_options={"timeout": 600}
)
# Extract the filename from the file path
filename, file_extension = os.path.splitext(os.path.basename(audio_file_path))
# Save the result to a text file with the same name as the audio file
output_file = f"{filename}.txt"
with open(output_file, "w", encoding="utf-8") as f:
    f.write(response.text)
print(f"The transcription has been saved in the file '{output_file}'.")
# Delete the uploaded files from Gemini
genai.delete_file(audio_file.name)
The transcribe_with_gemini.py script is designed to transcribe audio content from files using the Gemini model through the GenAI SDK. It takes an audio or video file as input, uploads it to GenAI, and asks the model to transcribe the spoken words into text. The prompt asks for a transcription that is accurate, free of filler words, and grammatically correct. The final transcription is saved in a text file with the same base name as the input file.
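Prepare Your Environment: Ensure Python, the GenAI SDK (google-generativeai), and python-dotenv are installed on your system, and that GOOGLE_API_KEY is set in your environment or in a .env file.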
Script Invocation: Run the script from the command line by passing the path to the audio or video file you want to transcribe as an argument.
python transcribe_with_gemini.py path/to/your/file.mp3
Replace path/to/your/file.mp3 with the actual path to your audio or video file.
Supported File Types: The script supports various audio and video file formats. Ensure your file is in a compatible format that GenAI can process.
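Transcription Process: The script uploads the file, waits until processing has finished, sends it to the model together with the transcription prompt, and writes the response to a .txt file with the same name as the input file.
Output: The transcription text file will be saved in the current working directory, which is not necessarily the directory of the script or of the input file. A message will be printed to the console indicating the name of the output file.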
Cleanup: After the transcription is saved, the uploaded file is deleted from GenAI to ensure privacy and data security.
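One caveat on the Cleanup step: in the script as written, the delete call only runs if everything before it succeeds. A small sketch of a variant that always removes the upload is shown here. `transcribe_and_cleanup` is a hypothetical helper, and the three callables are meant to be thin wrappers around `genai.upload_file`, `model.generate_content`, and `genai.delete_file`.

```python
def transcribe_and_cleanup(path, prompt, upload, generate, delete):
    """Upload a file, transcribe it, and always delete the remote copy afterwards."""
    uploaded = upload(path)
    try:
        return generate([prompt, uploaded])
    finally:
        # Runs on success and on failure, so the upload is not left behind.
        delete(uploaded.name)


# Intended use with the SDK (not executed here):
# text = transcribe_and_cleanup(
#     audio_file_path,
#     prompt,
#     genai.upload_file,
#     lambda parts: model.generate_content(parts).text,
#     genai.delete_file,
# )
```

Passing the SDK calls in as arguments also makes the flow easy to exercise with stand-ins, without touching the API.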
The transcribe_with_gemini.py script offers a convenient way to transcribe audio and video files using the power of AI. By automating the transcription process, it saves time and aims for accurate, readable output text.
we need to handle videos without subtitles.