shihabcodes / Gemini-YT-Transcript-Summarizer

A Python app using the YouTube Transcript API and Google's Gemini Pro generative AI model for automatic summarization. Built with Streamlit, it allows users to input YouTube video links and receive detailed summaries, enhancing accessibility and efficiency.
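The app takes full YouTube links as input, while the transcript API works with bare video IDs, so a link has to be reduced to its ID first. Below is a minimal standard-library sketch of that step. `extract_video_id` is a hypothetical helper, not code from this repository, and the 11-character length check is an assumption based on the current YouTube ID format.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube link, or None if none is found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Standard links carry the ID in the "v" query parameter
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith(("/shorts/", "/embed/")):
            # /shorts/<id> and /embed/<id> carry it as the second path segment
            candidate = parsed.path.split("/")[2]
        else:
            candidate = ""
    else:
        candidate = ""
    # Assumption: YouTube video IDs are 11 characters long
    return candidate if len(candidate) == 11 else None
```

Returning `None` instead of raising lets the Streamlit UI show a friendly "invalid link" message without a try/except around every call.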
https://ytgeminisummarizer.streamlit.app/

Some videos don't have subtitles: the app should get the transcript with Whisper, for example #2

Open hakkm opened 6 months ago

hakkm commented 6 months ago

We need to handle videos without subtitles.
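Handling this comes down to a fallback: use the subtitles when they exist, otherwise transcribe the audio. A minimal sketch of that control flow follows, with the caption source and the speech-to-text backend passed in as functions so either can be swapped. `get_transcript` is hypothetical, and the convention that the caption fetcher raises `LookupError` when a video has no subtitles is an assumption; a real fetcher would translate its library's own "no transcript" exceptions into it.

```python
from typing import Callable, Optional


def get_transcript(
    video_id: str,
    fetch_captions: Callable[[str], Optional[str]],
    transcribe_audio: Callable[[str], str],
) -> str:
    """Prefer existing subtitles; fall back to speech-to-text when there are none."""
    try:
        captions = fetch_captions(video_id)
    except LookupError:
        # Assumption: the caption fetcher raises LookupError for videos without subtitles
        captions = None
    if captions:
        return captions
    # No usable subtitles: transcribe the audio instead (e.g. with Whisper or Gemini)
    return transcribe_audio(video_id)
```

Keeping the subtitle path first means the slower, more expensive transcription only runs for the videos that actually need it.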

mmuyakwa commented 2 months ago

The situation is as follows: Gemini is capable of transcribing videos on its own. However, this process requires the video to be downloaded once and then uploaded to Google via the Gemini API.

There's currently no known method of loading a video directly into Gemini using just the YouTube URL, without first saving it to your local hard drive.
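The download step could look like the sketch below, which fetches only the audio track, since that is all speech-to-text needs. It uses the third-party yt-dlp library, which is an assumption: the thread does not name a downloader. The helper names are hypothetical. The resulting file can then be passed to the transcription script shared further down in this thread.

```python
import os


def build_audio_download_options(output_dir: str) -> dict:
    """yt-dlp options for fetching only the audio track of a video."""
    return {
        # Prefer an audio-only stream; fall back to the best combined format
        "format": "bestaudio/best",
        # Name the local file after the video ID, e.g. <output_dir>/<id>.webm
        "outtmpl": os.path.join(output_dir, "%(id)s.%(ext)s"),
        "quiet": True,
    }


def download_audio(url: str, output_dir: str = ".") -> str:
    """Download a video's audio track and return the local file path."""
    import yt_dlp  # third-party dependency (assumption): pip install yt-dlp

    with yt_dlp.YoutubeDL(build_audio_download_options(output_dir)) as ydl:
        info = ydl.extract_info(url, download=True)
        # prepare_filename() applies "outtmpl" to the downloaded video's metadata
        return ydl.prepare_filename(info)
```

The import sits inside `download_audio` so the options can be inspected or tested without yt-dlp installed.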

Initially, my experience with this approach was very positive.

However, in recent weeks the API has responded that it did not receive or could not find a video to transcribe, even though the uploads succeeded.

I've actually been using this API to generate Python scripts in Markdown from tutorial videos, particularly from YouTubers who have the unfortunate habit of not providing GitHub links or of reserving them for members-only areas. My need was more for Gemini's vision capabilities, but it also handled speech-to-text (STT) without much effort.

As long as a video was no longer than 15 minutes, it worked well for both vision tasks and STT.
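Given that results were only reliable up to about 15 minutes, it may be worth checking a file's length before uploading it. This standard-library sketch handles WAV files only, computing duration as frame count divided by sample rate. The helper names and the limit constant are hypothetical, and the limit reflects the experience reported here, not an official quota. Other formats such as MP3 or MP4 would need a separate tool (for example ffprobe) to read their duration.

```python
import wave

# The ~15-minute limit observed in this thread, not an official Gemini quota
MAX_SECONDS = 15 * 60


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file: number of frames divided by frames per second."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_short_enough(path: str, limit: float = MAX_SECONDS) -> bool:
    """True if the WAV file is no longer than `limit` seconds."""
    return wav_duration_seconds(path) <= limit
```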

Regarding the issue of handling videos without subtitles, implementing Whisper or a similar solution could indeed be a viable option. However, it's worth noting that this would introduce additional complexity and potential resource constraints to the application.
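A local Whisper fallback could be as small as the sketch below, which uses the open-source openai-whisper package. The package choice and the "base" model size are assumptions. This is also where the resource cost shows up: the first call downloads model weights, transcription runs on the local CPU or GPU, and the package expects ffmpeg to be installed. Larger models are generally more accurate but slower.

```python
def transcribe_with_whisper(audio_path: str, model_name: str = "base") -> str:
    """Local speech-to-text with the openai-whisper package (assumed dependency)."""
    import whisper  # pip install openai-whisper; also requires ffmpeg on PATH

    # Downloads the model weights on first use, then loads them into memory
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    # transcribe() returns a dict; the "text" key holds the full transcript
    return result["text"].strip()
```

The import is deferred so the rest of the app still runs where Whisper is not installed, for example on a lightweight Streamlit deployment.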

In other words: yes, it is possible, but it is not easy to implement and does not currently appear to be working on Gemini's end.

I know of no scenario where this can be done without downloading the video first.

mmuyakwa commented 2 months ago

Here is a script I developed for this task:

import os
import sys
import time

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv(override=True)

# Initialize the Gemini API with the key from the environment (or a .env file)
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    print("GOOGLE_API_KEY is not set. Add it to your environment or a .env file.")
    sys.exit(1)
genai.configure(api_key=api_key)

# Check if a file path was passed
if len(sys.argv) < 2:
    print("Please provide the path to the audio file.")
    sys.exit(1)

audio_file_path = sys.argv[1]

# Upload the audio file
audio_file = genai.upload_file(audio_file_path)

try:
    # Wait for the audio file to be processed, polling every 10 seconds
    while audio_file.state.name == "PROCESSING":
        print(".", end="", flush=True)
        time.sleep(10)
        audio_file = genai.get_file(audio_file.name)
    print()

    # Stop early if Gemini could not process the upload
    if audio_file.state.name == "FAILED":
        raise RuntimeError(f"Gemini failed to process '{audio_file_path}'.")

    # Tell Gemini to transcribe the audio file
    prompt = "Listen carefully to the following audio file. Transcribe the spoken words accurately. Adjust the spelling accordingly. Remove filler words such as 'um' or similar. There should be no pauses in the text. The text should be easy to read and contain no linguistic or grammatical errors. In any case, generate complete, meaningful and comprehensible sentences from the content."
    model = genai.GenerativeModel("models/gemini-1.5-pro-latest")
    response = model.generate_content(
        [prompt, audio_file], request_options={"timeout": 600}
    )

    # Extract the filename (without extension) from the file path
    filename, _ = os.path.splitext(os.path.basename(audio_file_path))

    # Save the result to a text file with the same name as the audio file,
    # in the current working directory
    output_file = f"{filename}.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(response.text)

    print(f"The transcription has been saved in the file '{output_file}'.")
finally:
    # Delete the uploaded file from Gemini, even if transcription failed
    genai.delete_file(audio_file.name)
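The script's wait loop polls with no upper bound, so a file that never leaves the PROCESSING state would hang it indefinitely. A bounded variant might look like this sketch. `wait_until_active` is a hypothetical helper, and the state names it checks (`PROCESSING`, `ACTIVE`, `FAILED`) are the ones the Gemini File API reports. The state lookup is passed in as a function so the loop can be exercised without network access.

```python
import time
from typing import Callable


def wait_until_active(
    get_state: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll an uploaded file's state until it leaves PROCESSING or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    state = get_state()
    while state == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError("File is still PROCESSING after the timeout.")
        time.sleep(poll_seconds)
        state = get_state()
    # Anything other than ACTIVE (e.g. FAILED) means the file is not usable
    if state != "ACTIVE":
        raise RuntimeError(f"File processing ended in state {state!r}.")
    return state
```

In the script, it could replace the `while` loop as `wait_until_active(lambda: genai.get_file(audio_file.name).state.name)`.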

transcribe_with_gemini.py Documentation

Overview

The transcribe_with_gemini.py script transcribes audio content from files using Google's Gemini model through the google-generativeai SDK. It takes an audio or video file as input, uploads it via the Gemini File API, and asks the model to transcribe the spoken words into text. The prompt instructs the model to produce an accurate, grammatically correct transcript without filler words. The final transcription is saved in a text file with the same base name as the input file.

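Requirements

  • Python 3
  • The google-generativeai and python-dotenv packages
  • A Gemini API key in the GOOGLE_API_KEY environment variable (or in a .env file)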

How to Use

  1. Prepare Your Environment: Ensure Python and the google-generativeai SDK are installed on your system, and that your API key is available as GOOGLE_API_KEY.

  2. Script Invocation: Run the script from the command line by passing the path to the audio or video file you want to transcribe as an argument.

    python transcribe_with_gemini.py path/to/your/file.mp3

    Replace path/to/your/file.mp3 with the actual path to your audio or video file.

  3. Supported File Types: The script uploads the file as-is, so it works with the audio and video formats that the Gemini File API can process. Ensure your file is in one of those formats.

  4. Transcription Process:

    • The script first checks that a file path was provided.
    • It uploads the file through the Gemini File API and waits until the upload has finished processing.
    • It sends a detailed prompt instructing the Gemini model to transcribe accurately and omit filler words.
    • The model processes the audio content and returns the transcription.
    • The transcription is saved in a .txt file with the same name as the input file.
  5. Output: The transcription text file is saved in the current working directory, which is not necessarily the script's directory. A message printed to the console gives the name of the output file.

  6. Cleanup: After the transcription is saved, the uploaded file is deleted from Gemini so that it does not remain stored on Google's side.
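Since only formats Gemini can process will work (step 3), a quick local pre-check can reject obvious mismatches, such as a text file, before anything is uploaded. This sketch only inspects the file extension with the standard mimetypes module. `media_mime_type` is a hypothetical helper, and passing the check does not guarantee that Gemini supports that particular codec.

```python
import mimetypes
from typing import Optional


def media_mime_type(path: str) -> Optional[str]:
    """Guess the MIME type from the extension; return it only for audio/video files."""
    mime, _ = mimetypes.guess_type(path)
    # Keep only top-level "audio/..." and "video/..." types
    if mime and mime.split("/")[0] in ("audio", "video"):
        return mime
    return None
```

A caller could exit with an error message when this returns `None` instead of spending an upload and a model call on a file that cannot be transcribed.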

Notes

Conclusion

The transcribe_with_gemini.py script offers a convenient way to transcribe audio and video files with Gemini. By automating the upload, transcription, and cleanup steps, it saves time, and its prompt steers the model toward clean, readable text.