jdepoix / youtube-transcript-api

This is a python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require an API key nor a headless browser, like other selenium based solutions do!
MIT License
2.54k stars, 279 forks

getting error Too Many Requests for url #256

Closed Areesha1801 closed 4 months ago

Areesha1801 commented 4 months ago

I am trying to retrieve the transcripts of some YouTube videos. I listed the video URLs in a CSV file and, after extraction, tried to write the extracted contents to a separate txt file for each video. I am getting the following output:

Error extracting transcript for V1: Could not retrieve a transcript for the video "URL mentioned" Client Error: Too Many Requests for url: "URL mentioned" This is most likely caused by:

Request to YouTube failed: 9kpl7AtE03c

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

Following is my code:

import csv
import os
import re

from langchain_community.document_loaders import YoutubeLoader
from langchain_community.document_loaders.youtube import TranscriptFormat


def sanitize_filename(filename):
    # Remove characters that are not allowed in file names
    return re.sub(r'[\/*?:"<>|]', "", filename)


def document_to_string(document):
    # LangChain Document objects expose their text as page_content;
    # fall back to a text attribute or str() for other kinds of objects
    for attr in ("page_content", "text"):
        if hasattr(document, attr):
            return getattr(document, attr)
    return str(document)


def extract_and_save_transcripts(csv_filepath):
    # Ensure the data directory exists
    data_dir = 'data'
    os.makedirs(data_dir, exist_ok=True)
    with open(csv_filepath, mode='r', newline='', encoding='utf-8-sig') as csvfile:
        # DictReader consumes the header row itself, so no manual skipping is needed
        reader = csv.DictReader(csvfile)
        # print(f"CSV Headers: {reader.fieldnames}")  # This will print the actual headers of your CSV
        for row in reader:
            session = row['Session']
            video_id = row['VideoID']
            video_name = sanitize_filename(row['VideoName'])
            url = row['VideoURL']

            # Initialize the YoutubeLoader with the video URL
            loader = YoutubeLoader.from_youtube_url(
                url,
                language=["de"],  # Specify other languages if necessary
                translation="en",
                transcript_format=TranscriptFormat.TEXT,
            )

            # Load the transcript
            try:
                transcript = loader.load()
                if isinstance(transcript, list):
                    transcript = '\n'.join(document_to_string(doc) for doc in transcript)
                elif not isinstance(transcript, str):
                    transcript = document_to_string(transcript)
                # Defining the filename for the transcript text file
                filename = os.path.join(data_dir, f"{session}_{video_id}_{video_name}.txt")
                # Save the transcript to a text file
                with open(filename, 'w', encoding='utf-8') as text_file:
                    text_file.write(transcript)
                print(f"Transcript saved: {filename}")
            except Exception as e:
                print(f"Error extracting transcript for {video_id}: {e}")


csv_filepath = 'data/VideoURLs.csv'
extract_and_save_transcripts(csv_filepath)

jdepoix commented 4 months ago

Hi @Areesha1801, I'm not familiar with YoutubeLoader or how it works, so I can't speak to that.

Using this module I can run the following without a problem:

YouTubeTranscriptApi.list_transcripts("9kpl7AtE03c").find_transcript(["de"]).translate("en").fetch()

so it doesn't seem like it is an issue related to this module.
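For anyone who wants to repeat that check outside of LangChain, the one-liner can be wrapped in a small standalone script. This is only a sketch: it assumes the `youtube_transcript_api` interface used in this thread (`list_transcripts`, `find_transcript`, `translate`, `fetch`) and that `fetch()` returns a list of entries with a `text` key, which may not hold for other releases of the library. The helper names `transcript_to_text`, `fetch_translated_transcript` and `main` are made up for illustration.

```python
def transcript_to_text(entries):
    """Join the text of each transcript entry into one string, one entry per line."""
    # Assumption: each entry is a dict with a "text" key.
    return "\n".join(entry["text"] for entry in entries)


def fetch_translated_transcript(video_id, source_languages=("de",), target_language="en"):
    """Fetch a transcript directly with youtube_transcript_api, bypassing YoutubeLoader."""
    # Imported lazily so the text helper above works without the package installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    # Same chain of calls as the one-liner quoted in this thread.
    transcript = YouTubeTranscriptApi.list_transcripts(video_id).find_transcript(
        list(source_languages)
    )
    return transcript.translate(target_language).fetch()


def main():
    # Video ID taken from the error message in this issue.
    entries = fetch_translated_transcript("9kpl7AtE03c")
    print(transcript_to_text(entries)[:500])
```

Calling `main()` performs a real request to YouTube. If it succeeds while the `YoutubeLoader` version fails for the same video, the problem most likely sits in the layer above this module or in how often requests are being sent.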

Can you open the URL in the Exception? Maybe the from_youtube_url method doesn't parse the video ID from the video URL correctly. You should probably isolate the issue a bit further to find out where it is coming from, since there are a few different modules at play here.
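One way to narrow this down is to check, independently of any loader, which video ID each URL in the CSV actually contains. The helper below is a hypothetical stand-in written with the standard library only. It is not how `from_youtube_url` parses URLs, and it only covers a few common URL shapes (`watch?v=`, `youtu.be/`, `embed/`, `shorts/`).

```python
from urllib.parse import parse_qs, urlparse


def parse_video_id(url):
    """Return the video ID from a common YouTube URL shape, or None if none is found."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        candidate = parsed.path.lstrip("/").split("/")[0]
        return candidate or None
    if host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Standard links: https://www.youtube.com/watch?v=<id>
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in ("embed", "shorts") and parts[1]:
            # Embed and Shorts links: https://www.youtube.com/embed/<id>
            return parts[1]
    return None
```

Printing `parse_video_id(row['VideoURL'])` next to `row['VideoID']` for each CSV row shows whether the two agree. A `None` or a mismatch would point at the URL rather than at the transcript request itself.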