jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require an API key or a headless browser, unlike other Selenium-based solutions!
MIT License

abnormal behaviour running the transcript on heroku #279

Closed oandreazza closed 2 months ago

oandreazza commented 2 months ago


To Reproduce

Steps to reproduce the behavior:

What code / cli command are you executing?

I'm running a script that fetches the transcript of a YouTube video using this code:

from flask import Flask, request, Response
from youtube_transcript_api import YouTubeTranscriptApi

app = Flask(__name__)

@app.route('/transcript/<video_id>')
def get_transcript(video_id):
    try:
        # Fetch the video's transcript in Portuguese, falling back to English
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['pt','en'])

        # Concatenate the text of each transcript entry
        transcript_text = '\n'.join([entry['text'] for entry in transcript])

        # Return the transcript as plain text
        return Response(transcript_text, mimetype='text/plain')
    except Exception as e:
        return Response(f"Error fetching the transcript: {e}", status=400, mimetype='text/plain')

Which Python version are you using?

Python 3.12.3

Which version of youtube-transcript-api are you using?

youtube-transcript-api 0.6.2

Expected behavior

Receive the transcript when I access MY_URL/transcript/<video_id>

Actual behaviour

When I run the code on localhost, I get the transcript every time, but when I run it on Heroku, I often get the following error:

Could not retrieve a transcript for the video https://www.youtube.com/watch?v=mNgDS3GG8GQ! This is most likely caused by:

Subtitles are disabled for this video

But as mentioned, if I run the same video id on localhost I can get the transcript, and subtitles are enabled for this video.

adnanhassan23 commented 2 months ago

I'm also facing this issue on Google Cloud Run. I think it's an IP-address-related issue. Have you resolved it? @jdepoix, can you please help?

oandreazza commented 2 months ago

No, it's quite random. I try, get the error, then try again a few minutes later and it works.

jdepoix commented 2 months ago

Hi @adnanhassan23 @oandreazza, I've had multiple reports of similar problems in the past. Generally speaking, YouTube will (temporarily) block an IP address if it receives too much (or rather, very high-frequency) traffic from it. Depending on the cloud provider/product you're using, you will usually be assigned an IP from a shared pool (unless specified otherwise), so my guess is that these IPs get blocked more easily, since the traffic coming from them is shared between multiple clients. The only way I see to work around this is to have a pool of static IPs which are only used by you. Depending on the product/provider you're using, you should be able to set up something like that, but it will usually be more expensive. Another alternative would be to route the traffic through a VPN or another proxy where you cycle through a set of IPs as they get blocked.
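
The proxy-cycling idea could be sketched roughly as follows. This is only an illustration under assumptions: `fetch_via_proxies` is a hypothetical helper (not part of this module), the proxy URLs are placeholders you would have to supply yourself, and `fetch` is any callable you pass in that takes a video id and a `proxies` dict in the format used by `requests`.

```python
def fetch_via_proxies(video_id, proxy_urls, fetch):
    """Try each proxy in order and return the first successful result.

    `fetch` is a caller-supplied callable: fetch(video_id, proxies) -> transcript.
    `proxy_urls` is a list of proxy URLs (hypothetical, provided by you).
    """
    last_error = None
    for proxy_url in proxy_urls:
        # Same proxy for both schemes, in the dict format `requests` expects.
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            return fetch(video_id, proxies)
        except Exception as err:
            # Assume this proxy's IP is blocked and move on to the next one.
            last_error = err
    if last_error is None:
        raise ValueError("proxy_urls must not be empty")
    # Every proxy failed: surface the last error to the caller.
    raise last_error
```

With version 0.6.2, `fetch` could be something like `lambda vid, proxies: YouTubeTranscriptApi.get_transcript(vid, languages=['pt', 'en'], proxies=proxies)`. Note that the sketch treats every exception as a blocked IP, which is a simplification: a video that really has no subtitles would fail on every proxy.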

You could further verify this by catching the error when it occurs and doing a GET request to the YouTube URL https://www.youtube.com/watch?v=VIDEO_ID to check whether all traffic to YouTube is blocked.
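
A minimal sketch of that check, using only the standard library. The names `probe_watch_page` and `fetch_with_diagnosis` are hypothetical helpers made up for this example, and `fetch_transcript` is whatever callable you already use to get the transcript. How to read the result is also an assumption: an error status or `None` would suggest that all traffic from that IP is affected, while a 200 would suggest the problem is narrower.

```python
import urllib.error
import urllib.request

WATCH_URL = "https://www.youtube.com/watch?v={video_id}"


def probe_watch_page(video_id, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status of a plain GET to the watch page.

    Returns None if the request fails before any HTTP response arrives.
    """
    url = WATCH_URL.format(video_id=video_id)
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except urllib.error.URLError:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return None


def fetch_with_diagnosis(video_id, fetch_transcript, probe=probe_watch_page):
    """Return (transcript, None) on success or (None, diagnosis) on failure."""
    try:
        return fetch_transcript(video_id), None
    except Exception as err:
        # On failure, record what a direct request to YouTube looks like.
        return None, {"error": str(err), "watch_page_status": probe(video_id)}
```

In the Flask route above, the diagnosis dict could be logged whenever the transcript lookup fails, so that failures on Heroku can be compared with the watch-page status seen at the same moment.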

I will close this now, as this isn't really an issue which can be solved by this module, but feel free to discuss further and share what you did and what works / didn't work for you! 🙂