jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, as other Selenium-based solutions do!
MIT License

Inconsistent get_transcript results #60

Closed jamirian071498 closed 4 years ago

jamirian071498 commented 4 years ago

When I repeatedly call "get_transcript" with the same video id, sometimes I get back the transcript, while other times I get back a "VideoUnavailable" error. As far as I can tell, this can happen with any YouTube video. I am using version 0.3.1.

jdepoix commented 4 years ago

Are you making those calls within a short amount of time, or over a longer time span? And approximately how many calls are we talking about? While I was never able to replicate this, multiple users have reported that they ran into YouTube rate limits when making many requests in a short amount of time. Unfortunately, there's nothing this module can do about limitations set up by YouTube itself. However, I could probably add a custom Exception for this if I could find out what response YouTube actually sends in this case, but as I said, I just can't replicate this.
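If the failures really are intermittent, one common mitigation is to retry with exponential backoff. The following is only a minimal sketch under stated assumptions, not something this module provides: `call_with_backoff` is a hypothetical helper, `fetch` stands for any callable that takes a video id (for example the "get_transcript" function mentioned above), and the delays are illustrative. Backoff does not lift a rate limit or an IP block; it only spaces out the attempts.

```python
import random
import time


def call_with_backoff(fetch, video_id, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fetch(video_id), retrying with exponential backoff plus jitter.

    Hypothetical helper: `fetch` is any callable taking a video id.
    Narrow `retry_on` to the specific error you see (the thread mentions
    "VideoUnavailable") so that unrelated bugs are not retried.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(video_id)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

A call might then look like `call_with_backoff(YouTubeTranscriptApi.get_transcript, video_id)`, assuming the function is exposed under that name in the version you are using.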

joacosaralegui commented 4 years ago

Hey guys. I see how this may be a follow-up to this issue. As far as I was able to determine, YouTube tends to block this kind of continuous request traffic from servers with a STATIC IP.

In my case, for example, I had a Google Cloud VM running with a fixed external IP, and after every few days of requests (around 2000 requests per day, for 3 or 4 days) it got banned. Then I would switch the IP of the server and it was good to go for another 4 days, and then it was blocked again and I had to go back to the original IP, and so forth.

This is not the case with our local IPs, as the ISP usually gives you a dynamic IP or hides it behind a double NAT, so maybe that's the reason you can't replicate the error.

What I did to fix it was to set up a Google Cloud VM without an external IP, and create a Google Cloud NAT to manage that VM's connection to the internet, so now the requests go out with several different IPs, managed by Google. So far, it has worked OK.

So my recommendation would be to try to avoid static IPs and that kind of stuff. Don't know if that's your case, but hope it helps!
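The NAT setup above is infrastructure rather than code, but the underlying idea (not sending every request from one fixed address) can also be approximated on the client side by cycling through a pool of proxies. This is a rough sketch under assumptions: `proxy_rotation` is a hypothetical helper, the proxy URLs are placeholders, and the dictionaries it yields follow the `proxies` mapping format of the `requests` library. As noted further down the thread, rotating proxies can still end up blocked by YouTube.

```python
from itertools import cycle


def proxy_rotation(proxy_urls):
    """Return an endless iterator of requests-style proxy mappings.

    Hypothetical helper: each item routes both http and https traffic
    through the next proxy URL in the pool, wrapping around at the end.
    """
    urls = list(proxy_urls)
    if not urls:
        raise ValueError("at least one proxy URL is required")
    return ({"http": url, "https": url} for url in cycle(urls))


# Placeholder endpoints; substitute proxies you actually control.
proxies = proxy_rotation([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
])
# Each outgoing request then takes the next mapping, for example:
#   requests.get(url, proxies=next(proxies))
```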

jdepoix commented 4 years ago

@joacosaralegui thanks for the insights!

jamirian071498 commented 4 years ago

I do not believe my issue has to do with the YouTube API rate limits. The rate limit is apparently 10000 requests per day and I am using somewhere around 5 requests per hour.

One thing I have noticed is that the API works consistently when I call it locally, but not when I call it from an AWS Lambda. Any help would be greatly appreciated!

jdepoix commented 4 years ago

So if you're only having this issue when using AWS, I could imagine that AWS is using an IP address for those requests which is also used by others, so your rate limit is shared between multiple users. However, this could be difficult to verify or work around. Is there a way to have Lambdas use a dedicated IP for external calls? 🤔

jdepoix commented 4 years ago

@jamirian071498 I will close this now as I don't think there's a problem at hand which can be solved by this module, but feel free to share your learnings if you find out more about this problem.

HappyNinja2 commented 2 years ago

@jdepoix is it possible to get a YouTube API key and use it in the library to avoid rate limits? If not, which part of the code should be modified to support it?

jdepoix commented 2 years ago

@HappyNinja2 this is not possible, as this module does not use the official YouTube API. It simply calls the endpoints which are used by the YouTube frontend.

mgoldenbe commented 1 year ago

@jdepoix @HappyNinja2 I believe the YouTube API does not let one obtain captions of videos one does not own. I wonder whether there is a way to get that privilege...

mgoldenbe commented 1 year ago

> @HappyNinja2 this is not possible, as this module does not use the official YouTube API. It simply calls the endpoints which are used by the YouTube frontend.

@jdepoix Does this mean that YouTube can shut down this module at any moment by simply setting up a CORS_ORIGIN policy to only allow requests from youtube.com? I am curious why they have not done it yet. More importantly, in this case it seems dangerous to base a business on this module. What are your thoughts on this?

jdepoix commented 1 year ago

@mgoldenbe There are many ways in which YouTube could prevent this module from working if that were what they were after. This hasn't happened so far, although this module has been around for a while, but I can't make any promises as to whether it will stay that way. I am afraid you will have to decide for yourself whether that is too big a risk for you to use this in production.

mgoldenbe commented 1 year ago

@jdepoix Given that the official YouTube API does not allow one to obtain captions of YouTube videos one does not own, have you heard of any option, including paid ones, that one can build upon with reasonable certainty?

jdepoix commented 1 year ago

@mgoldenbe No, unfortunately not. Technically speaking, I don't think there is a way to work around these uncertainties, so I don't think there ever will be an option without these limitations, unless YouTube decides to add these options to their official API.

mgoldenbe commented 9 months ago

There was a comment here about being able to get captions using the YouTube API. Is there a reason it was removed?

ahoebeke commented 9 months ago

Removed because it was wrong, sorry 🙏 I could have edited it instead. Still nothing after looking for several hours. (As for rate limiting, I tried using a Bright Data rotating IP proxy, but YouTube blocked it.)

mgoldenbe commented 9 months ago

@ahoebeke What happens when you run the code that you posted?

ahoebeke commented 9 months ago

The API provides metadata about a video's captions, but it requires additional OAuth authorization (as the owner of the video) if you want to use the API call to download the transcripts.

Sorry for waking up the old thread like this 😓

mgoldenbe commented 9 months ago

@ahoebeke OK. Thank you.