jdepoix / youtube-transcript-api

This is a Python API that lets you get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike Selenium-based solutions!
MIT License

TranscriptsDisabled But it's not disabled (works locally, fails on Cloud machine) #303

Open atoonk opened 3 months ago

atoonk commented 3 months ago

To Reproduce

using youtube-transcript-api-0.6.2:

cat test.py 
from youtube_transcript_api import YouTubeTranscriptApi

print(YouTubeTranscriptApi.get_transcript('w8rYQ40C9xo'))

outputs:

python3 ./test.py 
Traceback (most recent call last):
  File "/root/border0-plugin/./test.py", line 3, in <module>
    print(YouTubeTranscriptApi.get_transcript('w8rYQ40C9xo'))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/border0-plugin/myenv2/lib/python3.11/site-packages/youtube_transcript_api/_api.py", line 137, in get_transcript
    return cls.list_transcripts(video_id, proxies, cookies).find_transcript(languages).fetch(preserve_formatting=preserve_formatting)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/border0-plugin/myenv2/lib/python3.11/site-packages/youtube_transcript_api/_api.py", line 71, in list_transcripts
    return TranscriptListFetcher(http_client).fetch(video_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/border0-plugin/myenv2/lib/python3.11/site-packages/youtube_transcript_api/_transcripts.py", line 48, in fetch
    self._extract_captions_json(self._fetch_video_html(video_id), video_id),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/border0-plugin/myenv2/lib/python3.11/site-packages/youtube_transcript_api/_transcripts.py", line 62, in _extract_captions_json
    raise TranscriptsDisabled(video_id)
youtube_transcript_api._errors.TranscriptsDisabled: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=w8rYQ40C9xo! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

What code / cli command are you executing?

I am running

from youtube_transcript_api import YouTubeTranscriptApi
print(YouTubeTranscriptApi.get_transcript('w8rYQ40C9xo'))

Which Python version are you using?

Python 3.11.6

Which version of youtube-transcript-api are you using?

youtube-transcript-api-0.6.2

Expected behavior

I expected to receive the English transcript, which I can see in the browser (see screenshot):

[Screenshot: 2024-07-17 at 2:56:23 PM]

Actual behaviour

Same traceback as under "To Reproduce" above, ending in youtube_transcript_api._errors.TranscriptsDisabled.

Ibrahim-Faisal15 commented 3 months ago

Yes, the issue is valid, but it does not seem to occur with the link that YouTube gives us through the Share button.

jdepoix commented 3 months ago

Hi @atoonk, do you only have this issue with this specific video, or all videos you are trying to retrieve? I can retrieve the subtitles for that video without any issues, which usually means that you are being rate-limited by YouTube (which would also mean that this should happen for all videos).
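One way to check the point above (a block should affect all videos, not just one) is to try a handful of unrelated video IDs and see whether every one fails. The following is only a sketch: `classify_failures` and the injected `fetch` callable are hypothetical names for illustration, not part of youtube-transcript-api. In real use, `fetch` would be `YouTubeTranscriptApi.get_transcript`.

```python
from typing import Callable, Iterable


def classify_failures(video_ids: Iterable[str], fetch: Callable[[str], object]) -> str:
    """Return 'ok', 'some-failed', or 'all-failed' for the given video IDs.

    `fetch` is any callable that returns a transcript or raises on failure.
    """
    ids = list(video_ids)
    failed = 0
    for video_id in ids:
        try:
            fetch(video_id)
        except Exception:
            # In this thread the library raises TranscriptsDisabled when blocked;
            # any exception is counted as a failure here to keep the sketch simple.
            failed += 1
    if failed == 0:
        return "ok"
    # Every video failing points at the IP rather than at one video's settings.
    return "all-failed" if failed == len(ids) else "some-failed"


# Real use (needs network access and the library installed):
#   from youtube_transcript_api import YouTubeTranscriptApi
#   classify_failures(["w8rYQ40C9xo", "vJEbP2Vdq2U"], YouTubeTranscriptApi.get_transcript)
```

If the result is "all-failed" on the server but "ok" on a home connection, that is consistent with the IP being blocked rather than subtitles actually being disabled.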

SKVNDR commented 3 months ago

Hi @jdepoix, I encountered the same problem yesterday with every video I tried. Although I don't use the API frequently, I do access it a few times per day. I hope it's not some new restriction from YouTube. I experienced the same problem as @atoonk, and the issue is still present today.

Thanks a lot for your quick response and for this amazing tool; I really like it.

jdepoix commented 3 months ago

Hi @SKVNDR, then you're most definitely being blocked by YouTube. The only way to work around this is to change your IP address in any way (VPN, proxy, or assign a new IP if possible).

fleerdayo commented 3 months ago

I can confirm that YouTube is most likely blocking. =/ It works from my local dev environment but not in production, all else being equal.

alimbekovKZ commented 3 months ago

I have the same problem, even though I have never used this library before; this was just my first try in a long time.

jdepoix commented 3 months ago

If you're running your code on a cloud machine, it could be that (depending on your setup) you're getting assigned an IP from a pool that is shared with other machines. So the IP you're using could potentially be blocked without you doing anything. YouTube could also generally blacklist certain IPs that are known to belong to cloud providers (just a guess, I don't know if they actually do that!).

atoonk commented 3 months ago

Ah yes, I tried it from my laptop at home and it works fine now. And indeed, it affected all videos, which is why I thought it was a bug or new behaviour in the YT API. So I guess YouTube blocked me (this was on a DigitalOcean machine). Bummer, gotta find a way around that. Are there any docs on the rate limit numbers or on when folks get added? I only run this once every few weeks and only for a dozen videos or so, so I'm a bit surprised I was blocked. Unless it's all of DigitalOcean.

jdepoix commented 3 months ago

Since this is not an official API, there unfortunately is no information on rate limits and when or for how long you will get blocked. People have been reporting different things, so I don't feel like it is consistent either.

jdepoix commented 3 months ago

I will pin this issue and leave it open, since there are issues being opened due to this all the time. Feel free to discuss workarounds and share your experience with YouTube's blocking heuristics, but be aware that there is no proper fix here and probably never will be. That's the nature of using an unofficial API, unfortunately.

SKVNDR commented 3 months ago

Same for me. I use a droplet on DigitalOcean, and YouTube probably blocked the IP from there, but using a proxy fixed the issue...

auspy commented 3 months ago

Same for me. I use a droplet on DigitalOcean, and YouTube probably blocked the IP from there, but using a proxy fixed the issue...

How did you set up the proxy? Can you share the code? Did you use a free or a paid proxy, and how did you obtain it?

SKVNDR commented 3 months ago

Hi @auspy,

from youtube_transcript_api import YouTubeTranscriptApi  
YouTubeTranscriptApi.get_transcript(video_id, proxies={"https": "https://user:pass@domain:port"})

I'm using a paid proxy from smartproxy.com with the "Residential" offer. There are probably other better proxies available; I chose this one randomly.
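For anyone new to proxies, the snippet above can be expanded into a small helper. This is a minimal sketch under assumptions: `build_proxies`, `fetch_transcript`, and the environment variable names are hypothetical, and it assumes the proxy is reachable over plain HTTP (check your provider's docs for the correct scheme). The `proxies` keyword argument of `get_transcript` is the one already used in this thread.

```python
import os
from urllib.parse import quote


def build_proxies(host: str, port: int, username: str = "", password: str = "") -> dict:
    """Build a requests-style proxies mapping for a single proxy server."""
    credentials = ""
    if username:
        # Percent-encode credentials so characters like '@' or ':' don't break the URL.
        credentials = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    proxy_url = f"http://{credentials}{host}:{port}"
    # Route both plain and TLS traffic through the same proxy.
    return {"http": proxy_url, "https": proxy_url}


def fetch_transcript(video_id: str, proxies: dict):
    # Imported lazily so build_proxies can be used without the library installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    return YouTubeTranscriptApi.get_transcript(video_id, proxies=proxies)


# Example with hypothetical environment variable names:
#   proxies = build_proxies(os.environ["PROXY_HOST"], int(os.environ["PROXY_PORT"]),
#                           os.environ.get("PROXY_USER", ""), os.environ.get("PROXY_PASS", ""))
#   transcript = fetch_transcript("w8rYQ40C9xo", proxies)
```

Reading the credentials from the environment keeps them out of the source code, which matters if the script lives in a shared repository.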

atoonk commented 3 months ago

Confirmed, using a proxy from my droplet worked. I used https://docs.border0.com/docs/expose-a-http-proxy to proxy traffic from my DigitalOcean droplet to my local laptop. It lets you expose a proxy on localhost and have it egress on a separate machine (in my case, my laptop).

transcript = YouTubeTranscriptApi.get_transcript(video_id, proxies={"https": "http://localhost:8080"})

I can make a more detailed quick video if folks are interested in how to use that.

yourdesigncoza commented 3 months ago

Having the exact same issue, & also using DigitalOcean droplet

ZhimaoLin commented 3 months ago

Same here. Subscribed to this issue.

auspy commented 2 months ago

confirmed, using a proxy from my droplet worked. I used this to proxy traffic from my digital ocean droplet to my local laptop. https://docs.border0.com/docs/expose-a-http-proxy which will allow you to expose a proxy on localhost and have it egress on a separate machine (in my case my laptop)

transcript = YouTubeTranscriptApi.get_transcript(video_id, proxies={"https": "http://localhost:8080"})

can make a more details quick video if folks are interested in how to use that.

Sure, I would love a video on it. Drop the link here.

auspy commented 2 months ago

Hi @auspy,

from youtube_transcript_api import YouTubeTranscriptApi  
YouTubeTranscriptApi.get_transcript(video_id, proxies={"https": "https://user:pass@domain:port"})

I'm using a paid proxy from smartproxy.com with the "Residential" offer. There are probably other better proxies available; I chose this one randomly.

Thank you for sharing. This surely looks like a cheap option, but I was looking for something free. I don't want to pay in the initial stages of my project.

yourdesigncoza commented 2 months ago

confirmed, using a proxy from my droplet worked. I used this to proxy traffic from my digital ocean droplet to my local laptop. https://docs.border0.com/docs/expose-a-http-proxy which will allow you to expose a proxy on localhost and have it egress on a separate machine (in my case my laptop)

transcript = YouTubeTranscriptApi.get_transcript(video_id, proxies={"https": "http://localhost:8080"})

can make a more details quick video if folks are interested in how to use that.

sure would love a video on it. drop the link here

@auspy I would appreciate a video, or just more info. I'm all new to proxies etc., and most of the info online seems to be aimed at more experienced people.

williamtkelley commented 2 months ago

Just ran across this issue today, glad I found this thread. I too am on Digital Ocean, running my code in a Docker container. Getting transcripts runs fine locally, but not on DO.

I would appreciate the video mentioned above, as proxies are new to me. If I use my localhost as a proxy, it means I need to leave the machine running 24/7 right? I mean, I guess that's obvious.

tuganbaev commented 2 months ago

Yep, same with me. It looks like YouTube blocked many DO servers at once. I didn't make that many requests and I'm banned.

alimbekovKZ commented 2 months ago

I also use a DigitalOcean droplet; I think they block IPs from DO servers. Now I am using Google Cloud Functions.

BenjaminKobjolke commented 2 months ago

I can confirm that it is a problem with DigitalOcean servers being blocked. Using a proxy is the solution.

alimbekovKZ commented 2 months ago

Now this error also occurs in Google Cloud Functions.

ethan-0l commented 2 months ago

Blocked from dedicated OVH too

atikinkoon commented 2 months ago

Has anyone faced the same issue on PythonAnywhere?

0xRaduan commented 2 months ago

faced the same issue today in aws ec2

june-zeroxflow commented 2 months ago

same issue today on aws lambda

jhabscheid commented 2 months ago

Hetzner VPS are blocked too.

JiaShanJou commented 2 months ago

Same issue, but on Deepnote environment. Does anyone know how to change Deepnote's IP address or something about changing proxies in Deepnote? When I run the same code on my local environment, the transcription works fine. However, it seems like YouTube is blocking Deepnote's IP.

meera commented 2 months ago

@danielsanmartin I see that you have forked the repository to avoid the IP ban. Can you write instructions on how to download the CA_BUNDLE?

Yuoter commented 2 months ago

I have also had the same issue since yesterday. Does anyone know whether youtube_transcript_api uses any intermediate servers if proxies are not set explicitly?

I ask because my case is very strange. youtube-transcript-api works well locally, both with and without proxies, but when I set it up on PythonAnywhere, it stopped working, without proxies and even with them.

At the same time, when I make a direct request with the requests library to the public YouTube endpoint https://www.youtube.com/api/timedtext, using parameters that I extract via the Network tab of the Chrome developer tools, it works both locally and on PythonAnywhere, with and without proxies.

What could the issue be? Might it be related to the headers that youtube_transcript_api generates?

danielsanmartin commented 2 months ago

Hi @meera. I used a proxy from zyte.com to make calls through the YouTubeTranscriptApi class. Since it works via HTTPS, I needed to use the certificate. So far this is working. Using my fork, the code is:

from youtube_transcript_api import YouTubeTranscriptApi

YouTubeTranscriptApi.get_transcript(
    'video id',
    ['language'],
    proxies={
        "http": "http://zyte-api-key:@api.zyte.com:8011/",
        "https": "http://zyte-api-key:@api.zyte.com:8011/",
    },
    verify='/path/to/zyte-ca.crt',  # verify is accepted by the fork mentioned above
)

nick-barth commented 2 months ago

Same issue, but on Deepnote environment. Does anyone know how to change Deepnote's IP address or something about changing proxies in Deepnote? When I run the same code on my local environment, the transcription works fine. However, it seems like YouTube is blocking Deepnote's IP.

This may help you.

udede11 commented 2 months ago

Same problem. Not sure how it is possible but I can also confirm that the proxy that worked locally isn't working on the AWS Lambda function. I used brightdata isp one.

Praneeth-Pike commented 2 months ago

I deployed on Render. Same problem here. Proxying through smart proxy fixed it. Although smart proxy residential plan starts at $7 per GB.

GnAndradas commented 2 months ago

Ho @meera. I used a proxy call from zyte.com to make calls to the YouTubeTranscriptApi class. Since it works via https, I needed to use the certificate. So far this is working. Using my fork, the code is:

from youtube_transcript_api import YouTubeTranscriptApi

YouTubeTranscriptApi.get_transcript('video id', ['language'], proxies={"http": "http://zyte-api-key:@api.zyte.com:8011/","https": "http://zyte-api-key:@api.zyte.com:8011/",}, verify='/path/to/zyte-ca.crt')

Bro, is this still working now?

Joe-hitthecode commented 2 months ago

Yeah, just use a proxy, guys. I solved the problem using a NodeMaven proxy (https://nodemaven.com/), like this:

proxy_url = f'http://{username}:{password}@{proxy_host}:{proxy_port}'
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=lang, proxies={'http': proxy_url, 'https': proxy_url})

ghost commented 2 months ago

Confirmed: I get "Subtitles are disabled for this video" as the response when using Render.

GnAndradas commented 2 months ago

The best and fully functional option is this one: https://rapidapi.com/DataFanatic/api/youtube-media-downloader, it's also highly scalable and very fast. The free version only allows about 150 calls per month to their API, but it's useful. Additionally, it has been tested on two different hosting providers, Heroku and PythonAnywhere. Next, I need to test the proxy setup, but I haven't gotten that far yet.

meera commented 2 months ago

Has anyone tried the YouTube Data API to retrieve YouTube subtitles? What are the pros and cons? There is a daily quota of 1000. How many requests can you fit into the daily quota?

jamesflores commented 2 months ago

I experienced the same issue, looks like YouTube is blocking IPs. Mine is in AWS EC2. I have a Cloudflare Worker that does the job for now: https://github.com/jamesflores/youtube-subtitles-worker

AniketModi commented 2 months ago

Hi, we have also started facing this issue at our company, where our instance is running on AWS EKS. Is there a way to reach out to YT to get the IP unblocked?

OpeyemiSanusi commented 2 months ago

Hi, We have also started facing issue in our company where instance is running on aws eks. Can there be a way to reach out to YT to get ip unblocked ?

@AniketModi that's funny! So essentially you'd be telling YouTube: "Hey, I was scraping data, which is against your policy. You blocked the IP and I want you to unblock it so I can do some more." It's not possible; they wouldn't even look at it. The library works by scraping the YouTube video page for the transcript, which is against their policy. They don't even want you interacting with their videos except through YT.

atlas-comstock commented 2 months ago

Same issue, YouTube blocks the IP.

jdepoix commented 2 months ago

Hi all, I am trying to improve the error message on this a bit, but to be able to do so I need to know the exact response that YouTube sends in such cases. Since I currently don't have a machine with a blocked IP available to try this on, it would be great if someone could run the following (on a machine with a blocked IP) and upload the resulting dump.html somewhere for me?

import requests
from youtube_transcript_api._transcripts import TranscriptListFetcher
html = TranscriptListFetcher(requests.Session())._fetch_html("vJEbP2Vdq2U")
with open("dump.html", "w+") as fp:
    fp.write(html)

Thank you! 🙏

iamscottweber commented 2 months ago

Ran your code on my AWS server. I'm having the same issue. I've uploaded the dump.html here

jdepoix commented 2 months ago

Thanks a lot @iamscottweber!

Interestingly, when rendering the HTML dump it includes the error message "Sign in to confirm you're not a bot". So this means that you might actually be able to continue scraping if you're signed in! You can do authenticated requests using Cookies, as explained in the README.

Could maybe someone who's currently blocked give this a try and see whether this allows them to continue scraping?

(Please note that I don't know if YouTube will ban your account at some point if you scrape too much, so it might be better to do this with an account you don't care about, just to be on the safe side)
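To make the observation above concrete, here is a rough sketch of checking a saved dump.html for the sign-in prompt and then retrying with cookies. `looks_like_bot_check` and `fetch_with_cookies` are hypothetical helper names; the exact markup YouTube returns is an assumption (the apostrophe may be typographic or HTML-escaped, so both are normalised). The `cookies` argument, a path to a cookies.txt file, is the one visible in the 0.6.2 traceback earlier in this thread.

```python
BOT_CHECK_MARKER = "Sign in to confirm you're not a bot"


def looks_like_bot_check(html: str) -> bool:
    """Return True if the watch-page HTML contains the sign-in prompt."""
    # Normalise typographic and HTML-escaped apostrophes before matching,
    # since the exact form in the response is not known for certain.
    normalised = html.replace("\u2019", "'").replace("&#39;", "'")
    return BOT_CHECK_MARKER.lower() in normalised.lower()


def fetch_with_cookies(video_id: str, cookie_file: str):
    """Retry the request as a signed-in user (cookie_file is a cookies.txt path)."""
    # Imported lazily so looks_like_bot_check works without the library installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    return YouTubeTranscriptApi.get_transcript(video_id, cookies=cookie_file)


# Example, assuming dump.html was produced by the snippet earlier in this thread:
#   with open("dump.html", encoding="utf-8") as fp:
#       if looks_like_bot_check(fp.read()):
#           transcript = fetch_with_cookies("vJEbP2Vdq2U", "/path/to/cookies.txt")
```

As noted above, whether signed-in scraping gets an account banned is unknown, so a throwaway account is the safer choice for testing this.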

0xRaduan commented 2 months ago

@jdepoix

Just curious - do you know what's the age limit for cookies?

OpeyemiSanusi commented 2 months ago

Thanks a lot @iamscottweber!

Interestingly, when rendering the HTML dump it includes the error message "Sign in to confirm you're not a bot". So this means that you might actually be able to continue scraping if you're signed in! You can do authenticated requests using Cookies, as explained in the README.

Could maybe someone who's currently blocked give this a try and see whether this allows them to continue scraping?

(Please note that I don't know if YouTube will ban your account at some point if you scrape too much, so it might be better to do this with an account you don't care about, just to be on the safe side)

Okay! this is pretty interesting.

Since the library is basically just a scraper, they put up a sign-in block. However, using cookies puts you at risk of losing your account. If you want to go that route, you can build your own scraper or, better still, create a scraper that uses one of the free transcription websites.

I also have an automation that I am selling for a small fee; you can make a simple request to it and it will return the transcript. If anyone is interested, let me know.