Open atoonk opened 3 months ago
@jdepoix
Just curious - do you know how long these cookies typically remain valid?
There is no right answer, to be honest. It depends on YouTube (or the site); the cookies could last anywhere from a day to much longer. Just watch your program closely, or build a failsafe around such failures.
I have it working with a secure proxy without cookies and the fork that @danielsanmartin provided. I'll watch it and possibly incorporate cookies.
I experienced the same issue, looks like YouTube is blocking IPs. Mine is in AWS EC2. I have a Cloudflare Worker that does the job for now: https://github.com/jamesflores/youtube-subtitles-worker
@jamesflores I thought about doing the same. I am worried that I will get my Cloudflare account in trouble. We might run into the same issue a few months down the line when YouTube blocks the Cloudflare worker IP address. What do you think?
My code using YouTubeTranscriptApi works locally but fails on the server with this error: "Failed to retrieve transcript: Subtitles are disabled for this video". I've confirmed the subtitles are available and have the same library version in both environments. I also had trouble with proxy settings. Any suggestions or solutions would be appreciated!
Thanks for creating this thread, I was tearing my hair out to figure out what the hell happened to my AWS lambda function!
Following the suggestions above, in prod I tried using a proxy when fetching the transcript (caveat: an HTTP proxy), but I still get the same "TranscriptsDisabled" error.
Locally, it works fine. Any idea what this could be about? How is it technically possible that YouTube bans a proxy IP address when used from AWS servers but not locally?
Appreciate any insight you guys might have
I had exactly the same issue in prod with Lambda functions. It could be something that Lambda adds to the headers that gives away the request is coming from an AWS Lambda, and they might have banned all Lambdas based on that. But this is pure speculation; I haven't tested it yet.
I used the RapidAPI solution presented above. Works well for now 👍
Thanks a lot @iamscottweber!
Interestingly, when rendering the HTML dump it includes the error message "Sign in to confirm you're not a bot". So this means that you might actually be able to continue scraping if you're signed in! You can do authenticated requests using Cookies, as explained in the README.
Could maybe someone who's currently blocked give this a try and see whether this allows them to continue scraping?
(Please note that I don't know if YouTube will ban your account at some point if you scrape too much, so it might be better to do this with an account you don't care about, just to be on the safe side)
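For anyone who wants to try this, here is a rough sketch of the cookie mechanism, assuming the README's approach of exporting browser cookies to a Netscape-format `cookies.txt` file. The cookie name and value below are placeholders; in practice you would export real cookies from a signed-in browser session.

```python
# Sketch: build and validate a Netscape-format cookies.txt file locally.
# LOGIN_INFO and "placeholder-token" are placeholders, not real credentials.
import http.cookiejar
import os
import tempfile

COOKIE_LINE = "\t".join([
    ".youtube.com",        # domain
    "TRUE",                # include subdomains
    "/",                   # path
    "TRUE",                # secure
    "1893456000",          # expiry (unix timestamp, year 2030)
    "LOGIN_INFO",          # cookie name (placeholder)
    "placeholder-token",   # cookie value (placeholder)
])

cookie_path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
with open(cookie_path, "w") as f:
    f.write("# Netscape HTTP Cookie File\n" + COOKIE_LINE + "\n")

# Verify the file parses as a valid cookie jar before handing it to the API
jar = http.cookiejar.MozillaCookieJar(cookie_path)
jar.load(ignore_discard=True, ignore_expires=True)

# Per the README, the library then takes the *path* to this file, e.g.:
# YouTubeTranscriptApi.get_transcript(video_id, cookies=cookie_path)
```

If the file fails to load into a `MozillaCookieJar`, it will also fail inside the library, so this is a cheap sanity check before making requests.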
@jdepoix
I ran your code on my AWS EC2 Linux instance and generated a dump.html.
I then opened the HTML on my local workstation and got the "sign in" message. Before I signed in, the transcript was not accessible; after signing in, it was. So in essence that part of the test worked. I then wanted to try your cookie methodology, but the extension listed in your README is no longer available for Chrome. Is there another extension you would recommend, or another way to get the cookie info, so I can continue testing the methodology?
I'd like to understand how the YouTubeTranscript website obtains transcript support, as it must be deployed somewhere to provide this functionality.
@udede11 The youtube rapidapi website only supports up to 150 requests per month, so I need a permanent solution for obtaining transcripts consistently.
I have been running the YouTube transcript API for my startup for months. We solved the "transcripts disabled" problem a long time ago and wrote an in-house script that makes sure it never breaks on us. If you are interested in this solution, you can reach out to me at joeslamie@gmail.com.
Sent you an email!
Hi @Joe-hitthecode,
Just a quick note—I’ve emailed you at joeslamie@gmail.com regarding the YouTube Transcript API solution. Looking forward to your response!
Can you please email me as well? I am also using the same API and facing the same issue when running the code on AWS EKS, though it works fine locally.
lol Meera, it is not a scam. This is my LinkedIn: https://www.linkedin.com/in/joe-georgeo-slamie-413b7a170/. I am also busy, so I can't reply as fast as you might need.
Can you reply to me, @Joe-hitthecode?
I have a lot of emails. I am preparing a general response for everyone, so that I don't have to reply to people individually. Part of the solution involves first trying to use a proxy, which I mentioned here before.
@Joe-hitthecode You are going to post the solution here?
I am going off my system for a while, so let me just go over how we are managing this expected problem. Before I go into it, I want to point out that the API we are using is basically doing web scraping, and solutions like that are bound to be troublesome. The biggest way we handled it was to use a proxy, but in a more skillful way. Before doing anything else, try a proxy that is easy to configure, like https://nodemaven.com/. After you get a practical understanding of how to configure a proxy, you can move on to using a free proxy like https://www.croxyproxy.com/, which is a little more complex. If you are successful with these steps, use the domain you get from your proxy provider like this:
```python
proxy_url = f'http://{username}:{password}@{proxy_host}:{proxy_port}'
transcript = YouTubeTranscriptApi.get_transcript(
    video_id, languages=lang,
    proxies={'http': proxy_url, 'https': proxy_url}
)
```
If you clone the repository and look at the source code, you will see that the error is raised in the _errors.py file, specifically via this error class:

```python
class TranscriptsDisabled(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = 'Subtitles are disabled for this video'
```

When you call the .get_transcript method, it creates a TranscriptListFetcher instance and calls its .fetch method, shown below from the source code:
```python
class TranscriptListFetcher(object):
    def __init__(self, http_client):
        self._http_client = http_client

    def fetch(self, video_id):
        return TranscriptList.build(
            self._http_client,
            video_id,
            self._extract_captions_json(self._fetch_video_html(video_id), video_id),
        )

    def _extract_captions_json(self, html, video_id):
        splitted_html = html.split('"captions":')
        if len(splitted_html) <= 1:
            if video_id.startswith('http://') or video_id.startswith('https://'):
                raise InvalidVideoId(video_id)
            if 'class="g-recaptcha"' in html:
                raise TooManyRequests(video_id)
            if '"playabilityStatus":' not in html:
                raise VideoUnavailable(video_id)
            raise TranscriptsDisabled(video_id)
        captions_json = json.loads(
            splitted_html[1].split(',"videoDetails')[0].replace('\n', '')
        ).get('playerCaptionsTracklistRenderer')
        if captions_json is None:
            raise TranscriptsDisabled(video_id)
        if 'captionTracks' not in captions_json:
            raise NoTranscriptAvailable(video_id)
        return captions_json
```
The .fetch method calls self._extract_captions_json, and that is where the error is raised. The first three conditions are checked, and when none of them match, the code falls through to raising TranscriptsDisabled, which in some cases happens for reasons other than transcripts actually being disabled. What is really happening here is that your IP is blocked. So there is no magic bullet or secret sauce: you have to use a proxy or some kind of VPN service that hides your IP before fetching the transcript data.
ps: My Backend is running on pythonanywhere.com
@satyajit-bagchi you'll have to proxy your https requests, since the YouTube requests are done using https. Setting up a proxy for http won't do anything.
@Joe-hitthecode I'll try to add a more explicit exception for this type of error when I find time to do so. This should allow for catching them and falling back to a proxy or rotating through a pool of IPs as they get banned.
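Until such an exception exists, the fallback idea can be sketched by hand. This is only an illustration: the proxy URLs are placeholders, and catching bare `Exception` stands in for whatever more specific "blocked" exception the library may raise.

```python
from itertools import cycle

# Placeholder pool: first attempt without a proxy, then rotate through proxies
PROXY_POOL = cycle([
    None,
    {"https": "socks5://user:pass@proxy1.example.com:1080"},
    {"https": "socks5://user:pass@proxy2.example.com:1080"},
])

def fetch_with_rotation(fetch, video_id, attempts=3):
    """Retry `fetch`, switching to the next proxy in the pool on each failure."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(video_id, proxies=next(PROXY_POOL))
        except Exception as exc:  # ideally: the library's "blocked" exception
            last_error = exc
    raise last_error

# Demo with a fake fetcher that only succeeds once a proxy is supplied
calls = []
def fake_fetch(video_id, proxies=None):
    calls.append(proxies)
    if proxies is None:
        raise RuntimeError("IP blocked")
    return [{"text": "hello", "start": 0.0}]

result = fetch_with_rotation(fake_fetch, "dQw4w9WgXcQ")
```

In real use, `fetch` would wrap `YouTubeTranscriptApi.get_transcript`, and the pool would hold your actual proxy credentials.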
Thanks for pointing that out @jdepoix :). I now have it working for me on the cloud
To everyone else: for those in the starting stages of their projects, Webshare offers 10 free proxies with the SOCKS5 protocol. You can use a SOCKS5 proxy out of the box with youtube-transcript-api; just pass it to the proxies dict: https://stackoverflow.com/questions/12601316/how-to-make-python-requests-work-via-socks-proxy https://github.com/jdepoix/youtube-transcript-api?tab=readme-ov-file#proxy
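For reference, a minimal version of that setup might look like this. The credentials and host are placeholders (real values come from your proxy provider's dashboard), and `requests` needs pysocks installed for SOCKS support.

```python
# Build a proxies dict for a SOCKS5 proxy; all credentials are placeholders.
def build_socks5_proxies(user: str, password: str, host: str, port: int) -> dict:
    url = f"socks5://{user}:{password}@{host}:{port}"
    # youtube-transcript-api passes this dict straight through to requests
    return {"http": url, "https": url}

proxies = build_socks5_proxies("user", "pass", "proxy.example.com", 1080)
# YouTubeTranscriptApi.get_transcript(video_id, proxies=proxies)
```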
I am facing the same issue. It is not working in the cloud server.
You need to sign up with a proxy provider to fix this.
If you are using the SOCKS5 protocol, make sure to install pysocks. pip install pysocks and you are good to go
Excellent, worked.
Though I am still evaluating using the captions API with OAuth directly: https://developers.google.com/resources/api-libraries/documentation/youtube/v3/python/latest/youtube_v3.captions.html
@satyajit-bagchi it's not working on Azure.
I ran into this on google cloud run. I tried dataimpulse with a $1/GB pay-as-you-go plan and it worked for me:
```python
transcript = YouTubeTranscriptApi.get_transcript(
    video_id,
    proxies={"https": f"https://{dataimpulse_login}:{dataimpulse_password}@gw.dataimpulse.com:823"}
)
```
It's not 100% reliable, so I added a failover path to the Smartproxy option mentioned above by @SKVNDR.
We've noticed this issue cropping up more in the past week, but interestingly it is not happening for all videos. Sometimes the same video will fail and then succeed. Are others seeing this behavior? Does that still indicate that YouTube is blocking/rate-limiting?
Yes, I've noticed similar behavior where the same video is blocked even with a proxy. It will sometimes fail and other times work as expected. But adding the proxy has helped immensely, and I'm assuming the failures happen to land on an IP address that is already rate-limited.
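Given that the failures seem intermittent, a simple retry with exponential backoff may be enough to paper over the rate-limited attempts. This is just a sketch; the fetch callable and the broad exception handling are placeholders to adapt to your setup.

```python
import time

def fetch_with_retry(fetch, video_id, retries=4, base_delay=1.0):
    """Retry an intermittently failing fetch, backing off exponentially."""
    for attempt in range(retries):
        try:
            return fetch(video_id)
        except Exception:  # ideally only the rate-limit/blocked exception
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo with a fake fetcher that fails twice before succeeding
state = {"calls": 0}
def flaky_fetch(video_id):
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("rate limited")
    return "transcript"

result = fetch_with_retry(flaky_fetch, "abc123", base_delay=0)
```

In real use, `fetch` would wrap `YouTubeTranscriptApi.get_transcript` with your proxy configuration.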
Yup, that's exactly the problem I noticed. I'm also having problems on AWS. YouTube does return something, even including the videoDetails prop, but it is not complete: the crucial part, playerCaptionsTracklistRenderer, is missing.
So I guess the only solution right now is using a proxy/VPN/dynamic IP, or the official YT API.
I use yt-dlp as a failover to download the audio, then send it to my Whisper server for transcription. It was failing as well until I added a proxy to it. It appears YouTube has blacklisted all known IP ranges from these providers.
Thanks guys, the proxy works. Should I implement proxy-rotation logic?
I made a Webshare account to get access to SOCKS5 proxies. I have tried several of their proxies, and pip installed and imported pysocks as @Joe-hitthecode suggested, but I keep getting the following exception:
NOTE: I am running this all locally, not in cloud.
Exception message: SOCKSHTTPSConnectionPool(host='www.youtube.com', port=443): Max retries exceeded with url: /watch?v=kvlWtA136FM (Caused by NewConnectionError('<urllib3.contrib.socks.SOCKSHTTPSConnection object at 0x000001F1FFCEABA0>: Failed to establish a new connection: 0x04: Host unreachable'))
Is anybody else experiencing this, or does anyone know a solution?
@aj-bei I think you're not providing your Webshare credentials in the proxies dict:

```python
proxies = {
    'http': 'socks5://user:pass@proxy_ip:proxy_port',
    'https': 'socks5://user:pass@proxy_ip:proxy_port'
}

# Define a function to fetch transcripts using proxies
def generate_video_transcript(video_id):
    try:
        # Fetch transcript with the YouTubeTranscriptApi
        return YouTubeTranscriptApi.get_transcript(video_id, proxies=proxies)
    except Exception as e:
        print(f'Failed to fetch transcript: {e}')
```

If you're already doing that, try a different proxy IP from your proxy list, one whose status is "working".
What is a proxy? How do I use one on an Ubuntu machine?
Just ran across this issue today, glad I found this thread. I too am on Digital Ocean, running my code in a Docker container. Getting transcripts runs fine locally, but not on DO.
I would appreciate the video mentioned above, as proxies are new to me. If I use my localhost as a proxy, that means I need to leave the machine running 24/7, right? I guess that's obvious.
Did you create the Docker container yourself? I am interested in running this from my home server.
I am not even running on the cloud and all my proxies still don't work
If you're not using it in the cloud, you should be able to get transcripts directly when running locally, without any problem.
Hello @jdepoix, I am using Webshare SOCKS5 proxies:

```python
proxies = {
    'http': f'socks5://{username}:{password}@{host}:{port}',
    'https': f'socks5://{username}:{password}@{host}:{port}'
}
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id, proxies=proxies)
```

I am still getting this error on AWS Lambda:

```
ERROR - SOCKSHTTPSConnectionPool(host='www.youtube.com', port=443): Max retries exceeded with url: /watch?v=WTOm65IZneg (Caused by NewConnectionError('<urllib3.contrib.socks.SOCKSHTTPSConnection object at 0x7fe5786f6510>: Failed to establish a new connection: All offered SOCKS5 authentication methods were rejected')) - WTOm65IZneg
```
I use a tool that can download transcripts one by one from any channel, with this script built in. It works for some time and then stops, but right after I switch countries on my paid VPN it works again; then it stops loading again until I switch to another country, and so on. After some time, it starts working with the same IPs for a limited number of videos again. So if it does not work for you, make sure your proxy/VPN is actually working well.
If you encounter the SOCKSHTTPSConnectionPool error, try socks5h instead of socks5. It worked for me. https://stackoverflow.com/questions/12601316/how-to-make-python-requests-work-via-socks-proxy
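For context: with `socks5://` the hostname is resolved locally, while `socks5h://` asks the proxy to resolve it, which avoids local DNS lookups that can fail or leak your location. A small helper to rewrite existing proxy URLs (the URL below is a placeholder):

```python
def to_remote_dns(proxy_url: str) -> str:
    """Rewrite socks5:// to socks5h:// so DNS resolution happens on the proxy."""
    if proxy_url.startswith("socks5://"):
        return "socks5h://" + proxy_url[len("socks5://"):]
    return proxy_url

proxies = {scheme: to_remote_dns("socks5://user:pass@proxy.example.com:1080")
           for scheme in ("http", "https")}
```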
I use a VPN; it works for a while, but then I get the same "subtitles are disabled for this video" error again. I need transcripts for hundreds of videos.
@corngk is it a one-time thing or continuous? I have an automation that runs; all you have to do is make a request to the link. Send me an email at opeyemisanusi@gmail.com.
To get this working reliably in production, adding a proxy layer is essential. If you need help, feel free to reach out: https://linktr.ee/clearcode
I am hoping this won't get misused, but here is a working solution for free: https://gist.github.com/Ashes47/f03d8f8dfd024783a8a34ba34141d6ec
Hi all, I've been facing the same problem. I'm using an AWS Lambda to get transcriptions and the response is always the same: no subtitles for this video. I'll apply a residential proxy from Smartproxy and report the results here afterwards.
It worked like a charm!
Can this solution be built into the tool itself as the default? @jdepoix
[Solved] I am using Digital Ocean and the problem still exists.
Edit: I set up Tor and it works now.
```python
import platform

# general Tor proxies
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050'
}

video_id = self.extract_video_id(url)
# I did not want to set up Tor on my local Windows machine
if platform.system() == 'Linux':
    transcript = YouTubeTranscriptApi.get_transcript(video_id, proxies=proxies)
else:
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
```
```shell
pip install pysocks
sudo apt update
sudo apt install tor
sudo service tor start
```
Problems with using TOR
To Reproduce
using youtube-transcript-api-0.6.2:
outputs:
What code / cli command are you executing?
I am running
Which Python version are you using?
Python 3.11.6
Which version of youtube-transcript-api are you using?
youtube-transcript-api-0.6.2
Expected behavior
I expected to receive the English transcript; I can see it in the browser, see screenshot:
Actual behaviour