Problem

Using a headless browser to fetch transcripts is slow and CPU-intensive. It's fine for prototyping but not ideal for production.

Proposed solution

This package seems promising because it retrieves YouTube transcripts without a headless browser:

https://pypi.org/project/youtube-transcript-api/

Progress

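I tried it on a video that we know has a transcript, and it fails with CouldNotRetrieveTranscript. The first call in the output below omitted the quotes around the video ID by mistake (hence the NameError); the second call is the real attempt: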

from youtube_transcript_api import YouTubeTranscriptApi
YouTubeTranscriptApi.get_transcript(mbyG85GZ0PI)
Traceback (most recent call last):
  File "", line 1, in
NameError: name 'mbyG85GZ0PI' is not defined
YouTubeTranscriptApi.get_transcript('mbyG85GZ0PI')
Could not get the transcript for the video https://www.youtube.com/watch?v=mbyG85GZ0PI! This usually happens if one of the following things is the case:
- subtitles have been disabled by the uploader
- none of the language codes you provided are valid
- none of the languages you provided are supported by the video
- the video is no longer available.

If none of these things is the case, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues
Traceback (most recent call last):
  File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 98, in get_transcript
    return _TranscriptParser(_TranscriptFetcher(video_id, languages, proxies).fetch()).parse()
  File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 163, in parse
    for xml_element in ElementTree.fromstring(self.plain_data)
  File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 163, in
    for xml_element in ElementTree.fromstring(self.plain_data)
  File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/html/__init__.py", line 130, in unescape
    if '&' not in s:
TypeError: argument of type 'NoneType' is not iterable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 105, in get_transcript
    raise YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)
youtube_transcript_api._api.CouldNotRetrieveTranscript: Could not get the transcript for the video https://www.youtube.com/watch?v=mbyG85GZ0PI! This usually happens if one of the following things is the case:
- subtitles have been disabled by the uploader
- none of the language codes you provided are valid
- none of the languages you provided are supported by the video
- the video is no longer available.

If none of these things is the case, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues
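The first traceback ends in html.unescape being handed None, which would happen if one of the transcript's XML elements has no text. The sketch below tries to reproduce that pattern in isolation. It is a guess at the root cause based only on the traceback: SAMPLE_XML is invented for illustration (not real YouTube output), and parse_strict / parse_tolerant are hypothetical helpers, not functions from youtube-transcript-api.

```python
from html import unescape
from xml.etree import ElementTree

# Invented sample payload (assumption): the second <text> element is empty,
# so ElementTree reports its .text attribute as None.
SAMPLE_XML = (
    '<transcript>'
    '<text start="0" dur="1.5">It&amp;#39;s on</text>'
    '<text start="1.5" dur="2"/>'
    '</transcript>'
)


def parse_strict(xml_data):
    """Calls unescape() on every element's text, as in the traceback."""
    return [unescape(el.text) for el in ElementTree.fromstring(xml_data)]


def parse_tolerant(xml_data):
    """Treats a missing text as an empty string instead of crashing."""
    return [unescape(el.text or "") for el in ElementTree.fromstring(xml_data)]
```

With this sample, parse_strict raises a TypeError on the empty element, while parse_tolerant returns one string per element. If the real transcript for mbyG85GZ0PI contains an empty entry, that would explain why a video with a valid transcript still fails.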

Next steps

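If it doesn't work, it's no use to us. That said, the underlying error is a TypeError inside the library's parser (html.unescape received None), which suggests the transcript was downloaded but contains an entry with no text. That would make this a library bug rather than a missing transcript, so it may be worth reporting upstream. We can keep an eye on the library while sticking with the headless browser for now.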
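If we do want to try the library in parallel with the current approach, one option is a thin wrapper that calls the fast fetcher first and falls back to the headless browser when it raises. This is only a sketch: fetch_transcript is a hypothetical helper, and both fetchers are passed in as plain callables so nothing here depends on a particular library version.

```python
import logging

logger = logging.getLogger(__name__)


def fetch_transcript(video_id, primary, fallback):
    """Try the fast fetcher first; use the slow one if it raises.

    `primary` and `fallback` are any callables that take a video ID and
    return a transcript (hypothetical interface, chosen for illustration).
    """
    try:
        return primary(video_id)
    except Exception as exc:
        # Broad on purpose: the library raises its own exception type,
        # and any failure should send us to the fallback.
        logger.warning("Primary transcript fetch failed for %s: %s", video_id, exc)
        return fallback(video_id)
```

Usage would look like fetch_transcript('mbyG85GZ0PI', YouTubeTranscriptApi.get_transcript, scrape_with_headless_browser), where scrape_with_headless_browser stands in for our existing code and is not a real function name. The logged warnings would also tell us how often the library fails in practice.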

sahanbull / x5learn

YouTube Transcript API #96
