Open stefankreitmayer opened 5 years ago
It works for most videos. Of the many Caltech videos I tried, the only ones it failed on were from the Caltech machine learning series.
It worked for every other Caltech video and every other video I tried (about 20 other videos).
An alternative would be to fall back to the headless browser whenever this transcriber fails, and to report the failure as an issue on the library's GitHub. This would let us get transcripts for the majority of YouTube material efficiently and rely on the headless browser only for the minority.
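The fallback idea could look roughly like the sketch below. The function name `fetch_transcript` and the two callables are hypothetical; in this project `primary` would presumably wrap the library call and `fallback` the existing headless-browser scraper, but neither is shown here.

```python
import logging

logger = logging.getLogger(__name__)


def fetch_transcript(video_id, primary, fallback):
    """Try the fast fetcher first; use the slow one only if it raises.

    `primary` and `fallback` are hypothetical callables that take a
    video id and return a transcript. Returns (transcript, source),
    where source is "primary" or "fallback".
    """
    try:
        return primary(video_id), "primary"
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        logger.warning("primary transcriber failed for %s: %r", video_id, exc)
        return fallback(video_id), "fallback"
```

Returning the source alongside the transcript makes it easy to count how often the fallback is needed, which is also the list of videos worth reporting upstream.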
Problem
Using a headless browser is slow and CPU-intensive. It's fine for prototyping but not ideal for production.
Proposed solution
This seems promising
https://pypi.org/project/youtube-transcript-api/
Progress
I've tried it out on a video that we know has a transcript. It fails. See output below:
If none of these things is the case, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues
Traceback (most recent call last):
File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 98, in get_transcript
return _TranscriptParser(_TranscriptFetcher(video_id, languages, proxies).fetch()).parse()
File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 163, in parse
for xml_element in ElementTree.fromstring(self.plain_data)
File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 163, in
for xml_element in ElementTree.fromstring(self.plain_data)
File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/html/__init__.py", line 130, in unescape
if '&' not in s:
TypeError: argument of type 'NoneType' is not iterable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "", line 1, in
File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 105, in get_transcript
raise YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)
youtube_transcript_api._api.CouldNotRetrieveTranscript: Could not get the transcript for the video https://www.youtube.com/watch?v=mbyG85GZ0PI! This usually happens if one of the following things is the case:
If none of these things is the case, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues
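The `TypeError` raised inside `unescape` suggests that one of the parsed XML elements had no text, so `None` reached `html.unescape`. This is only a guess from the traceback. A minimal sketch of that failure mode and of a more lenient parse follows; the caption XML shape in `SAMPLE` and the helper name `parse_lenient` are assumptions made for illustration, not the library's actual format or code.

```python
import html
from xml.etree import ElementTree

# Hypothetical caption XML: the second element is empty, so its .text is None.
SAMPLE = (
    '<transcript>'
    '<text start="0" dur="1.5">a &amp; b</text>'
    '<text start="1.5" dur="2"/>'
    '</transcript>'
)


def parse_lenient(xml_data):
    """Parse caption XML, treating elements without text as empty strings."""
    return [
        {
            # `or ""` guards against .text being None for empty elements
            "text": html.unescape(element.text or ""),
            "start": float(element.attrib["start"]),
            "duration": float(element.attrib["dur"]),
        }
        for element in ElementTree.fromstring(xml_data)
    ]
```

Calling `html.unescape(None)` directly raises the same kind of `TypeError` as in the output above, whereas `parse_lenient(SAMPLE)` returns two entries, the second with empty text.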
Next steps
If it doesn't work, it's no use. We may still keep an eye on it while sticking with the headless browser for now.