sahanbull / x5learn

Web frontend for X5GON project
Apache License 2.0

YouTube Transcript API #96

Open stefankreitmayer opened 5 years ago

stefankreitmayer commented 5 years ago

Problem

Using a headless browser is slow and CPU-intensive. It's fine for prototyping but not ideal for production.

Proposed solution

This library seems promising:

https://pypi.org/project/youtube-transcript-api/

Progress

I've tried it out on a video that we know has a transcript. It fails. See output below:

>>> from youtube_transcript_api import YouTubeTranscriptApi
>>> YouTubeTranscriptApi.get_transcript(mbyG85GZ0PI)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'mbyG85GZ0PI' is not defined
>>> YouTubeTranscriptApi.get_transcript('mbyG85GZ0PI')
Could not get the transcript for the video https://www.youtube.com/watch?v=mbyG85GZ0PI! This usually happens if one of the following things is the case:

  • subtitles have been disabled by the uploader
  • none of the language codes you provided are valid
  • none of the languages you provided are supported by the video
  • the video is no longer available.

If none of these things is the case, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues
Traceback (most recent call last):
  File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 98, in get_transcript
    return _TranscriptParser(_TranscriptFetcher(video_id, languages, proxies).fetch()).parse()
  File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 163, in parse
    for xml_element in ElementTree.fromstring(self.plain_data)
  File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 163, in <listcomp>
    for xml_element in ElementTree.fromstring(self.plain_data)
  File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/html/__init__.py", line 130, in unescape
    if '&' not in s:
TypeError: argument of type 'NoneType' is not iterable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/localadmin/anaconda2/envs/x5flaskheroku/lib/python3.5/site-packages/youtube_transcript_api/_api.py", line 105, in get_transcript
    raise YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)
youtube_transcript_api._api.CouldNotRetrieveTranscript: Could not get the transcript for the video https://www.youtube.com/watch?v=mbyG85GZ0PI! This usually happens if one of the following things is the case:

If none of these things is the case, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues

Next steps

If it doesn't work, it's no use. We may still keep an eye on it while sticking with the headless browser for now.

sahanbull commented 5 years ago

It works for the majority of videos. Among the many Caltech videos I tried, the only ones it failed on were those in the Caltech machine learning series.

It worked for every other random Caltech video and every other video I tried (about 20 other videos).

An alternative would be to fall back to the headless browser whenever this library fails; we can also report the failing video as an issue on the library's GitHub. That would let us get transcripts for the majority of YouTube materials efficiently and rely on the headless browser only for the minority.
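A rough sketch of that fallback, not a finished implementation. The names `fetch_transcript_with_fallback` and `fetch_via_api` are hypothetical, and the headless-browser scraper is assumed to be whatever callable we already have. `fetch_via_api` assumes the `get_transcript` call used earlier in this thread, which returned a list of segments with a `text` field in the versions I have seen; newer releases of the library may expose a different interface. It catches `Exception` broadly on purpose, since the exact exception class the library raises is not something to rely on here.

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger(__name__)


def fetch_transcript_with_fallback(
    video_id: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the cheap fetcher first; use the slow one only if it fails.

    Returns (transcript, source), where source is "api" or "headless",
    so we can log how often the fallback is actually needed.
    """
    try:
        transcript = primary(video_id)
        if transcript:
            return transcript, "api"
        logger.warning("Empty transcript from API for %s", video_id)
    except Exception as exc:  # deliberately broad, see note above
        logger.warning("API fetch failed for %s: %s", video_id, exc)
    return fallback(video_id), "headless"


def fetch_via_api(video_id: str) -> str:
    """Hypothetical wrapper around youtube-transcript-api."""
    # Imported lazily so this module still loads if the package is missing.
    from youtube_transcript_api import YouTubeTranscriptApi

    # Assumption: get_transcript returns a list of dicts with a "text" key.
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(segment["text"] for segment in segments)
```

Usage would look something like `fetch_transcript_with_fallback('mbyG85GZ0PI', fetch_via_api, scrape_with_headless_browser)`, where `scrape_with_headless_browser` stands in for our existing scraper. Counting how often `source` comes back as `"headless"` would also tell us whether the library is worth keeping.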