robcharlwood / you-judge

An experimental Google AppEngine app using natural language processing to process user sentiment analysis on YouTube content and comments.
MIT License

Look into Google Speech to Text for videos with missing transcripts #11

Open robcharlwood opened 5 years ago

robcharlwood commented 5 years ago

Not all videos have a transcript or captions available. In these cases it would be nice to be able to generate one for them and then use the new transcript in sentiment analysis.

Google Speech to Text API should do the job. A little pricey, but hey I'm having fun. :)

robcharlwood commented 5 years ago

OK, so for an initial prototype this looks like a little too much work. We can't easily download the audio streams for videos.

Even though pytube gives us access to the audio streams, we can't download them due to limitations in urlfetch on App Engine. There is a 60 second time limit on requests made via urlfetch, and a single urlfetch can only handle up to 32MB of data. Some of these audio streams could well be much larger than that.
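As a rough illustration of those two limits, here is a minimal sketch that checks whether a stream of a given size could plausibly be fetched in a single urlfetch call. The function name, the `bytes_per_second` estimate, and treating 32MB as 32 * 1024 * 1024 bytes are assumptions made for illustration. None of this comes from you-judge or pytube.

```python
# Limits quoted in the comment above; the exact byte value used for
# "32MB" is an assumption.
URLFETCH_DEADLINE_SECONDS = 60
URLFETCH_MAX_BYTES = 32 * 1024 * 1024


def fits_urlfetch_limits(size_bytes, bytes_per_second):
    """Return True if a download would stay within both urlfetch limits.

    Hypothetical helper: bytes_per_second is a caller-supplied estimate
    of download throughput, not something urlfetch reports.
    """
    if size_bytes > URLFETCH_MAX_BYTES:
        return False
    if bytes_per_second <= 0:
        return False
    estimated_seconds = float(size_bytes) / bytes_per_second
    return estimated_seconds <= URLFETCH_DEADLINE_SECONDS
```

A check like this would only tell us which videos are safe to attempt. It does not solve the problem for the larger streams.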

This is plausible functionality though. Once we find a way to download the data and store it in Cloud Storage, we can easily run it through Cloud Speech to Text on the slow-running poll service. We'd then gradually build up a transcript file as the audio is processed. We could then run the completed transcripts through Cloud Natural Language for sentiment analysis.
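The "gradually build up a transcript" step might look something like the sketch below. It is not tied to any real API: `transcribe_chunk` is a hypothetical callable (audio bytes in, text out) standing in for whatever call to Cloud Speech to Text we end up using, and both function names are made up for illustration.

```python
def build_transcript(audio_chunks, transcribe_chunk):
    """Yield the transcript so far after each audio chunk is processed.

    transcribe_chunk is a hypothetical callable (bytes -> text) standing
    in for a call to a speech to text service.
    """
    parts = []
    for chunk in audio_chunks:
        text = transcribe_chunk(chunk).strip()
        # Skip chunks that produced no text (for example, silence).
        if text:
            parts.append(text)
        yield " ".join(parts)


def final_transcript(audio_chunks, transcribe_chunk):
    """Run every chunk through transcribe_chunk and return the full text."""
    transcript = ""
    for transcript in build_transcript(audio_chunks, transcribe_chunk):
        pass
    return transcript
```

Each value yielded by `build_transcript` could be written back to the transcript file, and the result of `final_transcript` is what would be passed on for sentiment analysis.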

There are only a couple of ways around this issue on Appengine Standard as far as I can tell.

  1. We use sockets and force httplib to use them instead of urlfetch via the GAE_USE_SOCKETS_HTTPLIB env variable. This would get around the 60 second deadline, but it might not get around App Engine Standard's memory limits (again depending on file size).
  2. We set up a compute instance or something similar and run the process there as a background task. Our App Engine instance would then simply poll the Cloud Storage bucket for the completed transcript.
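The polling half of option 2 could be sketched roughly as follows. `fetch_transcript` is a hypothetical callable that returns the transcript text, or `None` while the object is not yet in the bucket. The attempt count and delay are placeholder values, and `sleep` is injectable so the loop can be tested without waiting.

```python
import time


def poll_for_transcript(fetch_transcript, max_attempts=10, delay_seconds=30,
                        sleep=time.sleep):
    """Call fetch_transcript until it returns a transcript or attempts run out.

    fetch_transcript is a hypothetical callable that returns the transcript
    text, or None while the object is not yet in the bucket.
    """
    for attempt in range(max_attempts):
        transcript = fetch_transcript()
        if transcript is not None:
            return transcript
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(delay_seconds)
    return None
```

In practice the existing slow-running poll service would probably make one check per invocation instead of sleeping inside a request, but the stop conditions would be the same.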

I think option 2 is the more realistic solution, but that's going to take more time than I have right now. Marking this as on hold until I get some extra time to add this in.

If anybody has any better ideas, please add to this thread. :)