learningequality / ka-lite

KA Lite: lightweight web server for serving core Khan Academy content (videos and exercises) without needing internet connectivity
https://learningequality.org/ka-lite/
Other
458 stars 305 forks source link

Remove youtube-dl as fallback (because it isn't really a fallback) #5547

Closed benjaoming closed 6 years ago

benjaoming commented 6 years ago

Summary

image

After adding retry logic in #5545, it has occurred that Youtube-dl as a fallback has become meaningless. And already was!

  1. From a JSON API, we call fle_utils.internet.functions.am_i_online() to check if the central server is online. If the server doesn't respond with a status 200, we assume that we cannot download videos at all = no fallback available when the central server is down.
  2. If a video returns a 404 on our redirect, it seems that something is wrong anyways with the youtube ID, and as such, there's not a clear chance that youtube-dl would succeed anyways. We'd be interested in knowing this and fixing it in case it does happen, so I think it's fine that a user is bothered by this.
  3. If downloading videos results in a 500, we could use youtube-dl, assuming that there is an error in our video download script. But this is highly unlikely that it would happen without being actually a problem for the whole central server anyways, so we wouldn't reach this step normally.
  4. If there is a 502 (gateway error), we would never reach this step anyways!

So the only way that we would reach youtube-dl as a fallback, would be when our server goes to lunch while a user is downloaded. Added to that, each download would have to fail on its own, and we would not implement any retry patterns of HTTP status errors, which means that in order to use the youtube-dl fallback, we would water down the retry logic because of the 404, 502 and 500 errors should not wait+retry, but just fallback to youtube-dl.

Retry logic

The retry logic needs to retry when there are connection errors and read timeouts. This is quite normal when a connection is unstable or a user has to move a laptop while downloading.

This should prompt a pause in the download and not move on to trying youtube-dl with a subsequent "oh this also failed, now we will sleep for 10 seconds, then try the other download, then go to youtube-dl and then fail again".

Instead, these retry errors should prompt a normal 10 second pause, before silently failing again and not bothering with some youtube-dl fallback.

Benefits of dumping youtube-dl

Youtube-dl has a disk print of 7.5 MB. It's a very heavy dependency and doesn't at all match a normal dependency policy, since it's only imported once.

Removing it will also reduce the logic of our video download, which has long suffered from instability, quite simply because the code was hard to understand.

For the sake of "legacy mode", I would also suggest that a scraper of this caliber, which is subject to changes (because youtube changes a lot), would be assumed to break often and thus not very nice to have.

benjaoming commented 6 years ago

Removed dependency in #5545