Closed kelson42 closed 4 years ago
The goal of this issue is to avoid redoing work that has been done already. The way this scraper works is by doing the following:

1. gather the list of videos to include
2. download each video file from youtube
3. re-encode it into the wanted format (`webm`, `mp4`) at the wanted quality (`low_quality` param)

Videos are not updated on youtube side once they are published, so what we want is that whenever we create a new version of a ZIM file, we'd only have to do steps 2 & 3 for new videos.
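For context, the re-encoding step could be sketched as assembling an ffmpeg command line. This is purely illustrative: the codecs and bitrates below are assumptions, not the scraper's actual encoder parameters.

```python
def build_reencode_command(src: str, dest: str, video_format: str,
                           low_quality: bool) -> list:
    """Assemble an ffmpeg command re-encoding `src` into `dest`.

    Codec and bitrate choices are illustrative placeholders, not the
    scraper's real encoder settings.
    """
    if video_format == "webm":
        codec_args = ["-codec:v", "libvpx", "-codec:a", "libvorbis"]
    elif video_format == "mp4":
        codec_args = ["-codec:v", "h264", "-codec:a", "aac"]
    else:
        raise ValueError(f"unsupported format: {video_format}")

    # hypothetical bitrate mapping for the low_quality param
    quality_args = ["-b:v", "300k"] if low_quality else ["-b:v", "1M"]

    return ["ffmpeg", "-y", "-i", src] + codec_args + quality_args + [dest]
```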
For that, we are setting up a dedicated online storage based on AWS S3. We're using a different provider but the API is similar.
In the new wanted behavior, for each video, we should:

1. check whether an already-encoded version of the video is present in the online storage
2. if it is, download it from there instead of redoing the download and encoding
3. if it is not, download it from youtube, re-encode it and upload the result to the storage

This is the same mechanism as for other scrapers (ATM only mwoffliner implements it).
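The check-then-reuse flow above could be sketched as follows. `storage` stands for any S3-like wrapper; the method names `has`, `download` and `upload` are placeholders for this sketch, not the real kiwixstorage API.

```python
def get_video(storage, video_id: str, key: str, encode, dest: str) -> str:
    """Fetch an encoded video, reusing the cached copy when available.

    storage: object exposing has(key) -> bool, download(key, dest) and
             upload(key, path) -- a placeholder interface.
    encode:  callable that downloads from youtube and re-encodes to dest.
    """
    if storage.has(key):
        # the expensive work was done in a previous run: fetch the result
        storage.download(key, dest)
        return dest
    # new video: do the expensive work once, then cache it for next runs
    encode(video_id, dest)
    storage.upload(key, dest)
    return dest
```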
In S3, you store files using two different pieces of information:

- `bucket`, which is like a root folder. We'll use a single one for all the videos we scrape.
- `key`, which is a string identifying the file. It can be anything. People usually use filenames, path-like names or full URLs for instance. mwoffliner uses the URL of the resource it caches as key.

In addition, S3 provides an `ETag` for all files. This is generated server-side. The documentation doesn't enforce an implementation so this can be anything. Fortunately, most web servers use an `md5` hash as it's speedy and reliable.

Now, here's the huge difference with youtube: we are not storing the file we downloaded but re-encoded versions of it. A single video ID like `xyz01b` therefore doesn't identify a single file: we have `xyz01b/mp4` for instance, or even `xyz01b/mp4/low` once quality is taken into account.
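Since most S3 implementations expose an md5-based ETag, checking whether a local file matches the stored object could be sketched like this (assuming a plain, non-multipart md5 ETag; multipart uploads use a different scheme that this sketch ignores):

```python
import hashlib

def matches_etag(payload: bytes, etag: str) -> bool:
    """Compare a local md5 digest to an S3-style ETag.

    S3 ETags are usually the hex md5 of the object wrapped in double
    quotes, so we strip quotes before comparing.
    """
    digest = hashlib.md5(payload).hexdigest()
    return digest == etag.strip('"')
```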
There are different ways to handle this versioning with S3:

- use the `key` to integrate all the info: `xyz01b_low.mp4` differs from `xyz01b_regular.mp4` or `xyz01b_low.webm`.
- keys are flexible, so we could also use path-like keys such as `mp4/low/v1/xyz01b`, giving us the ability to easily remove low-quality files created with encoder params `v1`.

Based on this, I propose that we use the file-path approach as follows:
```
mp4/
  low/
    <videoId>
    <videoId>
    <videoId>
  high/
webm/
```
So examples of valid `Key`s for video files:

- `mp4/low/qbO7_ivMldc`
- `webm/low/qbO7_ivMldc`
- `mp4/high/qbO7_ivMldc`
- `webm/high/qbO7_ivMldc`
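The `<format>/<quality>/<videoId>` scheme above is straightforward to generate; a small helper could look like this (function name is illustrative):

```python
def video_key(video_format: str, low_quality: bool, video_id: str) -> str:
    """Build the S3 key for an encoded video, following the
    <format>/<quality>/<videoId> layout proposed above."""
    if video_format not in ("mp4", "webm"):
        raise ValueError(f"unsupported format: {video_format}")
    quality = "low" if low_quality else "high"
    return f"{video_format}/{quality}/{video_id}"
```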
It means that we could end up with 4 different objects/files for a single Youtube video. In addition to this key, we will set an `encoder_version` meta on each object with the version of the encoding algorithm (this needs to be created).
As for implementation, this would require the introduction of two new parameters:

- `--optimization-cache=<url-with-credentials>`: this enables the use of this feature. Scraper should remain usable without it, obviously.
- `--use-any-optimized-version`: with this enabled, scraper wouldn't compare the in-code encoder-version with the version stored in our S3.

Notes:
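The interplay between the `encoder_version` meta and `--use-any-optimized-version` could look like this (a sketch; flag and meta names follow the proposal above):

```python
def should_reuse(cached_encoder_version, current_encoder_version: str,
                 use_any_optimized_version: bool) -> bool:
    """Decide whether a cached S3 object can be reused.

    cached_encoder_version is the `encoder_version` meta read from the
    object, or None when the object is missing.
    """
    if cached_encoder_version is None:
        return False  # nothing cached: we must encode from scratch
    if use_any_optimized_version:
        return True  # any previously optimized version is acceptable
    # strict mode: only reuse files produced by the current encoder version
    return cached_encoder_version == current_encoder_version
```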
@kelson42 @rgaudin The argument is `--optimization-cache`, which sounds like a boolean argument, but actually it expects the URL to S3. Do you agree? Should we change this?
It's fine. A boolean would be `--optimize-cache` or `--use-optimization-cache`. The latter is used in mwoffliner to get the URL; that's why I changed it for youtube.
@nabinkhadka We should have something like --optimization-cache=http://s3.us-west-1.wasabisys.com/?bucketName=mwoffliner?keyId=TQKH8QZR63XT5AUALUP0?secretAccessKey=rI9XQsrTOImihm5aLf5UYI8eMClqJwsDGptbDYa3
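Extracting the endpoint and credentials from such a URL could be sketched with the standard library. This assumes a conventionally formed query string with `&` separators and parameter names `bucketName`, `keyId` and `secretAccessKey` (the exact format accepted by the scraper is not specified here, so treat this as an assumption):

```python
from urllib.parse import parse_qs, urlsplit

def parse_cache_url(url: str) -> dict:
    """Split an --optimization-cache URL into endpoint and credentials.

    Assumes standard &-separated query parameters; the parameter names
    are illustrative, not a documented format.
    """
    parts = urlsplit(url)
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return {
        "endpoint": f"{parts.scheme}://{parts.netloc}",
        "bucket": params.get("bucketName"),
        "key_id": params.get("keyId"),
        "secret": params.get("secretAccessKey"),
    }
```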
FYI, I moved the storage module to an independent kiwixstorage lib. If you need changes on that module, please open PR there.
@rgaudin Does that mean we can/should use that as a library for youtube2zim?
I have done an initial commit in this branch https://github.com/openzim/youtube/tree/Youtube-69-Introduce-optimization-cache. This includes only the implementation of `optimization-cache`, while the implementation of `use-any-optimized-version` is still in development (I do not completely understand it as of now). There are many things like videos, thumbnails, subtitles, etc. that are downloaded using youtube-dl, so I have used the skip-download argument to download everything other than the video when the `optimization-cache` argument is used.
Thank you @nabinkhadka. Yes, obviously, you should use it instead.
Thank you for your commit but please:
As for other files, those change over time (thumbnails, subtitles), just as the metadata, and are quite small compared to the videos. That's why we didn't include those in the initial requirements. We'll probably add them in a second pass but we need to check if they get consistent URLs and ETags.
Maybe the user would also like to configure a prefix for the keys, so that they don't feel like they are saving to the root directory?
No, that's what buckets are for. Open a ticket if you think that's important so @kelson42 can give his thoughts on this.
Agree, this is the role of the bucket. So, already configurable.
Like in MWoffliner, see https://github.com/openzim/mwoffliner/issues/996.
If the option to deal with S3 should be available at youtube2zim level, the implementation should be made generic at python_scraperlib level so we can reuse it for other projects.