openzim / youtube

Create a ZIM file from a Youtube channel/username/playlist
GNU General Public License v3.0

Introduce optimization cache #69

Closed (kelson42 closed this issue 4 years ago)

kelson42 commented 4 years ago

Like in MWoffliner, see https://github.com/openzim/mwoffliner/issues/996.

If the option to deal with S3 should be available at the youtube2zim level, the implementation should be made generic at the python_scraperlib level so we can reuse it for another project.

rgaudin commented 4 years ago

The goal of this issue is to avoid redoing work that has been done already. The way this scraper works is by doing the following:

  1. compute a list of video Ids from a request
  2. download all videos into individual files, using a specific format based on request (webm, mp4)
  3. recompress video files using different settings based on low_quality param.
  4. create a ZIM

Videos are not updated on the YouTube side once they are published, so what we want is that, whenever we create a new version of a ZIM file, we only have to do steps 2 & 3 for new videos.

For that, we are setting up a dedicated online storage based on AWS S3. We're using a different provider but the API is similar.

In the new desired behavior, for each video, we should:

  1. check if we already have it in S3 (see below)
  2. if we do, download it from S3 and move to next
  3. if we don't, download from youtube and process as before then upload to S3.

This is the same mechanism as for the other scrapers (ATM only mwoffliner implements it).
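The three steps above can be sketched roughly as follows. This is only an illustration, not the scraper's code: `InMemoryCache` is a hypothetical stand-in for the S3 bucket, and `download_and_recompress` is an assumed callable covering the existing "download from youtube and recompress" work.

```python
from typing import Callable, Dict


class InMemoryCache:
    """Hypothetical stand-in for the S3 bucket: maps a key to file bytes."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def has(self, key: str) -> bool:
        return key in self._objects

    def download(self, key: str) -> bytes:
        return self._objects[key]

    def upload(self, key: str, data: bytes) -> None:
        self._objects[key] = data


def get_video(
    video_id: str,
    cache: InMemoryCache,
    download_and_recompress: Callable[[str], bytes],
) -> bytes:
    """Return the optimized video, reusing the cached copy when there is one."""
    if cache.has(video_id):                   # step 1: already in the cache?
        return cache.download(video_id)       # step 2: reuse it and move on
    data = download_and_recompress(video_id)  # step 3: do the work once...
    cache.upload(video_id, data)              # ...then store it for next time
    return data
```

With this shape, the expensive download and recompression only run the first time a given video is requested; later runs fetch the stored copy.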

In S3, you store files using two different pieces of information:

Now, here's the huge difference with youtube:

There are different ways to handle this versioning with S3:

Based on this, I propose that we use the file-path approach as follows:

mp4/
    low/
        <videoId>
        <videoId>
        <videoId>
    high/
webm/

So, examples of valid keys for video files:

It means that we could end up with 4 different objects/files for a single YouTube video. In addition to this key, we will set an encoder_version meta on each object with the version of the encoding algorithm (this needs to be created).
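As a minimal sketch of the layout shown in the tree, a key could be built from the format, the quality, and the video ID. The function names, the `ENCODER_VERSION` value, and the way the meta is compared are assumptions made for illustration. Only the `<format>/<quality>/<videoId>` shape and the `encoder_version` meta name come from the proposal.

```python
from itertools import product
from typing import Dict, List

VIDEO_FORMATS = ("mp4", "webm")
QUALITIES = ("low", "high")
ENCODER_VERSION = 1  # hypothetical value; the versioning still "needs to be created"


def cache_key(video_format: str, quality: str, video_id: str) -> str:
    """Build the object key following the <format>/<quality>/<videoId> layout."""
    if video_format not in VIDEO_FORMATS:
        raise ValueError(f"unknown format: {video_format}")
    if quality not in QUALITIES:
        raise ValueError(f"unknown quality: {quality}")
    return f"{video_format}/{quality}/{video_id}"


def all_keys(video_id: str) -> List[str]:
    """List the 4 possible objects for a single video."""
    return [cache_key(f, q, video_id) for f, q in product(VIDEO_FORMATS, QUALITIES)]


def is_current(meta: Dict[str, str], encoder_version: int = ENCODER_VERSION) -> bool:
    """Check the encoder_version meta; S3 user metadata values are strings."""
    return meta.get("encoder_version") == str(encoder_version)
```

Keeping the encoder version in the object metadata rather than in the key means a re-encoded file overwrites the old one at the same key instead of accumulating next to it.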

As for implementation, this would require the introduction of two new parameters:

Notes:

nabinkhadka commented 4 years ago

@kelson42 @rgaudin The argument is --optimization-cache, which sounds like a boolean argument, but it actually expects the URL to S3. Do you agree? Should we change this?

rgaudin commented 4 years ago

It's fine. A boolean would be --optimize-cache or --use-optimization-cache. The latter is used in mwoffliner to get the URL; that's why I changed it for youtube.

kelson42 commented 4 years ago

@nabinkhadka We should have something like --optimization-cache=http://s3.us-west-1.wasabisys.com/?bucketName=mwoffliner?keyId=TQKH8QZR63XT5AUALUP0?secretAccessKey=rI9XQsrTOImihm5aLf5UYI8eMClqJwsDGptbDYa3
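For illustration only, a URL of that shape could be split into its parts with the standard library. This is a sketch, not the scraper's or any library's actual parser: `parse_cache_url` is a hypothetical name, and since the example separates parameters with `?` rather than the usual `&`, the sketch assumes both should be accepted.

```python
from typing import Dict
from urllib.parse import parse_qs, urlsplit


def parse_cache_url(url: str) -> Dict[str, str]:
    """Split an optimization-cache URL into endpoint, bucket and credentials."""
    parts = urlsplit(url)
    # Everything after the first "?" is the query; treat later "?" like "&".
    query = parts.query.replace("?", "&")
    params = {name: values[0] for name, values in parse_qs(query).items()}
    missing = {"bucketName", "keyId", "secretAccessKey"} - params.keys()
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return {
        "endpoint": f"{parts.scheme}://{parts.netloc}",
        "bucket": params["bucketName"],
        "key_id": params["keyId"],
        "secret_access_key": params["secretAccessKey"],
    }
```

Calling it with placeholder credentials, for example `parse_cache_url("http://s3.example.org/?bucketName=b?keyId=K?secretAccessKey=S")`, returns the endpoint and the three parameters as a dictionary.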

rgaudin commented 4 years ago

FYI, I moved the storage module to an independent kiwixstorage lib. If you need changes on that module, please open PR there.

nabinkhadka commented 4 years ago

@rgaudin Does that mean we can/should use that as a library for youtube2zim?

I have made an initial commit in this branch https://github.com/openzim/youtube/tree/Youtube-69-Introduce-optimization-cache. It includes only the implementation of optimization-cache, while the implementation of use-any-optimized-version is still in development (I do not understand it completely as of now). Many things (videos, thumbnails, subtitles, etc.) are downloaded using youtube-dl, so I have used the skip-download argument to download everything other than the video when the optimization-cache argument is used.
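A rough sketch of that idea, not taken from the branch: the function below builds an options dictionary that could be passed to `youtube_dl.YoutubeDL(...)`. The helper name `ydl_options` is hypothetical; `skip_download`, `writethumbnail`, `writesubtitles` and `allsubtitles` are youtube-dl option names as I understand them, so check them against the installed version.

```python
from typing import Any, Dict


def ydl_options(use_cache: bool) -> Dict[str, Any]:
    """Build youtube-dl options; skip the video file when the cache is in use."""
    options: Dict[str, Any] = {
        "writethumbnail": True,   # still fetch the thumbnail
        "writesubtitles": True,   # still fetch subtitles
        "allsubtitles": True,     # in every available language
    }
    if use_cache:
        # The video itself is handled through the cache, so only the
        # side files (thumbnail, subtitles) are downloaded by youtube-dl.
        options["skip_download"] = True
    return options
```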

rgaudin commented 4 years ago

Thank you @nabinkhadka. Yes, obviously, you should use it instead.

Thank you for your commit but please:

rgaudin commented 4 years ago

As for the other files (thumbnails, subtitles), those change over time, just like the metadata, and are quite small compared to the videos. That's why we didn't include them in the initial requirements.

We'll probably add them in a second pass but we need to check if they get consistent URLs and ETags.

nabinkhadka commented 4 years ago

Maybe users would also like to configure a prefix for the key to store under, in case they don't want to save to the root of the bucket?

rgaudin commented 4 years ago

No, that's what buckets are for. Open a ticket if you think that's important so @kelson42 can give his thoughts on this.

kelson42 commented 4 years ago

Agree, this is the role of the bucket. So, already configurable.