Closed kelson42 closed 4 years ago
The goal of this issue is to avoid redoing work that has been done already. The way this scraper works is by doing the following:

1. gather the list of videos to include
2. download each video file from youtube
3. re-encode it into the wanted format (`webm`, `mp4`) at the wanted quality (`low_quality` param)

Videos are not updated on youtube side once they are published, so what we want is that whenever we create a new version of a ZIM file, we'd only have to do steps 2 & 3 for new videos.
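For context, the re-encoding step could be sketched as assembling an ffmpeg command line. This is purely illustrative: the codecs and bitrates below are assumptions, not the scraper's actual encoder parameters.

```python
def build_reencode_command(src: str, dest: str, video_format: str,
                           low_quality: bool) -> list:
    """Assemble an ffmpeg command re-encoding `src` into `dest`.

    Codec and bitrate choices are illustrative placeholders, not the
    scraper's real encoder settings.
    """
    if video_format == "webm":
        codec_args = ["-codec:v", "libvpx", "-codec:a", "libvorbis"]
    elif video_format == "mp4":
        codec_args = ["-codec:v", "h264", "-codec:a", "aac"]
    else:
        raise ValueError(f"unsupported format: {video_format}")

    # hypothetical bitrate mapping for the low_quality param
    quality_args = ["-b:v", "300k"] if low_quality else ["-b:v", "1M"]

    return ["ffmpeg", "-y", "-i", src] + codec_args + quality_args + [dest]
```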
For that, we are setting up a dedicated online storage based on AWS S3. We're using a different provider but the API is similar.
In the new wanted behavior, for each video, we should:

1. check whether an already-encoded version of the video is present in the online storage
2. if it is, download it from there instead of redoing the download and encoding
3. if it is not, download it from youtube, re-encode it and upload the result to the storage

This is the same mechanism as for other scrapers (ATM only mwoffliner implements it).
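The check-then-reuse flow above could be sketched as follows. `storage` stands for any S3-like wrapper; the method names `has`, `download` and `upload` are placeholders for this sketch, not the real kiwixstorage API.

```python
def get_video(storage, video_id: str, key: str, encode, dest: str) -> str:
    """Fetch an encoded video, reusing the cached copy when available.

    storage: object exposing has(key) -> bool, download(key, dest) and
             upload(key, path) -- a placeholder interface.
    encode:  callable that downloads from youtube and re-encodes to dest.
    """
    if storage.has(key):
        # the expensive work was done in a previous run: fetch the result
        storage.download(key, dest)
        return dest
    # new video: do the expensive work once, then cache it for next runs
    encode(video_id, dest)
    storage.upload(key, dest)
    return dest
```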
In S3, you store files using two different pieces of information:

- `bucket`, which is like a root folder. We'll use a single one for all the videos we scrape.
- `key`, which is a string identifying the file. It can be anything. People usually use filenames, path-like names or full URLs for instance. mwoffliner uses the URL of the resource it caches as key.

In addition, S3 provides an `ETag` for all files. This is generated server-side. The documentation doesn't enforce an implementation so this can be anything. Fortunately, most web servers use an `md5` hash as it's speedy and reliable.

Now, here's the huge difference with youtube: we are not storing the file we downloaded but re-encoded versions of it. A single video ID like `xyz01b` therefore doesn't identify a single file: we have `xyz01b/mp4` for instance, or even `xyz01b/mp4/low` once quality is taken into account.
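Since most S3 implementations expose an md5-based ETag, checking whether a local file matches the stored object could be sketched like this (assuming a plain, non-multipart md5 ETag; multipart uploads use a different scheme that this sketch ignores):

```python
import hashlib

def matches_etag(payload: bytes, etag: str) -> bool:
    """Compare a local md5 digest to an S3-style ETag.

    S3 ETags are usually the hex md5 of the object wrapped in double
    quotes, so we strip quotes before comparing.
    """
    digest = hashlib.md5(payload).hexdigest()
    return digest == etag.strip('"')
```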
There are different ways to handle this versioning with S3:

- use the `key` to integrate all the info: `xyz01b_low.mp4` differs from `xyz01b_regular.mp4` or `xyz01b_low.webm`.
- keys are flexible, so we could also use path-like keys such as `mp4/low/v1/xyz01b`, giving us the ability to easily remove low-quality files created with encoder params `v1`.

Based on this, I propose that we use the file-path approach as follows:
```
mp4/
  low/
    <videoId>
    <videoId>
    <videoId>
  high/
webm/
```
So examples of valid `Key`s for video files:

- `mp4/low/qbO7_ivMldc`
- `webm/low/qbO7_ivMldc`
- `mp4/high/qbO7_ivMldc`
- `webm/high/qbO7_ivMldc`
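The `<format>/<quality>/<videoId>` scheme above is straightforward to generate; a small helper could look like this (function name is illustrative):

```python
def video_key(video_format: str, low_quality: bool, video_id: str) -> str:
    """Build the S3 key for an encoded video, following the
    <format>/<quality>/<videoId> layout proposed above."""
    if video_format not in ("mp4", "webm"):
        raise ValueError(f"unsupported format: {video_format}")
    quality = "low" if low_quality else "high"
    return f"{video_format}/{quality}/{video_id}"
```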
It means that we could end up with 4 different objects/files for a single Youtube video. In addition to this key, we will set an `encoder_version` meta on each object with the version of the encoding algorithm (this needs to be created).
As for implementation, this would require the introduction of two new parameters:

- `--optimization-cache=<url-with-credentials>`: this enables the use of this feature. Scraper should remain usable without it, obviously.
- `--use-any-optimized-version`: with this enabled, scraper wouldn't compare the in-code encoder-version with the version stored in our S3.

Notes:
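The interplay between the `encoder_version` meta and `--use-any-optimized-version` could look like this (a sketch; flag and meta names follow the proposal above):

```python
def should_reuse(cached_encoder_version, current_encoder_version: str,
                 use_any_optimized_version: bool) -> bool:
    """Decide whether a cached S3 object can be reused.

    cached_encoder_version is the `encoder_version` meta read from the
    object, or None when the object is missing.
    """
    if cached_encoder_version is None:
        return False  # nothing cached: we must encode from scratch
    if use_any_optimized_version:
        return True  # any previously optimized version is acceptable
    # strict mode: only reuse files produced by the current encoder version
    return cached_encoder_version == current_encoder_version
```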
@kelson42 @rgaudin The argument is `--optimization-cache`, which sounds like a boolean argument, but actually it expects the URL to S3. Do you agree? Should we change this?
It's fine. A boolean would be `--optimize-cache` or `--use-optimization-cache`. The latter is used in mwoffliner to get the URL; that's why I changed it for youtube.
@nabinkhadka We should have something like --optimization-cache=http://s3.us-west-1.wasabisys.com/?bucketName=mwoffliner?keyId=TQKH8QZR63XT5AUALUP0?secretAccessKey=rI9XQsrTOImihm5aLf5UYI8eMClqJwsDGptbDYa3
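Extracting the endpoint and credentials from such a URL could be sketched with the standard library. This assumes a conventionally formed query string with `&` separators and parameter names `bucketName`, `keyId` and `secretAccessKey` (the exact format accepted by the scraper is not specified here, so treat this as an assumption):

```python
from urllib.parse import parse_qs, urlsplit

def parse_cache_url(url: str) -> dict:
    """Split an --optimization-cache URL into endpoint and credentials.

    Assumes standard &-separated query parameters; the parameter names
    are illustrative, not a documented format.
    """
    parts = urlsplit(url)
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return {
        "endpoint": f"{parts.scheme}://{parts.netloc}",
        "bucket": params.get("bucketName"),
        "key_id": params.get("keyId"),
        "secret": params.get("secretAccessKey"),
    }
```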
FYI, I moved the storage module to an independent kiwixstorage lib. If you need changes on that module, please open PR there.
@rgaudin Does that mean we can/should use that as a library for youtube2zim?
I have done an initial commit in this branch https://github.com/openzim/youtube/tree/Youtube-69-Introduce-optimization-cache. This includes only the implementation of `optimization-cache`, while the implementation of `use-any-optimized-version` is still in development (I do not completely understand it as of now). There are many things like videos, thumbnails, subtitles, etc. that are downloaded using youtube-dl, so I have used the skip-download argument to download everything other than the video when the `optimization-cache` argument is used.
Thank you @nabinkhadka. Yes, obviously, you should use it instead.
Thank you for your commit but please:
As for other files, those change over time (thumbnails, subtitles), just as the metadata, and are quite small compared to the videos. That's why we didn't include those in the initial requirements. We'll probably add them in a second pass but we need to check if they get consistent URLs and ETags.
Maybe the user would also like to configure a prefix for the keys, so that they don't feel like they are saving to the root directory?
No, that's what buckets are for. Open a ticket if you think that's important so @kelson42 can give his thoughts on this.
Agree, this is the role of the bucket. So, already configurable.
Like in MWoffliner, see https://github.com/openzim/mwoffliner/issues/996.
If the option to deal with S3 should be available at youtube2zim level, the implementation should be made generic at python_scraperlib level so we can reuse it for other projects.