openzim / youtube

Create a ZIM file from a Youtube channel/username/playlist
GNU General Public License v3.0
52 stars 29 forks source link

Fixed #69: introduce optimization cache #79

Closed nabinkhadka closed 4 years ago

nabinkhadka commented 4 years ago

Issue #69

nabinkhadka commented 4 years ago

@rgaudin Yes, there are many things to work on and that is the reason I said "working on issue #69" in the description above. And so I hadn't created a PR and wanted to check with you to see if I am going right direction. But thank you for your time and comments.

rgaudin commented 4 years ago

Yes, all of this is fixable. Hit the re-request button when you're ready.

nabinkhadka commented 4 years ago

@rgaudin

rgaudin commented 4 years ago
  • What is a non-optimized version? Is it the one before running post_process_video function which runs the recompress_video? If so why would we save a non-optimized version?

Yes that's the one. We ask google for the best video it can give us in the format requested but don't ask for a specific quality because those are not reliably available on videos. Instead we get that video file and we recompress it anyway (with different options depending on --low-quality). post-processing also includes thumbnail resizing I think. We want to cache that compressed version so that, if present in cache, we save on both bandwith and processing time/resources. In your PR, you upload (and thus cache) the received video before processing it. We want to upload the optimized version (hence the parameter name).

I'll use this opportunity to elaborate on the other requirement: --use-any-optimized-version. This processing/reencoding code, if changed, has a lot of impact on the video : they can look better/worst or filesize can change drastically. So we want to introduce a version string that defines the encoder code. It's just something that we change whenever we make significant changes to the encoding code.

So we want to also store that encoder-version on all the files we upload to the cache (as a meta). This would allow us to programmaticly (not in youtube2zim) remove older version for instance. In reality, we may have to keep running the old-encoder-version-video files because we don't have the time or capacity to re-download from youtube, re-encode and re-upload. So in this case, we use --use-any-optimized-version to say “if there's a cache for this file, no matter the encoder-version, give it to me”. Otherwise (default), you check for the existence of a video in cache, according to the encoder-version. The kiwixstorage lib already supports this (see has_object_matching_meta()).

Let me know if this is clearer now. it's a lot of details I know :)

  • Why do we need to know if anything has failed except for logging message? If the download fails, it's safe to download using youtube-dl as per my perspective. And if the upload fails, we simply don't do it. There is no point in counting how many cache download/upload failed because the user will eventually look for result no matter how they get.

maybe some user doing this just once doesn't care ; but we are not a regular user. we create a lot of ZIM files containing a lot (thousands) of videos… every month. It is critical that we know when something that is supposed to work doesn't; especially on retrieving and saving those videos because failure there means that we'll be transferring off youtube, exposing ourselves to IP restrictions, jeopardizing all the ZIM files.

So it's very important. And we want details (exception trace). Out of scope for you but we may want to crash on such scenarios in the future.

  • I have used python black for formatting and docstrings are also present. Is there some other standards to follow regarding these?

Yes, the failing black file is not the one you worked one. My bad. Thank you for picking that up, it's much appreciated 👍

As for docstring, there's no written rule but you'll notice they all look somewhat similar and have a similar wording style. Here's an example:

    # that's yours
    def s3_cache_found(self, video_id, local_video_path):
        """
        Download file if present
        return True if cache is found
        else return False
        """
        pass

    # what I'd prefer to have
    def s3_cache_found(self, video_id, local_video_path):
        """ whether the file was found (and downloaded from) in cache """ 
        pass

Hope that's understandable. If not, we'll fix those together once the rest of done.

nabinkhadka commented 4 years ago

Even though I have asked these questions, I am happy that I can start working on --use-any-optimized-version argument.

rgaudin commented 4 years ago

I think we need to change both. requests uses semantic versioning as well so I'll update kiwixstorage to accept >=2.23,<3.0. We should also update here and the same version string would work fine for us as well. You can put git master in requirements for kiwixstorage while you develop. We'll make a new release of kiwixstorage once we've fixed everything you need.

No we don't need that granularity. That would be a nightmare to manage. Just a simple version string to match the code version that was used to encode the video.

the developer making changes to the ffmpeg params. you will have to add v1 for current version and that will stay like that until we change ffmpeg params in a way that has significant impact on the video files.

Nope, as I said, if we change the codec, we'll just bump the version for all.

No, just don't filter using the meta but the key remains the same ({video_format}/{quality}/{videoId})

I think error log is good enough for now.

No, if downloaded from cache, no need to recompress. that's the point. That's why it's important you upload the compressed version.

Good. Hopefully answers above regarding the encoder version should allow you to.

nabinkhadka commented 4 years ago

Any suggestion what to do for this line? https://github.com/openzim/youtube/blob/58748bee394c61d90a432b1a44d91cd8f1e3def4/youtube2zim/scraper.py#L288

This won't be true if cache is found, but this line comes before checking the cache presence.

rgaudin commented 4 years ago

Any suggestion what to do for this line?

https://github.com/openzim/youtube/blob/58748bee394c61d90a432b1a44d91cd8f1e3def4/youtube2zim/scraper.py#L288

This won't be true if cache is found, but this line comes before checking the cache presence.

Change it to Quality: regular|low or something. That's mostly useful to know in the logs what was requested.

nabinkhadka commented 4 years ago

Forked the kiwixstorage lib and made dependency version fix here. Using this docker image was build using minor changes (see attachment)

Screen Shot 2020-04-16 at 19 23 54

. Tested my arguments, works pretty good.

docker run -v test_dir:/output kiwix/youtube:dev youtube2zim --type "channel" --id XXXXX --name XXX --api-key XXXX-XXXX --debug --no-zim --low-quality --optimization-cache "https://s3.eu-central-1.wasabisys.com/?keyId=XXXX&secretAccessKey=XXXX&bucketName=XXX" --use-any-optimized-version I have made the PR ready for review.

rgaudin commented 4 years ago

Why did you fork it? I've fix the reqs in the lib when I commented and asked you to you its master and bump our requests to >=2.23,<3.0. Can you please do that while I review the rest?

rgaudin commented 4 years ago

Found a bug. in post_process_video

   # don't reencode if not requesting low-quality and received wanted format
    if not low_quality and src_path.suffix[1:] == video_format:
        return

    dst_path = src_path.parent.joinpath(f"video.{video_format}")

    if recompress:
        recompress_video(src_path, dst_path, video_format, low_quality)

your if statement is unreachable if you don't use low-quality (and you most likely received proper format)… and so we upload uncompressed version to cache.

nabinkhadka commented 4 years ago

@rgaudin I do not understand how my change introduced the bug. My change was only if recompress, it was returning anyways even without that. But I feel like you are saying if not low_quality and src_path.suffix[1:] == video_format and not recompress:

rgaudin commented 4 years ago

@rgaudin I do not understand how my change introduced the bug. My change was only if recompress, it was returning anyways even without that. But I feel like you are saying if not low_quality and src_path.suffix[1:] == video_format and not recompress:

Ah… let me figure this out.

nabinkhadka commented 4 years ago
Screen Shot 2020-04-17 at 18 48 19

This is what I propose if I understand the problem correctly.

rgaudin commented 4 years ago

Well it's a bug that was there before. We used to recompress slightly in HQ videos but I do remember writing this shortcut (don't reencode if not LQ). I need to confirm wanted behavior with @kelson42. Nevermind my comment, I'll fix that later. Are you done with everything else?

nabinkhadka commented 4 years ago

@rgaudin Yes, but do you want me to send the change as shown in the screenshot attached above? Looks pretty safe to me. Please let me know.

rgaudin commented 4 years ago

No, don't.

nabinkhadka commented 4 years ago

Was planning to send with other commit. But done now.