Closed nabinkhadka closed 4 years ago
@rgaudin Yes, there are many things to work on and that is the reason I said "working on issue #69" in the description above. And so I hadn't created a PR and wanted to check with you to see if I am going right direction. But thank you for your time and comments.
Yes, all of this is fixable. Hit the re-request button when you're ready.
@rgaudin
post_process_video
function which runs the recompress_video
? If so why would we save a non-optimized version?
- What is a non-optimized version? Is it the one before running
post_process_video
function which runs therecompress_video
? If so why would we save a non-optimized version?
Yes that's the one. We ask google for the best video it can give us in the format requested but don't ask for a specific quality because those are not reliably available on videos. Instead we get that video file and we recompress it anyway (with different options depending on --low-quality
). post-processing also includes thumbnail resizing I think.
We want to cache that compressed version so that, if present in cache, we save on both bandwith and processing time/resources.
In your PR, you upload (and thus cache) the received video before processing it. We want to upload the optimized version (hence the parameter name).
I'll use this opportunity to elaborate on the other requirement: --use-any-optimized-version
.
This processing/reencoding code, if changed, has a lot of impact on the video : they can look better/worst or filesize can change drastically.
So we want to introduce a version string that defines the encoder code. It's just something that we change whenever we make significant changes to the encoding code.
So we want to also store that encoder-version on all the files we upload to the cache (as a meta). This would allow us to programmaticly (not in youtube2zim
) remove older version for instance.
In reality, we may have to keep running the old-encoder-version-video files because we don't have the time or capacity to re-download from youtube, re-encode and re-upload.
So in this case, we use --use-any-optimized-version
to say “if there's a cache for this file, no matter the encoder-version, give it to me”.
Otherwise (default), you check for the existence of a video in cache, according to the encoder-version.
The kiwixstorage lib already supports this (see has_object_matching_meta()
).
Let me know if this is clearer now. it's a lot of details I know :)
- Why do we need to know if anything has failed except for logging message? If the download fails, it's safe to download using youtube-dl as per my perspective. And if the upload fails, we simply don't do it. There is no point in counting how many cache download/upload failed because the user will eventually look for result no matter how they get.
maybe some user doing this just once doesn't care ; but we are not a regular user. we create a lot of ZIM files containing a lot (thousands) of videos… every month. It is critical that we know when something that is supposed to work doesn't; especially on retrieving and saving those videos because failure there means that we'll be transferring off youtube, exposing ourselves to IP restrictions, jeopardizing all the ZIM files.
So it's very important. And we want details (exception trace). Out of scope for you but we may want to crash on such scenarios in the future.
- I have used python black for formatting and docstrings are also present. Is there some other standards to follow regarding these?
Yes, the failing black file is not the one you worked one. My bad. Thank you for picking that up, it's much appreciated 👍
As for docstring, there's no written rule but you'll notice they all look somewhat similar and have a similar wording style. Here's an example:
# that's yours
def s3_cache_found(self, video_id, local_video_path):
"""
Download file if present
return True if cache is found
else return False
"""
pass
# what I'd prefer to have
def s3_cache_found(self, video_id, local_video_path):
""" whether the file was found (and downloaded from) in cache """
pass
Hope that's understandable. If not, we'll fix those together once the rest of done.
zimscraperlib
and kiwixstorage
have conflict regarding requests
, so the fix needs to be on kiwixstorage
rather. I believe this will help build the docker image. Also, I have removed requests
from the requirements.txt
file in recent commit--low-quality
argument either of these combine together and will be used for encoding. Hence there can be two values for encoder_version
meta on each object. Is that correct? Or we save encoding information of audio and video information separately in meta?encoder_version
can have 3 different values?if there's a cache for this file, no matter the encoder-version, give it to me
means we look for video_format/*/videoId
where *
can either be low
or high
, correct? Or just ignore what is in meta?So it's very important. And we want details (exception trace). Out of scope for you but we may want to crash on such scenarios in the future.
I have added error log. But if needed, I can raise an error. If there is proper exceptions (like ClientError, NetworkError, etc) defined in the kiwixstorage
package, it would be the best way of doing it. What else could be done?Even though I have asked these questions, I am happy that I can start working on --use-any-optimized-version
argument.
zimscraperlib
andkiwixstorage
have conflict regardingrequests
, so the fix needs to be onkiwixstorage
rather. I believe this will help build the docker image. Also, I have removedrequests
from therequirements.txt
file in recent commit
I think we need to change both. requests uses semantic versioning as well so I'll update kiwixstorage to accept >=2.23,<3.0
. We should also update here and the same version string would work fine for us as well.
You can put git master in requirements for kiwixstorage while you develop. We'll make a new release of kiwixstorage once we've fixed everything you need.
- Right now there are two options while recompressing, viz libvpx and h264 for video and libvorbis and aacfor for audio. So depending upon the
--low-quality
argument either of these combine together and will be used for encoding. Hence there can be two values forencoder_version
meta on each object. Is that correct? Or we save encoding information of audio and video information separately in meta?
No we don't need that granularity. That would be a nightmare to manage. Just a simple version string to match the code version that was used to encode the video.
- If encoder_version is simply strings like v1, v2, etc, who defines what v1 is, what v2 is, etc?
the developer making changes to the ffmpeg params. you will have to add v1
for current version and that will stay like that until we change ffmpeg params in a way that has significant impact on the video files.
- And in case we introduce new encodings like h999 instead of h264, then the
encoder_version
can have 3 different values?
Nope, as I said, if we change the codec, we'll just bump the version for all.
if there's a cache for this file, no matter the encoder-version, give it to me
means we look forvideo_format/*/videoId
where*
can either below
orhigh
, correct? Or just ignore what is in meta?
No, just don't filter using the meta but the key remains the same ({video_format}/{quality}/{videoId}
)
So it's very important. And we want details (exception trace). Out of scope for you but we may want to crash on such scenarios in the future.
I have added error log. But if needed, I can raise an error. If there is proper exceptions (like ClientError, NetworkError, etc) defined in thekiwixstorage
package, it would be the best way of doing it. What else could be done?
I think error log is good enough for now.
* If some video is found in cache, I don't think we need to recompress it as it already has been compressed and saved on S3 storage. Or there could be any other scenario?
No, if downloaded from cache, no need to recompress. that's the point. That's why it's important you upload the compressed version.
Even though I have asked these questions, I am happy that I can start working on
--use-any-optimized-version
argument.
Good. Hopefully answers above regarding the encoder version should allow you to.
Any suggestion what to do for this line? https://github.com/openzim/youtube/blob/58748bee394c61d90a432b1a44d91cd8f1e3def4/youtube2zim/scraper.py#L288
This won't be true if cache is found, but this line comes before checking the cache presence.
Any suggestion what to do for this line?
This won't be true if cache is found, but this line comes before checking the cache presence.
Change it to Quality: regular|low or something. That's mostly useful to know in the logs what was requested.
Forked the kiwixstorage lib and made dependency version fix here. Using this docker image was build using minor changes (see attachment)
. Tested my arguments, works pretty good.
docker run -v test_dir:/output kiwix/youtube:dev youtube2zim --type "channel" --id XXXXX --name XXX --api-key XXXX-XXXX --debug --no-zim --low-quality --optimization-cache "https://s3.eu-central-1.wasabisys.com/?keyId=XXXX&secretAccessKey=XXXX&bucketName=XXX" --use-any-optimized-version
I have made the PR ready for review.
Why did you fork it? I've fix the reqs in the lib when I commented and asked you to you its master and bump our requests to >=2.23,<3.0
. Can you please do that while I review the rest?
Found a bug. in post_process_video
# don't reencode if not requesting low-quality and received wanted format
if not low_quality and src_path.suffix[1:] == video_format:
return
dst_path = src_path.parent.joinpath(f"video.{video_format}")
if recompress:
recompress_video(src_path, dst_path, video_format, low_quality)
your if statement is unreachable if you don't use low-quality (and you most likely received proper format)… and so we upload uncompressed version to cache.
@rgaudin I do not understand how my change introduced the bug. My change was only if recompress
, it was returning anyways even without that. But I feel like you are saying if not low_quality and src_path.suffix[1:] == video_format and not recompress:
@rgaudin I do not understand how my change introduced the bug. My change was only
if recompress
, it was returning anyways even without that. But I feel like you are sayingif not low_quality and src_path.suffix[1:] == video_format and not recompress:
Ah… let me figure this out.
This is what I propose if I understand the problem correctly.
Well it's a bug that was there before. We used to recompress slightly in HQ videos but I do remember writing this shortcut (don't reencode if not LQ). I need to confirm wanted behavior with @kelson42. Nevermind my comment, I'll fix that later. Are you done with everything else?
@rgaudin Yes, but do you want me to send the change as shown in the screenshot attached above? Looks pretty safe to me. Please let me know.
No, don't.
Was planning to send with other commit. But done now.
Issue #69