Closed — dan-niles closed this 2 months ago
Attention: Patch coverage is 0% with 77 lines in your changes missing coverage. Please review.

Project coverage is 1.54%. Comparing base (eedd0b2) to head (69cb71c). Report is 3 commits behind head on main.
Files with missing lines | Patch % | Lines |
---|---|---|
scraper/src/youtube2zim/scraper.py | 0.00% | 77 Missing :warning: |
Changed reviewer as @benoit74 is coming back
@benoit74 I've updated the code to handle errors correctly. It now properly returns both the succeeded and failed `video_ids`. I've also measured the time taken to create a ZIM of the openZIM_testing channel with different values for concurrency (without using the S3 cache); see the table below.
Concurrency | Time (s) |
---|---|
1 | 40.16 |
2 | 25.29 |
4 | 21.11 |
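The error-handling pattern described above (running downloads concurrently and collecting both succeeded and failed `video_ids`) can be sketched with a plain `ThreadPoolExecutor`. This is a minimal illustration, not the PR's actual code; the `download_video` helper here is hypothetical, standing in for the scraper's per-video download logic:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_video(video_id: str) -> None:
    """Hypothetical per-video download; raises on failure."""
    if video_id.startswith("bad"):
        raise RuntimeError(f"download failed for {video_id}")


def download_all(video_ids, concurrency=4):
    """Download videos concurrently; return (succeeded, failed) id lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        # submit one task per video, remembering which future maps to which id
        futures = {executor.submit(download_video, vid): vid for vid in video_ids}
        for future in as_completed(futures):
            vid = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(vid)
            except Exception:
                failed.append(vid)
    return succeeded, failed


succeeded, failed = download_all(["a1", "bad2", "c3"], concurrency=2)
```

Tracking failures per future (rather than letting one exception abort the batch) is what lets the scraper report exactly which `video_ids` need retrying.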
However, I realized that most ZIMs in zimfarm are downloaded using the S3 cache. With this approach, only the videos that are not in the S3 cache and are downloaded directly using `yt-dlp` will benefit from the parallelization. This is because I'm using the `ThreadPoolExecutor` in `YoutubeDownloader` from python-scraperlib, rather than downloading in batches as suggested in https://github.com/openzim/youtube/issues/118#issue-700516092. Therefore, this change will only provide a performance gain in cases where videos are missing from the S3 cache.
Given that most downloads rely on the S3 cache, would it be better to keep the current approach of downloading in batches and using the `YoutubeDownloader` from scraperlib without concurrent execution? WDYT?
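To illustrate why the speedup is limited to cache misses: cache hits are served on the fast sequential path and never reach the thread pool, so with a warm S3 cache the executor sits mostly idle. Below is a minimal sketch of that flow under assumed helpers; `in_s3_cache`, `fetch_from_s3`, and `fetch_with_ytdlp` are hypothetical stand-ins, not real scraper or scraperlib APIs:

```python
from concurrent.futures import ThreadPoolExecutor

S3_CACHE = {"a1", "b2"}  # pretend these videos are already cached


def in_s3_cache(video_id):
    return video_id in S3_CACHE


def fetch_from_s3(video_id):
    return f"s3:{video_id}"  # fast path, served sequentially


def fetch_with_ytdlp(video_id):
    return f"ytdlp:{video_id}"  # slow path, worth parallelizing


def download(video_ids, concurrency=4):
    results = {}
    misses = []
    for vid in video_ids:
        if in_s3_cache(vid):
            results[vid] = fetch_from_s3(vid)  # cache hits skip the pool
        else:
            misses.append(vid)
    # only the cache misses benefit from the ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        for vid, res in zip(misses, executor.map(fetch_with_ytdlp, misses)):
            results[vid] = res
    return results


results = download(["a1", "b2", "c3"])
```

With two of three videos cached, only one task ever reaches the pool, which matches the observation that the concurrency gain shrinks as the cache hit rate grows.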
> Given that most downloads rely on the S3 cache, would it be better to keep the current approach of downloading in batches and using the YoutubeDownloader from scraperlib without concurrent execution?
This is indeed a very important point! The current implementation allows concurrent download of both S3-cached videos and YouTube videos, and also concurrent re-encoding.
I'm sorry not to have noticed this before. I feel a bit ashamed to have wasted your time on this. Thank you for speaking up; it would have been a bad move indeed!
I propose we close the issue and the PR as "non-relevant", WDYT? Alternatively we could enhance the scraperlib, but that is not a small change to implement.
No worries at all! I should have considered this earlier, before moving forward with the implementation. I'm fine with closing this issue as "non-relevant." However, I agree that using the YouTube downloader from scraperlib restricts us from achieving concurrency with S3 downloads and video re-encoding.
Close #118

Use the newer multithreaded YouTube downloader from `zimscraperlib` to download videos, thumbnails and subtitles.

Changes:
- Removed the `download_video_files_batch` method.
- Updated the `download_video`, `download_thumbnail` and `download_subtitle` methods to use the new multithreaded downloader from scraperlib.