openzim / youtube

Create a ZIM file from a YouTube channel/username/playlist
GNU General Public License v3.0

Use multithreaded YouTube downloader from zimscraperlib #301

Closed: dan-niles closed this 2 months ago

dan-niles commented 3 months ago

Close #118

Use newer multithreaded YouTube downloader from zimscraperlib to download videos, thumbnails and subtitles.
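
For context, here is a minimal sketch of how that downloader is driven, based on the `YoutubeDownloader` API in zimscraperlib's download module as I understand it; the yt-dlp option keys and the URL are illustrative, and the actual wiring in scraper.py may differ:

```python
# Minimal sketch (assumed zimscraperlib API): the downloader wraps a
# ThreadPoolExecutor, so several videos can be fetched concurrently.
from zimscraperlib.download import YoutubeDownloader

downloader = YoutubeDownloader(threads=4)
try:
    # wait=False returns a concurrent.futures.Future instead of blocking
    future = downloader.download(
        "https://www.youtube.com/watch?v=VIDEO_ID",  # illustrative URL
        options={
            "outtmpl": "videos/%(id)s/video.%(ext)s",  # yt-dlp output template
            "writethumbnail": True,   # also grab the thumbnail
            "writesubtitles": True,   # and the subtitles
        },
        wait=False,
    )
    future.result()  # re-raises any download error
finally:
    downloader.shutdown()  # stop the underlying thread pool
```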

Changes:

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 0% with 77 lines in your changes missing coverage. Please review.

Project coverage is 1.54%. Comparing base (eedd0b2) to head (69cb71c). Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| scraper/src/youtube2zim/scraper.py | 0.00% | 77 Missing ⚠️ |
Additional details and impacted files

```diff
@@           Coverage Diff            @@
##            main     #301     +/-   ##
========================================
+ Coverage   1.52%    1.54%   +0.01%
========================================
  Files         11       11
  Lines       1112     1102      -10
  Branches     166      158       -8
========================================
  Hits          17       17
+ Misses      1095     1085      -10
```

☂️ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

rgaudin commented 3 months ago

Changed reviewer, as @benoit74 is coming back.

dan-niles commented 2 months ago

@benoit74 I've updated the code to handle errors correctly. It now properly returns both the succeeded and the failed video_ids. I've also listed below the time taken to create a ZIM of the openZIM_testing channel with different concurrency values (without the S3 cache).

| Concurrency | Time (s) |
| --- | --- |
| 1 | 40.16 |
| 2 | 25.29 |
| 4 | 21.11 |

However, I realized that most ZIMs on the Zimfarm are built using the S3 cache. With this approach, only the videos that are missing from the S3 cache, and are therefore downloaded directly with yt-dlp, benefit from the parallelization. This is because I'm using the ThreadPoolExecutor in YoutubeDownloader from python-scraperlib, rather than downloading in batches as suggested in https://github.com/openzim/youtube/issues/118#issue-700516092. Therefore, this change only provides a performance gain when videos are missing from the S3 cache.
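
To illustrate the bookkeeping described above, here is a rough sketch of submitting one yt-dlp download per video and partitioning the results; `downloader`, `video_ids`, and `yt_dlp_options` are assumed names, not code taken from the PR:

```python
# Rough sketch (assumed names): submit every video to the downloader's
# thread pool, then split the IDs into succeeded and failed.
from concurrent.futures import wait as wait_futures

futures = {
    downloader.download(
        f"https://www.youtube.com/watch?v={video_id}",
        yt_dlp_options,  # assumed to be built elsewhere
        wait=False,
    ): video_id
    for video_id in video_ids
}
wait_futures(futures.keys())  # block until every download settles

succeeded, failed = [], []
for future, video_id in futures.items():
    if future.exception() is None:
        succeeded.append(video_id)
    else:
        failed.append(video_id)
```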

Given that most downloads rely on the S3 cache, would it be better to keep the current approach of downloading in batches and using the YoutubeDownloader from scraperlib without concurrent execution?

WDYT?

benoit74 commented 2 months ago

> Given that most downloads rely on the S3 cache, would it be better to keep the current approach of downloading in batches and using the YoutubeDownloader from scraperlib without concurrent execution?

This is indeed a very important point! The current implementation allows concurrent downloading of both S3-cached videos and YouTube videos, and concurrent re-encoding as well.

I'm sorry not to have noticed this before. I feel a bit ashamed to have wasted your time on this. Thank you for speaking up; it would have been a bad move indeed!

I propose we close the issue and the PR as "not relevant", WDYT? Or, at the least, we could enhance scraperlib, but that is not a small change to implement.
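
For illustration only, a purely hypothetical sketch of what such a scraperlib enhancement could look like: run the whole per-video pipeline (S3 cache lookup, yt-dlp fallback, re-encode, cache upload) as a single task, so every stage shares the thread pool. None of these helper functions exist in scraperlib today:

```python
# Hypothetical sketch: every helper below is invented for illustration,
# and video_ids is assumed to be the list of IDs to process.
from concurrent.futures import ThreadPoolExecutor

def process_video(video_id: str) -> str:
    if not fetch_from_s3_cache(video_id):   # hypothetical: try the cache first
        download_with_yt_dlp(video_id)      # hypothetical: fall back to YouTube
        reencode(video_id)                  # hypothetical: convert for the ZIM
        upload_to_s3_cache(video_id)        # hypothetical: repopulate the cache
    return video_id

# Each worker runs the full pipeline, so S3 fetches, yt-dlp downloads,
# and re-encodes all overlap (error handling omitted for brevity).
with ThreadPoolExecutor(max_workers=4) as executor:
    done = list(executor.map(process_video, video_ids))
```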

dan-niles commented 2 months ago

No worries at all! I should have considered this before moving forward with the implementation. I'm fine with closing this issue as "not relevant". However, I agree that using the YouTube downloader from scraperlib prevents us from achieving concurrency across S3 downloads and video re-encoding.