openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Add YoutubeDownloader class in download.py #59

Closed satyamtg closed 3 years ago

satyamtg commented 3 years ago

This adds a YoutubeDownloader class in download.py that downloads YouTube videos on a fixed number of threads. Its download() method behaves like a normal blocking function, so scrapers can parallelize calls to it. However, the downloads run on the class's own executor, so the number of workers actually downloading videos is capped at the limit set during initialization.
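To illustrate the idea of a blocking download() backed by a private executor, here is a minimal sketch. It is not the code from this PR: `BoundedDownloader`, `fake_fetch` and `run_demo` are hypothetical names, and the fetch function only sleeps instead of downloading anything. It shows that even with more outer workers than inner threads, the number of simultaneous fetches stays at or below the inner limit.

```python
import concurrent.futures
import threading
import time


class BoundedDownloader:
    """Hypothetical stand-in for YoutubeDownloader, not the PR's implementation."""

    def __init__(self, threads=2):
        # Private executor: its size caps how many fetches run at once.
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=threads)

    def download(self, fetch, url, target):
        # Submit to the private executor and block until the fetch finishes,
        # so callers see an ordinary synchronous function.
        return self._executor.submit(fetch, url, target).result()

    def shutdown(self):
        self._executor.shutdown(wait=True)


def run_demo(outer_workers=4, inner_threads=2, jobs=8):
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_fetch(url, target):
        # Assumption: sleeping stands in for the real network download.
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.02)
        with lock:
            state["active"] -= 1
        return target

    downloader = BoundedDownloader(threads=inner_threads)
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=outer_workers) as outer:
            futures = [
                outer.submit(downloader.download, fake_fetch, f"url-{i}", f"file-{i}")
                for i in range(jobs)
            ]
            results = [future.result() for future in futures]
    finally:
        downloader.shutdown()
    return results, state["peak"]


if __name__ == "__main__":
    results, peak = run_demo()
    print(f"{len(results)} jobs finished, peak concurrent fetches: {peak}")
```

The outer pool decides how many scraper tasks (cache check, conversion and so on) run at once, while the inner pool alone decides how many downloads run at once.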

The YoutubeDownloader has the following members -

The YoutubeDownloader has the following methods -

We can run the download() method in parallel on any number of threads, but the videos will still download on a fixed number of threads. It can be used like this (there is also a similar test):

    import concurrent.futures

    from zimscraperlib.download import YoutubeDownloader

    def download_video_and_convert(url, video_path, yt_downloader):
        # check if in the cache
        # download if not in cache (blocks until the download finishes)
        downloaded_file = yt_downloader.download(url, video_path)
        # convert if necessary
        return downloaded_file

    # at most 2 videos are downloaded at the same time
    yt_downloader = YoutubeDownloader(threads=2)
    # placeholder: a list of (url, video_path) tuples
    videos_list = [(url, video_path), (url, video_path) ... ]
    # 4 scraper workers share the single downloader
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        fs = [
            executor.submit(download_video_and_convert, url, video_path, yt_downloader)
            for url, video_path in videos_list
        ]
        done, not_done = concurrent.futures.wait(
            fs, return_when=concurrent.futures.ALL_COMPLETED
        )
    yt_downloader.shutdown()
    # continue scraper work
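One thing to keep in mind with this pattern: concurrent.futures.wait() does not raise the exceptions of failed tasks; they only surface when result() or exception() is called on each future. A minimal sketch of collecting them follows. `collect_outcomes`, `flaky_job` and `run_example` are hypothetical helpers for illustration, with `flaky_job` standing in for download_video_and_convert.

```python
import concurrent.futures


def collect_outcomes(futures):
    """Wait for all futures, then split them into results and exceptions."""
    done, not_done = concurrent.futures.wait(
        futures, return_when=concurrent.futures.ALL_COMPLETED
    )
    results, errors = [], []
    for future in done:
        exc = future.exception()  # returns None if the task succeeded
        if exc is None:
            results.append(future.result())
        else:
            errors.append(exc)
    return results, errors, not_done


def flaky_job(n):
    # Assumption: odd inputs fail, to simulate a download that raises.
    if n % 2:
        raise ValueError(f"job {n} failed")
    return n


def run_example():
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        fs = [executor.submit(flaky_job, n) for n in range(4)]
        return collect_outcomes(fs)


if __name__ == "__main__":
    results, errors, not_done = run_example()
    print(sorted(results), [str(e) for e in errors], len(not_done))
```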

codecov[bot] commented 3 years ago

Codecov Report

Merging #59 into master will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #59   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           24        24           
  Lines          930       974   +44     
=========================================
+ Hits           930       974   +44     
Impacted Files Coverage Δ
src/zimscraperlib/download.py 100.00% <100.00%> (ø)

satyamtg commented 3 years ago

I have made those changes. I think it looks better now.

rgaudin commented 3 years ago

Thanks. As discussed before, don't rewrite commits until they have been reviewed; otherwise we can't use the partial diff.