mjishnu / pypdl

A concurrent pure Python downloader with resume capabilities
https://pypi.org/project/pypdl/
MIT License

pypdl

pypdl is a Python library for downloading files from the internet. It provides features such as multi-segmented downloads, automatic retries on failure, the option to continue a download from a different URL if necessary, progress tracking, pause/resume support, checksum validation, and more.


Installation

To install pypdl, run the following command:

pip install pypdl

Usage

Basic Usage

To download a file with pypdl, simply create a new Pypdl object and call its start method, passing in the URL of the file to be downloaded:

from pypdl import Pypdl

dl = Pypdl()
dl.start('http://example.com/file.txt')

Advanced Usage

The Pypdl object provides additional options for advanced usage:

from pypdl import Pypdl

dl = Pypdl(allow_reuse=False, logger=default_logger("Pypdl"))
dl.start(
    url='http://example.com/file.txt',
    file_path='file.txt',
    segments=10,
    display=True,
    multisegment=True,
    block=True,
    retries=0,
    mirror_func=None,
    etag=True,
    overwrite=False
)

Each option is explained in the API Reference below.

Examples

Here is an example that demonstrates how to use the pypdl library to download a file with custom headers, a proxy, and a timeout:

import aiohttp
from pypdl import Pypdl

def main():
    # Using headers
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
    # Using proxy
    proxy = "http://user:pass@some.proxy.com"
    # Using timeout
    timeout = aiohttp.ClientTimeout(sock_read=20)

    # create a new pypdl object
    dl = Pypdl(headers=headers, proxy=proxy, timeout=timeout)

    # start the download
    dl.start(
        url='https://speed.hetzner.de/100MB.bin',
        file_path='100MB.bin',
        segments=10,
        display=True,
        multisegment=True,
        block=True,
        retries=3,
        mirror_func=None,
        etag=True,
    )

if __name__ == '__main__':
    main()

This example downloads a file using 10 segments and displays the download progress. If the download fails, it will retry up to 3 times. It also sets headers, a proxy, and a timeout; for more information on these parameters, refer to the API Reference.

Another example implements pause/resume functionality, prints the progress to the console, and changes the log level to debug:

from pypdl import Pypdl

# create a pypdl object
dl = Pypdl()

# changing log level to debug
dl.logger.setLevel('DEBUG')

# start the download process
# block=False so we can print the progress
# display=False so we can print the progress ourselves
dl.start('https://example.com/file.zip', segments=8, block=False, display=False)

# print the progress
while dl.progress < 70:
  print(dl.progress)

# stop the download process
dl.stop() 

#do something
#...

# resume the download process
dl.start('https://example.com/file.zip', segments=8, block=False, display=False)

# print rest of the progress
while not dl.completed:
  print(dl.progress)

In this example, we start the download and print the progress to the console. We then stop the download and do something else. After that, we resume the download and print the remaining progress. This pattern can be used to build pause/resume functionality.
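Note that the bare `while` loops above poll `dl.progress` as fast as the CPU allows. A short sleep between polls keeps usage down. Here is a minimal sketch of a reusable polling helper; `DummyDownload` is a hypothetical stand-in for a Pypdl instance (it only mimics the `progress` and `completed` attributes, advancing by 25% per read, so the helper can be demonstrated without a network):

```python
import time

def poll_progress(dl, interval=0.5):
    """Collect progress readings from any object exposing `progress`
    and `completed`, sleeping between polls instead of busy-waiting."""
    readings = []
    while not dl.completed:
        readings.append(dl.progress)
        time.sleep(interval)
    return readings

# Hypothetical stand-in for a Pypdl instance, used only to demo the helper.
class DummyDownload:
    def __init__(self):
        self._pct = 0

    @property
    def progress(self):
        # Advance by 25% each time progress is read.
        self._pct = min(self._pct + 25, 100)
        return self._pct

    @property
    def completed(self):
        return self._pct >= 100

print(poll_progress(DummyDownload(), interval=0))  # [25, 50, 75, 100]
```

With a real Pypdl object the same helper would be called with `interval` around 0.1 to 1 seconds.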

Another example uses hash validation with a dynamically generated URL:

from pypdl import Pypdl

# Generate the url dynamically
def dynamic_url():
    return 'https://example.com/file.zip'

# create a pypdl object
dl = Pypdl()

# if block = True --> returns a FileValidator object
file = dl.start(dynamic_url, block=True) 

# validate hash
if file.validate_hash(correct_hash, 'sha256'):
    print('Hash is valid')
else:
    print('Hash is invalid')

# scenario where block = False --> returns an AutoShutdownFuture object
file = dl.start(dynamic_url, block=False)

# do something
# ...

# validate hash
if dl.completed:
  if file.result().validate_hash(correct_hash, 'sha256'):
      print('Hash is valid')
  else:
      print('Hash is invalid')
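The examples above assume `correct_hash` is already known. If you need to compute a reference digest yourself, the standard library's hashlib can produce one; a small sketch that reads the file in chunks so large downloads don't need to fit in memory:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. correct_hash = sha256_of('file.zip')
```

The resulting hex string can be passed directly as `correct_hash` to `validate_hash`.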

An example of using a Pypdl object to get the size of multiple files, with allow_reuse set to True and a custom logger:

import logging
import time
from pypdl import Pypdl

urls = [
    'https://example.com/file1.zip',
    'https://example.com/file2.zip',
    'https://example.com/file3.zip',
    'https://example.com/file4.zip',
    'https://example.com/file5.zip',
]

# create a custom logger
logger = logging.getLogger('custom')

size = []

# create a pypdl object
dl = Pypdl(allow_reuse=True, logger=logger)

for url in urls:
    dl.start(url, block=False)

    # wait for the size and other preliminary data to be retrieved
    while dl.wait:
        time.sleep(0.1)

    # get the size of the file and add it to size list
    size.append(dl.size)

    # do something 

    while not dl.completed:
        print(dl.progress)

print(size)
# shutdown the downloader, this is essential when allow_reuse is enabled
dl.shutdown()

An example of using PypdlFactory to download multiple files concurrently:

from pypdl import PypdlFactory

proxy = "http://user:pass@some.proxy.com"

# create a PypdlFactory object
factory = PypdlFactory(instances=5, allow_reuse=True, proxy=proxy)

# List of tasks to be downloaded. Each task is a tuple of (URL, {Pypdl arguments}).
# - URL: The download link (string).
# - {Pypdl arguments}: A dictionary of arguments supported by `Pypdl`.
tasks = [
    ('https://example.com/file1.zip', {'file_path': 'file1.zip'}),
    ('https://example.com/file2.zip', {'file_path': 'file2.zip'}),
    ('https://example.com/file3.zip', {'file_path': 'file3.zip'}),
    ('https://example.com/file4.zip', {'file_path': 'file4.zip'}),
    ('https://example.com/file5.zip', {'file_path': 'file5.zip'}),
]

# start the download process
results = factory.start(tasks, display=True, block=False)

# do something
# ...

# stop the download process
factory.stop()

# do something
# ...

# restart the download process
results = factory.start(tasks, display=True, block=True)

# print the results
for url, result in results:
    # validate hash
    if result.validate_hash(correct_hash, 'sha256'):
        print(f'{url} - Hash is valid')
    else:
        print(f'{url} - Hash is invalid')

task2 = [
    ('https://example.com/file6.zip', {'file_path': 'file6.zip'}),
    ('https://example.com/file7.zip', {'file_path': 'file7.zip'}),
    ('https://example.com/file8.zip', {'file_path': 'file8.zip'}),
    ('https://example.com/file9.zip', {'file_path': 'file9.zip'}),
    ('https://example.com/file10.zip', {'file_path': 'file10.zip'}),
]

# start the download process
factory.start(task2, display=True, block=True)

# shutdown the downloader, this is essential when allow_reuse is enabled
factory.shutdown()
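Task lists like the ones above can be generated rather than written out by hand. A small sketch that builds the (URL, {Pypdl arguments}) tuples, naming each output file after the last path component of its URL (the URLs here are placeholders, as in the example above):

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

def make_tasks(urls):
    """Build (url, {Pypdl kwargs}) task tuples, deriving each file_path
    from the last path component of the URL."""
    return [
        (url, {"file_path": PurePosixPath(urlparse(url).path).name})
        for url in urls
    ]

tasks = make_tasks([f"https://example.com/file{i}.zip" for i in range(1, 6)])
# tasks[0] == ('https://example.com/file1.zip', {'file_path': 'file1.zip'})
```

The resulting list can be passed straight to `factory.start(tasks, ...)`.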

For more detailed information about the parameters, refer to the API Reference.

API Reference

Pypdl()

The Pypdl class represents a file downloader that downloads a file from a given URL to a specified file path. It supports both single-segmented and multi-segmented downloads, along with features such as retrying the download in case of failure, the option to continue downloading from a different URL if necessary, pause/resume functionality, and progress tracking.

Arguments

Attributes

Methods

PypdlFactory()

The PypdlFactory class manages multiple instances of the Pypdl downloader. It allows for concurrent downloads and provides progress tracking across all active downloads.

Arguments

Attributes

Methods

Helper Classes

Basicdown()

The Basicdown class is the base downloader class that provides the basic structure for downloading files.

Attributes
Methods

Singledown()

The Singledown class extends Basicdown and is responsible for downloading a whole file in a single segment.

Methods

Multidown()

The Multidown class extends Basicdown and is responsible for downloading a specific segment of a file.

Methods

FileValidator()

The FileValidator class is used to validate the integrity of the downloaded file.

Parameters
Methods

AutoShutdownFuture()

The AutoShutdownFuture class is a wrapper around a concurrent.futures.Future object that shuts down a list of associated executors when the result is retrieved.

Parameters
Methods
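The idea behind AutoShutdownFuture can be sketched with the standard library alone. This is an illustration of the pattern, not pypdl's actual implementation: a thin wrapper whose `result()` releases the executors once the value has been retrieved.

```python
from concurrent.futures import ThreadPoolExecutor

class AutoShutdownFutureSketch:
    """Wrap a future plus its executors; shut the executors down
    once the result has been retrieved (illustration only)."""

    def __init__(self, future, executors):
        self._future = future
        self._executors = executors

    def result(self, timeout=None):
        try:
            return self._future.result(timeout)
        finally:
            # Release the worker threads whether or not the future succeeded.
            for ex in self._executors:
                ex.shutdown(wait=False)

pool = ThreadPoolExecutor(max_workers=1)
fut = AutoShutdownFutureSketch(pool.submit(lambda: 21 * 2), [pool])
print(fut.result())  # 42; the pool is shut down afterwards
```

After `result()` returns, submitting new work to the wrapped executor raises RuntimeError, which is the point: the caller cannot leak a live thread pool by forgetting to shut it down.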

License

pypdl is licensed under the MIT License. See the LICENSE file for more details.

Contribution

Contributions to pypdl are always welcome. If you want to contribute to this project, please fork the repository and submit a pull request.

Contact

If you have any questions, issues, or feedback about pypdl, please open an issue on the GitHub repository.