mjishnu / pypdl

A concurrent pure python downloader with resume capablities
https://pypi.org/project/pypdl/
MIT License
44 stars 8 forks source link

Prevent sticky Pypdl._kwargs['headers']['range'] reference bug when reusing Pypdl object with user-supplied headers and multiple multi-segment downloads #24

Closed Speyedr closed 1 month ago

Speyedr commented 1 month ago

A simple workaround for a bug with a complicated chain of events, due an oversight in the usage of the Pypdl._kwargs attribute.

This PR fixes #23.

When creating a Pypdl object, one has the ability to pass in arbitrary headers which will then be forwarded later to an aiohttp.ClientSession() object. These headers are collected within **kwargs and then stored in the self._kwargs attribute:

https://github.com/mjishnu/pypdl/blob/12a20ad6285fb0f4fddcb2ff81742b646c142c6a/pypdl/pypdl_manager.py#L44-L48

This dictionary is passed to multiple objects during normal procedure, such as during the Pypdl._get_header() method...

https://github.com/mjishnu/pypdl/blob/12a20ad6285fb0f4fddcb2ff81742b646c142c6a/pypdl/pypdl_manager.py#L247-L252

...and the Pypdl._multi_segment() method:

https://github.com/mjishnu/pypdl/blob/12a20ad6285fb0f4fddcb2ff81742b646c142c6a/pypdl/pypdl_manager.py#L259-L270

In order to request ranges of bytes to perform multi-segment downloads, Multidown.worker() collects the user-supplied arguments in **kwargs and then later adds a range header, as this needs to be passed later to another aiohttp.ClientSession() object:

https://github.com/mjishnu/pypdl/blob/12a20ad6285fb0f4fddcb2ff81742b646c142c6a/pypdl/downloader.py#L46-L48 Line 65 updates the kwargs['headers'] dictionary. https://github.com/mjishnu/pypdl/blob/12a20ad6285fb0f4fddcb2ff81742b646c142c6a/pypdl/downloader.py#L63-L66

This has an unintended side-effect that:

https://github.com/mjishnu/pypdl/blob/12a20ad6285fb0f4fddcb2ff81742b646c142c6a/pypdl/pypdl_manager.py#L228-L230

https://github.com/mjishnu/pypdl/blob/12a20ad6285fb0f4fddcb2ff81742b646c142c6a/pypdl/utls.py#L31-L32

Stacktrace:

(Pypdl)  21-07-24 15:21:29 - ERROR: (AttributeError) ['NoneType' object has no attribute 'get']
Traceback (most recent call last):
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 103, in download
    result = self._execute(
             ^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 175, in _execute
    file_path, multisegment, etag = self._get_info(
                                    ^^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 230, in _get_info
    file_path = get_filepath(url, header, file_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\utls.py", line 32, in get_filepath
    content_disposition = headers.get("Content-Disposition", None)
                          ^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get'

This bug has likely gone unsighted because it requires:

This PR includes a simple addition to the Multidown.worker() method that creates a shallow copy of kwargs['headers'] if it already exists, which disconnects it from the original reference and allows the range header to be added without affecting Pypdl._kwargs['header'].

This fix is faster than alternative patches which may require the use of deepcopy to properly disconnect or reset the Pypdl._kwargs attribute, however as a consequence this PR does not prevent the same issue from occurring again (i.e. unintentionally modifying attributes of Pypdl that have been passed by reference to callee functions).

Speyedr commented 1 month ago

Fixed anti-pattern issue, might want to squash commits.