mjishnu / pypdl

A concurrent pure python downloader with resume capablities
https://pypi.org/project/pypdl/
MIT License
44 stars 8 forks source link

bug: cannot perform multiple multi-segment downloads from certain servers if custom headers are passed to Pypdl() #23

Closed Speyedr closed 1 month ago

Speyedr commented 1 month ago

Code (bug_replication.py):

from pypdl import Pypdl
from time import sleep

download_list = [
    "https://safebooru.org//images/4619/3eb0ebb8b3a93515fa070f6be303527c48ffeed1.jpg",
    "https://safebooru.org//images/4619/55cdaa511197791cc818d0e9388e9f93afdd4c0d.jpg"
]

dl = Pypdl(allow_reuse=True, headers={
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0"
})

for download in download_list:
    dl.start(download, block=True, segments=2)

    while dl.wait:
        print(f"Waiting for download from {download}...")
        sleep(0.2)

    while not dl.completed:
        print(f"Downloading from {download}...")
        sleep(0.2)

The output gets stuck at Downloading from https://safebooru.org//images/4619/55cdaa511197791cc818d0e9388e9f93afdd4c0d.jpg.... I set block=True to demonstrate that the downloader broke at the second download. The first completes normally.

The default error log message is lacking, however when using extensions from PR #22 , the message logged is:

(Pypdl)  21-07-24 15:21:29 - ERROR: (AttributeError) ['NoneType' object has no attribute 'get']
Traceback (most recent call last):
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 103, in download
    result = self._execute(
             ^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 175, in _execute
    file_path, multisegment, etag = self._get_info(
                                    ^^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 230, in _get_info
    file_path = get_filepath(url, header, file_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\utls.py", line 32, in get_filepath
    content_disposition = headers.get("Content-Disposition", None)
                          ^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get'

Further enhancing the debug log messages demonstrates what is being communicated between the client and server (from this commit):

(Pypdl)  19-07-24 16:08:00 - DEBUG: Ending HEAD request for https://safebooru.org//images/4619/3eb0ebb8b3a93515fa070f6be303527c48ffeed1.jpg. I sent: <CIMultiDict('User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0')>
(Pypdl)  19-07-24 16:08:00 - DEBUG: Sent headers: <CIMultiDictProxy('Host': 'safebooru.org', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate')>
(Pypdl)  19-07-24 16:08:00 - DEBUG: HEAD Response: 200
HEAD Response headers: <CIMultiDictProxy('Date': 'Fri, 19 Jul 2024 06:08:00 GMT', 'Content-Type': 'image/jpeg', 'Content-Length': '938120', 'Connection': 'keep-alive', 'Last-Modified': 'Thu, 18 Jul 2024 22:00:45 GMT', 'Etag': '"6699908d-e5088"', 'Expires': 'Thu, 31 Dec 2037 23:55:55 GMT', 'Cache-Control': 'max-age=315360000', 'CF-Cache-Status': 'HIT', 'Age': '183', 'Accept-Ranges': 'bytes', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=h%2F3QADyeTCi1dVNhs0%2F4TPhhT5NZKhlVllvXaB5zCNTkI%2BLRLLQKGVxxuQdpEC%2FnSYwgg0J6abigspb1oNuFI6pYejsQ1MHM5v09NFwA7SCJRJ1hv%2By1alLwx6DmfyjP"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '8a5888d11e8c8658-PER', 'alt-svc': 'h3=":443"; ma=86400')>

[...]

(Pypdl)  19-07-24 16:08:00 - DEBUG: Ending HEAD request for https://safebooru.org//images/4619/55cdaa511197791cc818d0e9388e9f93afdd4c0d.jpg. I sent: <CIMultiDict('User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0', 'range': 'bytes=469060-938119')>
(Pypdl)  19-07-24 16:08:00 - DEBUG: Sent headers: <CIMultiDictProxy('Host': 'safebooru.org', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0', 'range': 'bytes=469060-938119', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate')>
(Pypdl)  19-07-24 16:08:00 - DEBUG: HEAD Response: 206
HEAD Response headers: <CIMultiDictProxy('Date': 'Fri, 19 Jul 2024 06:08:00 GMT', 'Content-Type': 'image/jpeg', 'Content-Length': '57023', 'Connection': 'keep-alive', 'Last-Modified': 'Fri, 19 Jul 2024 05:00:30 GMT', 'Etag': '"6699f2ee-80703"', 'Expires': 'Thu, 31 Dec 2037 23:55:55 GMT', 'Cache-Control': 'max-age=315360000', 'CF-Cache-Status': 'HIT', 'Age': '138', 'Content-Range': 'bytes 469060-526082/526083', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=Ni4tWaMsTBUqToLDJdszhPDFANRd%2F907Pm50FRjsaH%2BqxDyKcbQuCWgQekoNzgAsCGx7h04VgBC9jR48WMt%2B0V9lT9lNM1jwYCapWOc04%2BDyFHTmk4rY4bBdP%2FxTKWvy"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '8a5888d5afe68643-PER', 'alt-svc': 'h3=":443"; ma=86400')>

Looking closely, we can see that on the second request, the client includes 'range': 'bytes=469060-938119' in the request header. This should not exist and is a remnant of a previous GET request header when downloading the first file. This additional range header is why the server responds with 206 instead of 200, which agrees with MDN HTTP docs.

Looking closely at the conditions for Pypdl._get_header(), we see that it has no handling for 206 responses, which is why the function returns None and later generates an exception. If the server had simply ignored the range header and returned status 200, the client would have continued as normal, which is likely what many servers end up doing. Additionally, aiohttp docs state that raise_for_status=True will only raise an exception if the response is greater than 400, which explains why the 206 response fell through.

https://github.com/mjishnu/pypdl/blob/12a20ad6285fb0f4fddcb2ff81742b646c142c6a/pypdl/pypdl_manager.py#L247-L257

mjishnu commented 1 month ago

great insight, thanks for reporting and fixing the bug.