python / cpython

The Python programming language
https://www.python.org/

urllib: urlretrieve() seems to ignore provided host header #96287

Open Evernow opened 1 year ago

Evernow commented 1 year ago

Bug report

I have had trouble confirming this, but urlretrieve seems to ignore a Host header supplied through opener.addheaders, while it does honor User-agent and Referer set the same way. I have two functions performing the same download, one with urlretrieve and one with requests. The requests version works as expected, and if I remove its Host header it fails in the same way the urlretrieve version does.

import ssl
import urllib.request

def download_helper(url, fname):
    opener = urllib.request.build_opener()
    # Opener-wide default headers; the Host entry is the one that appears to be ignored.
    opener.addheaders = [
        ('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0'),
        ('Referer', "https://www.amd.com/en/support/graphics/amd-radeon-6000-series/amd-radeon-6700-series/amd-radeon-rx-6700-xt"),
        ('Host', 'us.download.nvidia.com'),
    ]
    urllib.request.install_opener(opener)
    # Disable certificate verification globally via a private ssl hook.
    ssl._create_default_https_context = ssl._create_unverified_context
    urllib.request.urlretrieve(url, filename=fname)

import requests

def download_helper2(url, fname):
    my_referer = "https://www.amd.com/en/support/graphics/amd-radeon-6000-series/amd-radeon-6700-series/amd-radeon-rx-6700-xt"
    # Headers are passed per request, including the explicit Host value.
    resp = requests.get(url, verify=False, stream=True, headers={
        'Referer': my_referer,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0',
        'Host': 'us.download.nvidia.com',
    })
    # Stream the body to disk in 1 KiB chunks.
    with open(fname, 'wb') as file:
        for data in resp.iter_content(chunk_size=1024):
            file.write(data)

download_helper2('https://192.229.211.70/Windows/516.94/516.94-desktop-win10-win11-64bit-international-dch-whql.exe', r'516.94-desktop-win10-win11-64bit-international-dch-whql.exe')

tirkarthi commented 1 year ago

urlretrieve is a legacy interface that the docs do not recommend. As for the issue: the request object is constructed without a Host header, so the handler fills in Host with the host parsed from the URL. The loop that follows applies the opener's addheaders, but it skips any header the request already has, and since Host is set by then, the value you added in addheaders is dropped. You might want to construct your own Request object with the appropriate headers, as in the code below, which is adapted from this page: https://docs.python.org/3/howto/urllib2.html?highlight=urllib2#fetching-urls

https://docs.python.org/3/library/urllib.request.html?highlight=urlretrieve#legacy-interface

https://github.com/python/cpython/blob/837ce6460d6554e9d7a847951ca5108ac6926ee5/Lib/urllib/request.py#L1293-L1302

# https://docs.python.org/3/howto/urllib2.html?highlight=urllib2#fetching-urls
import shutil
import urllib.request

def download_helper2(url, fname):
    HOST = "us.download.nvidia.com"
    # Headers set on the Request itself are kept; opener.addheaders only fills in missing ones.
    headers = {
        "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0",
        "Referer": "https://www.amd.com/en/support/graphics/amd-radeon-6000-series/amd-radeon-6700-series/amd-radeon-rx-6700-xt",
        "Host": HOST,
        "test": "test",  # arbitrary extra header
    }

    request = urllib.request.Request(url=url, headers=headers)

    with urllib.request.urlopen(request) as response:
        with open(fname, "wb") as file_:
            shutil.copyfileobj(response, file_)

download_helper2("http://localhost:8000/test", "/tmp/test")
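
The header precedence described above can be checked without any network access. The following is a minimal sketch, not taken from the thread: the helper name effective_host, the address 192.0.2.1, and the host downloads.example.com are all hypothetical. It calls the handler's http_request pre-processing hook directly, which is the step where default headers are filled in, and reports the Host value that results. It uses a plain http URL; the https handler is expected to apply the same pre-processing.

```python
import urllib.request

# Hypothetical values for illustration only.
URL = "http://192.0.2.1/file.bin"
CUSTOM_HOST = "downloads.example.com"


def effective_host(url, opener_headers=(), request_headers=None):
    """Return the Host header urllib would send, without opening a connection."""
    handler = urllib.request.HTTPHandler()
    # build_opener registers the handler and makes the opener its parent.
    opener = urllib.request.build_opener(handler)
    opener.addheaders = list(opener_headers)
    request = urllib.request.Request(url, headers=request_headers or {})
    # http_request is the pre-processing step: it sets Host from the URL
    # if missing, then adds opener.addheaders entries not already present.
    prepared = handler.http_request(request)
    return prepared.get_header("Host")


# Host supplied via opener.addheaders: the URL's host wins.
print(effective_host(URL, opener_headers=[("Host", CUSTOM_HOST)]))

# Host supplied on the Request itself: the custom value is kept.
print(effective_host(URL, request_headers={"Host": CUSTOM_HOST}))
```

If the behavior matches the explanation, the first call prints the address from the URL and the second prints the custom host, which is why building a Request with its own headers works around the problem.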