rpm-software-management / urlgrabber

GNU Lesser General Public License v2.1
14 stars 23 forks source link

Should `urlparser.parse` always return `bytes`? #33

Open agraul opened 2 years ago

agraul commented 2 years ago

Hi, I think the current behavior of urlparser.parse could be improved. Using Python 3, right now it always returns bytes for the url and all parts, ignoring if the input is a str or bytes. That's especially surprising because one consequence is that find_proxies does not work on parts[0] (which is the scheme). Below is an isolated reproducer:

from urlgrabber.grabber import URLGrabberOptions

def str_in_bytes_out(fixed=False):
    opts = URLGrabberOptions()
    url_in = "https://github.com"
    url_out, _ = opts.urlparser.parse(url_in, opts)

    if fixed:
        assert url_in == url_out.decode('utf-8')
    else:
        assert url_in == url_out

def parsed_url_parts_fail_with_find_proxy(fixed=False):
    opts = URLGrabberOptions(proxies={'http': "http://proxy.example", 'https': "https://proxy.example"})
    url_in = "https://github.com"
    url_out, parts = opts.urlparser.parse(url_in, opts)
    scheme, *_ = parts
    opts.find_proxy(url_out, scheme)

    if fixed:
        opts.find_proxy(url_out, scheme.decode('utf-8'))
    assert opts.proxy
meaksh commented 2 years ago

AFAICS, the urlparser.parse function seems currently expected to always return bytes, as we always take the input url and pass it _to_utf8 to ensure we have bytes and not a string.

So, when parsing, we expect schema and parts to be bytes. I see the rest of the functions are taking this into account, but it seems there is a bug in the logic for find_proxy which is not doing the proper comparison, making it to fail assigning the right proxy.

See https://github.com/rpm-software-management/urlgrabber/pull/35/commits/6b08b5267a91da0e82a6593212234bfae3310f39