secynic / ipwhois

Retrieve and parse whois data for IPv4 and IPv6 addresses
https://ipwhois.readthedocs.io/en/latest
BSD 2-Clause "Simplified" License
556 stars 121 forks

When using proxy #153

Open ciokan opened 7 years ago

ciokan commented 7 years ago

I see sockets still being opened directly (bypassing the proxy) even when a proxy opener is provided and obj.lookup_rdap is used. Is this safe when making a lot of requests?

secynic commented 7 years ago

It only proxies the HTTP requests (RDAP lookups). You are probably seeing the sockets being used for ASN lookups via DNS. The ASN lookups are generally very fast, so I wouldn't worry about the overhead on those.

As a test, try setting bootstrap=True, and see if those non-proxied sockets disappear.
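A minimal sketch of that test, assuming Python 3's urllib and a placeholder proxy URL (the address below is hypothetical). It builds the opener, passes it as proxy_opener, and calls lookup_rdap with bootstrap=True, which should skip the ASN lookup step:

```python
from urllib.request import ProxyHandler, build_opener


def make_proxy_opener(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return build_opener(handler)


def rdap_via_proxy(address, proxy_url, bootstrap=True):
    """Run an RDAP lookup for address with HTTP traffic routed via proxy_url."""
    # Imported lazily so make_proxy_opener() works without ipwhois installed.
    from ipwhois import IPWhois

    obj = IPWhois(address, proxy_opener=make_proxy_opener(proxy_url))
    # With bootstrap=True the lookup goes through the RDAP bootstrap
    # instead of being routed by ASN data, so no ASN query is expected.
    return obj.lookup_rdap(bootstrap=bootstrap)


# Example (needs network access; the proxy address is a placeholder):
# result = rdap_via_proxy('74.125.225.229', 'http://192.0.2.10:3128')
```

If the direct sockets disappear with bootstrap=True, they were coming from the ASN lookups rather than from the RDAP requests.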

I am working on a bulk lookup solution for this library, so this would be a good consideration.

ciokan commented 7 years ago

I'm not worried about the overhead. I'm worried about getting banned when doing multiple requests, since this thing is exposed as an API. For HTTP proxies you will have to issue a CONNECT over the socket and authenticate. I have working code for that if you're interested. Of course, you will have to pull the proxy details out of the ProxyHandler.

If I find the time, maybe I'll submit a PR. We will probably have to create a separate (single point) method that opens sockets while taking the provided proxies into account. Those may be at least of the following types: HTTP with CONNECT, SOCKS5, and SOCKS4.

secynic commented 7 years ago

I haven't had any issues with bans, only rate limiting.

From what I understand (correct me if I'm not reading this correctly), if a proxy is provided, you want the ability to route all traffic (DNS, WHOIS, HTTP) over a SOCKS4/5 or HTTP proxy? You also mention proxies (plural); do you mean to load-balance across multiple proxy IPs, or to specify a different proxy server per lookup method?

Maybe you can clarify a bit. For instance, ASN lookups are best performed over DNS (https://github.com/secynic/ipwhois/blob/master/ipwhois/net.py#L217), but fall back to whois (https://github.com/secynic/ipwhois/blob/master/ipwhois/net.py#L285) and then HTTP (https://github.com/secynic/ipwhois/blob/master/ipwhois/net.py#L384).
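To illustrate why one proxy setting does not cover everything, here is a rough sketch of that fallback order. It is not the library's actual code; first_successful and the stand-in lookups are hypothetical names used only for illustration. Each step uses a different transport, so each would need its own proxy handling:

```python
def first_successful(lookups):
    """Try each (name, callable) in order; return (name, result) of the first success."""
    errors = []
    for name, lookup in lookups:
        try:
            return name, lookup()
        except Exception as exc:  # any failure moves on to the next transport
            errors.append('%s: %s' % (name, exc))
    raise RuntimeError('all lookups failed: ' + '; '.join(errors))


# Stand-ins for the three transports (DNS, whois on port 43, HTTP).
def asn_via_dns():
    raise OSError('DNS blocked')


def asn_via_whois():
    return 'AS64500'  # placeholder ASN from the documentation range


def asn_via_http():
    return 'AS64500'


used, asn = first_successful([
    ('dns', asn_via_dns),
    ('whois', asn_via_whois),
    ('http', asn_via_http),
])
```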

Edit: I just realized I skipped over your comment on CONNECT. Please elaborate on this, as I think you may be hinting at persistent connections, which wouldn't apply to the REST (RDAP) queries but may apply to the ASN lookups.

ciokan commented 7 years ago

I'm providing the proxy per lookup. Nothing fancy; choosing and rotating proxies is something to be done by the user, and honestly I don't see it fitting inside your package, as it's not a one-size-fits-all type of thing.

Regarding traffic, whoever uses proxies does it for a reason, so all traffic originating from this package should go through the supplied proxy. People do it to overcome limitations such as banning or rate limiting.

Regarding the CONNECT bit, it was just a hint. I was referring to the parts where you open sockets. Using an HTTP proxy with a raw socket would require you to write a CONNECT request into it (of course, the proxy would have to support CONNECT and allow port 43).

Here's a bit from what I'm using on a similar project:

import base64
import socket


def http_proxy_connect(address=None, proxy=None, auth=None):
    """Open a socket to address, tunnelled through an HTTP proxy via CONNECT if given.

    Returns a (socket, status_code, response_headers) tuple.
    """
    def valid_address(addr):
        """ Verify that an IP/port tuple is valid """
        return (isinstance(addr, (list, tuple)) and len(addr) == 2
                and isinstance(addr[0], str) and isinstance(addr[1], int))

    if not valid_address(address):
        raise ValueError('Invalid target address')

    if not proxy:
        # No proxy given: connect directly to the target.
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect(tuple(address))
        return s, 0, {}

    if not valid_address(proxy):
        raise ValueError('Invalid proxy address')

    _headers = {
        'host': address[0]
    }

    if auth:
        if isinstance(auth, str):
            # A ready-made Proxy-Authorization header value.
            _headers['proxy-authorization'] = auth
        elif isinstance(auth, (tuple, list)):
            if len(auth) == 1:
                raise ValueError('Invalid authentication specification')

            t = auth[0]
            args = tuple(auth[1:])  # '%' formatting needs a tuple, not a list

            if t.lower() == 'basic' and len(args) == 2:
                auth_basic = "%s:%s" % args
                token = base64.b64encode(auth_basic.encode('utf-8')).decode('ascii')
                _headers['proxy-authorization'] = 'Basic ' + token
            else:
                raise ValueError('Invalid authentication specification')
        else:
            raise ValueError('Invalid authentication specification')

    # Connect to the proxy and ask it to open a tunnel to the target.
    s = socket.socket()
    s.connect(tuple(proxy))

    # newline='' disables newline translation, so '\r\n' is sent unchanged.
    fp = s.makefile('rw', newline='')
    fp.write('CONNECT %s:%d HTTP/1.1\r\n' % tuple(address))
    fp.write('\r\n'.join('%s: %s' % (k, v) for (k, v) in _headers.items()) + '\r\n\r\n')
    fp.flush()

    statusline = fp.readline().rstrip('\r\n')

    if statusline.count(' ') < 2:
        fp.close()
        s.close()
        raise IOError('Bad response')

    version, _status, statusmsg = statusline.split(' ', 2)

    if version not in ('HTTP/1.0', 'HTTP/1.1'):
        fp.close()
        s.close()
        raise IOError('Unsupported HTTP version')
    try:
        _status = int(_status)
    except ValueError:
        fp.close()
        s.close()
        raise IOError('Bad response')

    # Read the proxy's response headers up to the blank line that ends them.
    response_headers = {}

    while True:
        line = fp.readline().rstrip('\r\n')
        if line == '':
            break
        if ':' not in line:
            continue
        k, v = line.split(':', 1)
        response_headers[k.strip().lower()] = v.strip()

    # Closing the file wrapper leaves the underlying socket open for the caller.
    fp.close()
    return s, _status, response_headers
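Once the tunnel is established (status 200), the returned socket behaves like a direct connection to the target, so a plain whois exchange can run over it. A minimal sketch, assuming the whois server closes the connection after replying; whois_query is a hypothetical helper, and the proxy address in the commented usage is a placeholder:

```python
import socket


def whois_query(sock, query, bufsize=4096):
    """Send one whois query over an already-connected socket and return the reply."""
    # A whois request is a single line terminated by CRLF.
    sock.sendall(query.encode('ascii') + b'\r\n')
    chunks = []
    while True:
        data = sock.recv(bufsize)
        if not data:  # an empty read means the server closed the connection
            break
        chunks.append(data)
    return b''.join(chunks).decode('utf-8', errors='replace')


# Usage with the function above (needs a real proxy that allows CONNECT to port 43):
# s, status, headers = http_proxy_connect(('whois.arin.net', 43),
#                                         proxy=('192.0.2.10', 3128))
# if status == 200:
#     print(whois_query(s, '74.125.225.229'))
# s.close()
```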

secynic commented 7 years ago

Understood. Actually, I originally wrote the proxy support for corporate networks, which commonly block outbound port 43. The library started with only whois; more recently the RDAP protocol was introduced, so support was tacked on and the library was rewritten. This was before anonymous proxies/VPNs became popular for these types of things.

That being said, you make a good point. Let me look over your code and see what we can do.

Thanks for the detailed info.

secynic commented 7 years ago

Update: I apologize for the delays. I have been busy with work and other side projects.

This is currently sitting in priority behind:

  1. #158 asn_alts deprecation
  2. #134 Bulk whois (has been open longer, and needs consideration for CONNECT implementation)

secynic commented 7 years ago

@ciokan I added bulk lookup support in experimental.py (ipwhois.experimental.bulk_lookup_rdap): https://github.com/secynic/ipwhois/blob/dev/ipwhois/experimental.py

You won't need to worry about getting banned for the ASN lookups, since the Cymru bulk ASN lookup can be done with a single request (ipwhois.experimental.get_bulk_asn_whois).
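For context on the single request: Team Cymru's whois service accepts a bulk query in which the addresses are wrapped between begin and end lines on one port 43 connection. The sketch below shows that wire format only; it is not necessarily the exact payload the library sends, and both function names are hypothetical:

```python
import socket


def build_cymru_bulk_query(addresses):
    """Return the text of one bulk query covering every address in addresses."""
    lines = ['begin', 'verbose'] + list(addresses) + ['end']
    return '\n'.join(lines) + '\n'


def send_bulk_query(addresses, host='whois.cymru.com', port=43, timeout=30):
    """Send the bulk query over a single connection and return the raw reply."""
    payload = build_cymru_bulk_query(addresses).encode('ascii')
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closes the connection after the last row
                break
            chunks.append(data)
    return b''.join(chunks).decode('utf-8', errors='replace')


query = build_cymru_bulk_query(['192.0.2.1', '198.51.100.7'])
```

However many addresses are listed, the service sees one connection, which is why the ASN step is unlikely to trigger per-request limits.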

I believe this will solve your problem (at least for the short term). I would like to get v1.0.0 out soon, so I will open a new issue linked to this to be addressed in 1.x.x for individual queries. Let me know your thoughts, and if you get a chance to test.

secynic commented 7 years ago

Moving to 1.1.0 to remove any confusion, instead of opening a new issue.

secynic commented 6 years ago

@ciokan Did you get a chance to test this?