Open ciokan opened 7 years ago
It only proxies the HTTP requests (RDAP lookups). You are probably seeing the sockets being used for ASN lookups via DNS. The ASN lookups are generally very fast, so I wouldn't worry about the overhead on those.
As a test, try setting bootstrap=True, and see if those non-proxied sockets disappear.
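For anyone following along, here is a minimal sketch of that test, modelled on the proxy example in the ipwhois docs. The proxy URL and the helper names (build_proxy_opener, rdap_lookup_via_proxy) are placeholders made up for illustration. The ipwhois import is deferred so the opener part runs without the library installed.

```python
from urllib import request


def build_proxy_opener(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS requests via proxy_url."""
    handler = request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return request.build_opener(handler)


def rdap_lookup_via_proxy(ip, proxy_url, bootstrap=True):
    """Hypothetical helper: RDAP lookup through the proxy opener.

    bootstrap=True is the setting suggested above for checking whether
    the non-proxied sockets go away.
    """
    from ipwhois import IPWhois  # third-party, imported lazily

    obj = IPWhois(ip, proxy_opener=build_proxy_opener(proxy_url))
    return obj.lookup_rdap(bootstrap=bootstrap)
```

Calling rdap_lookup_via_proxy('74.125.225.229', 'http://192.168.0.1:80/') while watching open sockets should show whether only the proxied HTTP connection remains.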
I am working on a bulk lookup solution for this library, so this would be a good consideration.
I'm not worried about the overhead. I'm worried about getting banned when doing multiple requests, since this thing is exposed as an API. For http proxies you will have to do a CONNECT via the sockets and authenticate. I have working code for that if you're interested. Ofc you will have to pull the proxies out from the ProxyHandler.
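A rough sketch of what pulling the proxies out of the ProxyHandler could look like. It relies on the proxies attribute that CPython's ProxyHandler sets in its constructor, which is an implementation detail and not documented API, so treat that as an assumption. The function name proxies_from_opener is made up for this example.

```python
from urllib import request
from urllib.parse import urlsplit


def proxies_from_opener(opener):
    """Return {scheme: (host, port, username, password)} for each proxy in the opener."""
    found = {}
    for handler in opener.handlers:
        if isinstance(handler, request.ProxyHandler):
            # handler.proxies maps a scheme ('http', 'https') to a proxy URL.
            for scheme, url in handler.proxies.items():
                parts = urlsplit(url)
                found[scheme] = (parts.hostname, parts.port,
                                 parts.username, parts.password)
    return found
```

The host and port can then be handed to whatever code opens the raw sockets, with the username and password used for proxy authentication.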
If I find the time maybe I'll submit a PR. We will probably have to create a separate (single point) method that opens up sockets with consideration of the provided proxies, which may be at least of types http with connect, socks5 and socks4.
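To make the "single point" idea concrete, here is one possible shape for it: a single function that every socket user calls, dispatching on the proxy URL scheme. This is only a sketch with made-up names (open_socket, CONNECTORS), not anything in ipwhois. Only the direct path is implemented; http, socks5 and socks4 connectors would be registered in the same table.

```python
import socket
from urllib.parse import urlsplit


def _connect_direct(address, proxy_parts, timeout):
    """No proxy: plain TCP connection to (host, port)."""
    return socket.create_connection(address, timeout=timeout)


# Proxy scheme -> connector. Connectors for 'http' (CONNECT), 'socks5' and
# 'socks4' would be added here; they are left out of this sketch.
CONNECTORS = {None: _connect_direct}


def open_socket(address, proxy_url=None, timeout=5):
    """Single entry point for opening sockets, honouring an optional proxy."""
    scheme, parts = None, None
    if proxy_url:
        parts = urlsplit(proxy_url)
        scheme = parts.scheme.lower()
    try:
        connector = CONNECTORS[scheme]
    except KeyError:
        raise ValueError('Unsupported proxy type: %r' % scheme)
    return connector(address, parts, timeout)
```

Failing loudly on an unsupported proxy type, instead of silently connecting directly, matters here because a silent fallback would leak the real IP.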
I haven't had any issues with bans, only rate limiting.
From what I understand (correct me if I'm not reading this correctly), if a proxy is provided, you want the ability to route all traffic (DNS, WHOIS, HTTP) over socks4/5/HTTP proxy? Also you mention proxies (plural); do you mean to load-balance across multiple proxy ips, or specify a different proxy server per lookup method?
Maybe you can clarify a bit. For instance, ASN lookups are best performed over DNS (https://github.com/secynic/ipwhois/blob/master/ipwhois/net.py#L217), but fallback to whois (https://github.com/secynic/ipwhois/blob/master/ipwhois/net.py#L285) and HTTP (https://github.com/secynic/ipwhois/blob/master/ipwhois/net.py#L384).
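The fallback order described here (DNS first, then whois, then HTTP) boils down to trying each method in turn until one succeeds. A generic sketch of that pattern, with a made-up helper name and caller-supplied callables standing in for the actual lookup methods:

```python
def first_successful(lookups, ip):
    """Try each (name, func) pair in order and return (name, result) of the first success."""
    errors = []
    for name, func in lookups:
        try:
            return name, func(ip)
        except Exception as exc:  # a failed method just moves on to the next one
            errors.append('%s: %s' % (name, exc))
    raise LookupError('All ASN lookup methods failed (%s)' % '; '.join(errors))
```

With a proxy in play, each of the three methods opens a different kind of connection (UDP/TCP 53, TCP 43, HTTP), which is why routing "all traffic" through one proxy is not trivial.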
Edit: I just realized I skipped over your comment on CONNECT. Please elaborate on this, as I think you may be hinting at persistent connections, which wouldn't apply to the REST (RDAP) queries but may apply to the ASN lookups.
I'm providing the proxy per lookup. Nothing fancy as this is something to be done by the user and I don't see it fit inside your package honestly as it's not a one size fits all type of thing.
Regarding traffic, whoever uses proxies does it for a reason so all traffic originating from this package should be done via the supplied proxy. People do it to overcome limitations such as banning or rate limiting.
Regarding the CONNECT bit, it was just a hint. I was referring to the parts where you open up sockets. Using an http proxy inside the socket would require you to write some CONNECT directives into it (ofc the proxy would have to support CONNECT and have the port 43 open).
Here's a bit from what I'm using on a similar project:
import base64
import socket


def http_proxy_connect(address=None, proxy=None, auth=None):
    """ Open a TCP socket to address, tunnelled via HTTP CONNECT when proxy is given.
    Returns (socket, status code, response headers); status is 0 without a proxy. """

    def valid_address(addr):
        """ Verify that an IP/port tuple is valid """
        return (isinstance(addr, (list, tuple)) and len(addr) == 2
                and isinstance(addr[0], str) and isinstance(addr[1], int))

    if not valid_address(address):
        raise ValueError('Invalid target address')

    if not proxy:
        # No proxy supplied: connect straight to the target.
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect(tuple(address))
        return s, 0, {}

    if not valid_address(proxy):
        raise ValueError('Invalid proxy address')

    _headers = {
        'host': address[0]
    }

    if auth:
        if isinstance(auth, str):
            # Pre-built header value, e.g. 'Basic dXNlcjpwYXNz'
            _headers['proxy-authorization'] = auth
        elif isinstance(auth, (tuple, list)):
            if len(auth) == 1:
                raise ValueError('Invalid authentication specification')
            t = auth[0]
            args = tuple(auth[1:])  # tuple, so %-formatting also works for list input
            if t.lower() == 'basic' and len(args) == 2:
                auth_basic = "%s:%s" % args
                token = base64.b64encode(auth_basic.encode('utf-8')).decode('ascii')
                _headers['proxy-authorization'] = 'Basic ' + token
            else:
                raise ValueError('Invalid authentication specification')
        else:
            raise ValueError('Invalid authentication specification')

    s = socket.socket()
    s.connect(tuple(proxy))
    # newline='' stops the text wrapper from rewriting line endings on any platform.
    fp = s.makefile('rw', newline='')

    fp.write('CONNECT %s:%d HTTP/1.1\r\n' % tuple(address))
    fp.write('\r\n'.join('%s: %s' % (k, v) for (k, v) in _headers.items()) + '\r\n\r\n')
    fp.flush()

    statusline = fp.readline().rstrip('\r\n')
    if statusline.count(' ') < 2:
        fp.close()
        s.close()
        raise IOError('Bad response')

    version, _status, statusmsg = statusline.split(' ', 2)
    if version not in ('HTTP/1.0', 'HTTP/1.1'):
        fp.close()
        s.close()
        raise IOError('Unsupported HTTP version')

    try:
        _status = int(_status)
    except ValueError:
        fp.close()
        s.close()
        raise IOError('Bad response')

    # Read the proxy's response headers up to the blank line.
    response_headers = {}
    while True:
        line = fp.readline().rstrip('\r\n')
        if line == '':
            break
        if ':' not in line:
            continue
        k, v = line.split(':', 1)
        response_headers[k.strip().lower()] = v.strip()

    fp.close()
    return s, _status, response_headers
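Once the tunnel is up, the returned socket behaves like a direct connection to port 43, so a whois query is just the query string plus CRLF, then read until the server closes. A small sketch of that step (whois_query is a name made up for this example, and it takes any already-connected socket):

```python
import socket


def whois_query(sock, query, bufsize=4096):
    """Send a WHOIS query over a connected socket and return the full reply as text."""
    sock.sendall(query.encode('ascii') + b'\r\n')
    chunks = []
    while True:
        data = sock.recv(bufsize)
        if not data:  # server closed the connection: reply is complete
            break
        chunks.append(data)
    return b''.join(chunks).decode('utf-8', errors='replace')
```

Usage would be along the lines of s, status, headers = http_proxy_connect(('whois.arin.net', 43), ('proxy.example', 3128)), then whois_query(s, '74.125.225.229') only if status is 200. The proxy host here is a placeholder.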
Understood. Actually, I originally wrote the proxy support for corporate networks that commonly block outbound port 43. The library started with only whois, and more recently the RDAP protocol was introduced, so support was tacked on, and the library re-written. This was before anon proxies/vpns were very popular for these types of things.
That being said, you make a good point. Let me look over your code and see what we can do.
Thanks for the detailed info.
Update: I apologize for the delays. I have been busy with work and other side projects.
This is currently sitting in priority behind:
@ciokan I added bulk lookup support in experimental.py (ipwhois.experimental.bulk_lookup_rdap): https://github.com/secynic/ipwhois/blob/dev/ipwhois/experimental.py
You won't need to worry about getting banned for the ASN lookups, since the Cymru bulk ASN lookup can be done with a single request (ipwhois.experimental.get_bulk_asn_whois).
I believe this will solve your problem (at least for the short term). I would like to get v1.0.0 out soon, so I will open a new issue linked to this to be addressed in 1.x.x for individual queries. Let me know your thoughts, and if you get a chance to test.
Moving to 1.1.0 to remove any confusion, instead of opening a new issue.
@ciokan Did you get a chance to test this?
I see sockets still being used (without proxy) even when a proxy opener is provided and obj.lookup_rdap is used. Is this safe when doing a lot of requests?