Closed nyov closed 4 years ago
I don't want to discourage your contribution - it looks good. But I think we should provide another resolver, based on twisted.names, which can be enabled optionally. It should solve issues with IPv6, as well as performance issues when a large number of domains is crawled.
See also:
This PR can be still useful though, for the cases twisten.names is not suitable.
Yeah I agree, and I did look into that in parallel, starting here https://twistedmatrix.com/documents/14.0.1/names/howto/client-tour.html#creating-a-new-resolver.
But as I'm not sure if I find time to got through with the t.n.c approach, and this conversion was easy to accomplish and fixing issues with the current one, I just finished it first.
1092 looks like a nice addon, playing around with the threadpool size. But the underlying blocking code is still the same.
Personally, for massive dns lookups I wouldn't really go with either approach.
I've been playing around with a libunbound
-based setup there, which is not going through the host's facilities at all. Though I don't have nearly enough dns lookups to see a difference ;)
I think best choice for default Scrapy behavior is to use host DNS resolution capabilities, because Scrapy is general purpose framework and it should work in any conditions (/etc/hosts, LDAP) and be transparent enough for dns debugging. IMO, any approach implying passing to Scrapy nameservers list, or any other way to configure it for dns resolution should be optional and considered as advanced set up.
Regarding, IPv6 in Scrapy IMO, it is wrong if we will require DNS setup in order to make it work. I think it should work as IPv4, right out of the box.
Regarding non-blocking resolution performance in Scrapy. It's not clear that it will be faster than threaded model. Remember that GIL unblocks on IO calls. So, Scrapy main event-loop delegates DNS resolution to the thread pool, allowing to use other cores for that workload. In case of async model, Scrapy event loop will be serving all IO, including DNS, running all that cycles within same core.
@sibiryakov, I didn't mean to say scrapy should roll out with a resolver that circumvents host's facilities, sorry if it came across like that. That was just some personal pride I guess.
The twisted.names
approach could allow passing nameserver addresses directly to be used by the resolver, and I think supporting it would be nice, but yes it should be an optional feature.
Not sure what you mean with working with "IPv4 out of the box", I think it also should work with IPv6-only stacks out of the box - which is why I wrote this code.
It's working exactly like the currently used ThreadedResolver, only replacing it with a getaddrinfo
call.
So it's useable as a drop-in replacement and the risk that it behaves differently from the current implementation is minimal, where as a twisted.names
approach would be different and require more testing.
Using getaddrinfo
is also superior for IPv4-only setups, as it can return multiple IPs in a single call and with some code-changes we could load-balance requests across different IPs for a given domain or service.
I've rebased this on the unittest changes in #1117, tests were failing because it would resolve localhost to an IPv6 address, but the mock server didn't listen there.
I have also considered adding a few settings here for the resolver, such as
DNS_IPV4_ONLY
DNS_IPV6_ONLY
DNS_IPV4_PREFERRED
DNS_IPV6_PREFERRED
to prioritise the returned values.
Maybe this is all wasted effort, but for the interim, until we may have a twisted.names
approach for this, I would consider some IPv6 capability better than none.
(Another thought: should we really depend on this Twisted api for DNS if we consider other approaches like asyncio? Maybe staying higher level is a feature?)
Given the problems with twisted.names @sibiryakov found in https://github.com/scrapy/scrapy/pull/1092#issuecomment-86466566 this PR started to look appealing. It is not a wasted effort!
Another thought: should we really depend on this Twisted api for DNS if we consider other approaches like asyncio? Maybe staying higher level is a feature?
From user point of view currently there is a single DNSCACHE_ENABLED option. I think we are free to change the actual implementation.
DNS_IPV4_ONLY
DNS_IPV6_ONLY
DNS_IPV4_PREFERRED
DNS_IPV6_PREFERRED
I'm fine with these settings, but there should be 1 or 2 options instead of 4 because they are mutually exclusive.
This would be great for a simpler implementation also. I'm only missing a good name for such an option. DNS_RESOLUTION_PREFERENCE
? it's a bit of a mouthful.
Added this resolver preference option. The sorting code looks a bit awkward though.
81.89%
Merging #1104 into master will decrease coverage by -0.28% as of
2fa9888
@@ master #1104 diff @@
======================================
Files 165 165
Stmts 8153 8195 +42
Branches 1134 1141 +7
Methods 0 0
======================================
+ Hit 6699 6711 +12
- Partial 263 269 +6
- Missed 1191 1215 +24
Review entire Coverage Diff as of
2fa9888
Powered by Codecov. Updated on successful CI builds.
@@ master #1104 diff @@
==========================================
Files 161 161
Lines 8730 8772 +42
Methods 0 0
Messages 0 0
Branches 1285 1292 +7
==========================================
+ Hits 7279 7307 +28
- Misses 1201 1210 +9
- Partials 250 255 +5
Powered by Codecov. Last update de89b1b...e0b211d
Thanks @nyov ! @sibiryakov , @kmike , how does the PR look for you?
@redapple the code looks good, but I haven't tried it yet.
For what it's worth, this seems like an odd thing for Scrapy to try to implement itself. Twisted's HostnameEndpoint
will do not only the right IPv6 lookup, but also implement the fast connection algorithm that goes along with getaddrinfo
.
Twisted doesn't currently use this endpoint for HTTP, but it definitely should (and work has been underway for some time to make it do so); it will start doing so in an upcoming release, almost certainly this year.
In the meanwhile, if you want to hack it in yourself, there's an appropriate extension API, endpointForURI
, which you can implement and pass to an Agent
.
Thank you for the info, @glyph.
You are probably right that this is a problem which shouldn't necessarily be solved in Scrapy itself.
From what @kmike mentioned, it seemed a solution using twisted.names
was being investigated? HostnameEndpoint
features look interesting, but endpointForURI
seems rather new itself, from twisted 15.0.
I'm not sure what's the best way to implement this here. My understanding of the twisted library is, admittedly, limited, and not having the time to investigate properly. I hope someone else here will look into your suggestions!
This code was written from "necessity", as usual. It seemed good enough, replicating things exactly like twisted did with the gethostbyname
resolution, just replacing the function lookup to use getaddrinfo
. From that perspective it wasn't doing things any worse than the code before. Though that was before adding the IPv6 configurable stuff.
This code was for myself initially, after becoming aware of the "GHOST" vulnerability (CVE-2015-0235) and not quite being able to figure out if a crawler might be exploited by a target website. A bit later I added the IPv6 lookup features and shared it, in response to someone else looking for some IPv6 support. I'm still using it until some other solution comes around.
As an aside, gethostbyname
is so outdated, it really shouldn't be used anywhere anymore. E.g. the Postfix homepage says about it:
GHOST Attack: Postfix does not call gethostbyname() since 2005. There is no Postfix code that invokes this function unless Postfix is specifically built for operating systems from more than 10 years ago.
@glyph's opinion has serious weight to us though, I think.
The others will probably agree. @kmike, @redapple, @sibiryakov, what's your take?
If this is unlikely to see a merge then it seems I should maybe invest some time to make it available as a scrapy-plugin for other users in the interim?
We just released Twisted 16.0, which includes wrapClientTLS
, the relevant API for getting HTTPS on your outbound HostnameEndpoint
s, so things are even easier now :).
Work is proceeding on Twisted for this issue; I just submitted the last building-block, which will automatically try IPv6 hosts using the Happy Eyeballs algorithm by default with Agent: https://github.com/twisted/twisted/pull/624
Also of interest, in case you missed it - one of the preceding bits of work involved in Twisted here is to allow for plugging in a DNS resolver that can resolve multiple addresses, unlike the previously somewhat limited IPv4-only, one-address-only pluggable resolver on the reactor. You can see that work here: https://twistedmatrix.com/trac/ticket/4362 until it's in the docs for a new release.
That pluggable resolver might be of interest if you have good reason to implement oddball name resolution policies you've outlined above, like 46
:).
The twisted ticket PR has landed, so it should be available in the next release.
So, what about IPv6 in Scrapy?
@w495, I personally cannot say if it works. I would say no. For sure, it is not currently tested. And Scrapy looks incompatible with upcoming Twisted 16.7. So there's work.
Has been superseded/implemented by #4227 Closing this.
For #1031
I do not currently have the chance to test this in a v6-only environment. Feedback appreciated.