scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
51.16k stars 10.35k forks source link

IPv6ThreadedResolver based on socket.getaddrinfo #1104

Closed nyov closed 4 years ago

nyov commented 9 years ago

For #1031

I do not currently have the chance to test this in a v6-only environment. Feedback appreciated.

kmike commented 9 years ago

I don't want to discourage your contribution - it looks good. But I think we should provide another resolver, based on twisted.names, which can be enabled optionally. It should solve issues with IPv6, as well as performance issues when a large number of domains is crawled.

See also:

This PR can be still useful though, for the cases twisten.names is not suitable.

nyov commented 9 years ago

Yeah I agree, and I did look into that in parallel, starting here https://twistedmatrix.com/documents/14.0.1/names/howto/client-tour.html#creating-a-new-resolver.

But as I'm not sure if I find time to got through with the t.n.c approach, and this conversion was easy to accomplish and fixing issues with the current one, I just finished it first.

nyov commented 9 years ago

1092 looks like a nice addon, playing around with the threadpool size. But the underlying blocking code is still the same.

Personally, for massive dns lookups I wouldn't really go with either approach. I've been playing around with a libunbound-based setup there, which is not going through the host's facilities at all. Though I don't have nearly enough dns lookups to see a difference ;)

sibiryakov commented 9 years ago

I think best choice for default Scrapy behavior is to use host DNS resolution capabilities, because Scrapy is general purpose framework and it should work in any conditions (/etc/hosts, LDAP) and be transparent enough for dns debugging. IMO, any approach implying passing to Scrapy nameservers list, or any other way to configure it for dns resolution should be optional and considered as advanced set up.

Regarding, IPv6 in Scrapy IMO, it is wrong if we will require DNS setup in order to make it work. I think it should work as IPv4, right out of the box.

Regarding non-blocking resolution performance in Scrapy. It's not clear that it will be faster than threaded model. Remember that GIL unblocks on IO calls. So, Scrapy main event-loop delegates DNS resolution to the thread pool, allowing to use other cores for that workload. In case of async model, Scrapy event loop will be serving all IO, including DNS, running all that cycles within same core.

nyov commented 9 years ago

@sibiryakov, I didn't mean to say scrapy should roll out with a resolver that circumvents host's facilities, sorry if it came across like that. That was just some personal pride I guess. The twisted.names approach could allow passing nameserver addresses directly to be used by the resolver, and I think supporting it would be nice, but yes it should be an optional feature.

Not sure what you mean with working with "IPv4 out of the box", I think it also should work with IPv6-only stacks out of the box - which is why I wrote this code. It's working exactly like the currently used ThreadedResolver, only replacing it with a getaddrinfo call. So it's useable as a drop-in replacement and the risk that it behaves differently from the current implementation is minimal, where as a twisted.names approach would be different and require more testing. Using getaddrinfo is also superior for IPv4-only setups, as it can return multiple IPs in a single call and with some code-changes we could load-balance requests across different IPs for a given domain or service.

nyov commented 9 years ago

I've rebased this on the unittest changes in #1117, tests were failing because it would resolve localhost to an IPv6 address, but the mock server didn't listen there.

I have also considered adding a few settings here for the resolver, such as

DNS_IPV4_ONLY
DNS_IPV6_ONLY
DNS_IPV4_PREFERRED
DNS_IPV6_PREFERRED

to prioritise the returned values.

Maybe this is all wasted effort, but for the interim, until we may have a twisted.names approach for this, I would consider some IPv6 capability better than none. (Another thought: should we really depend on this Twisted api for DNS if we consider other approaches like asyncio? Maybe staying higher level is a feature?)

kmike commented 9 years ago

Given the problems with twisted.names @sibiryakov found in https://github.com/scrapy/scrapy/pull/1092#issuecomment-86466566 this PR started to look appealing. It is not a wasted effort!

kmike commented 9 years ago

Another thought: should we really depend on this Twisted api for DNS if we consider other approaches like asyncio? Maybe staying higher level is a feature?

From user point of view currently there is a single DNSCACHE_ENABLED option. I think we are free to change the actual implementation.

kmike commented 9 years ago
DNS_IPV4_ONLY
DNS_IPV6_ONLY
DNS_IPV4_PREFERRED
DNS_IPV6_PREFERRED

I'm fine with these settings, but there should be 1 or 2 options instead of 4 because they are mutually exclusive.

nyov commented 9 years ago

This would be great for a simpler implementation also. I'm only missing a good name for such an option. DNS_RESOLUTION_PREFERENCE? it's a bit of a mouthful.

nyov commented 9 years ago

Added this resolver preference option. The sorting code looks a bit awkward though.

codecov-io commented 8 years ago

Current coverage is 81.89%

Merging #1104 into master will decrease coverage by -0.28% as of 2fa9888

@@            master   #1104   diff @@
======================================
  Files          165     165       
  Stmts         8153    8195    +42
  Branches      1134    1141     +7
  Methods          0       0       
======================================
+ Hit           6699    6711    +12
- Partial        263     269     +6
- Missed        1191    1215    +24

Review entire Coverage Diff as of 2fa9888

Powered by Codecov. Updated on successful CI builds.

codecov-io commented 8 years ago

Current coverage is 83.29% (diff: 69.56%)

Merging #1104 into master will decrease coverage by 0.08%

@@             master      #1104   diff @@
==========================================
  Files           161        161          
  Lines          8730       8772    +42   
  Methods           0          0          
  Messages          0          0          
  Branches       1285       1292     +7   
==========================================
+ Hits           7279       7307    +28   
- Misses         1201       1210     +9   
- Partials        250        255     +5   

Powered by Codecov. Last update de89b1b...e0b211d

redapple commented 8 years ago

Thanks @nyov ! @sibiryakov , @kmike , how does the PR look for you?

kmike commented 8 years ago

@redapple the code looks good, but I haven't tried it yet.

glyph commented 8 years ago

For what it's worth, this seems like an odd thing for Scrapy to try to implement itself. Twisted's HostnameEndpoint will do not only the right IPv6 lookup, but also implement the fast connection algorithm that goes along with getaddrinfo.

Twisted doesn't currently use this endpoint for HTTP, but it definitely should (and work has been underway for some time to make it do so); it will start doing so in an upcoming release, almost certainly this year.

In the meanwhile, if you want to hack it in yourself, there's an appropriate extension API, endpointForURI, which you can implement and pass to an Agent.

nyov commented 8 years ago

Thank you for the info, @glyph. You are probably right that this is a problem which shouldn't necessarily be solved in Scrapy itself. From what @kmike mentioned, it seemed a solution using twisted.names was being investigated? HostnameEndpoint features look interesting, but endpointForURI seems rather new itself, from twisted 15.0.

I'm not sure what's the best way to implement this here. My understanding of the twisted library is, admittedly, limited, and not having the time to investigate properly. I hope someone else here will look into your suggestions!

This code was written from "necessity", as usual. It seemed good enough, replicating things exactly like twisted did with the gethostbyname resolution, just replacing the function lookup to use getaddrinfo. From that perspective it wasn't doing things any worse than the code before. Though that was before adding the IPv6 configurable stuff.

This code was for myself initially, after becoming aware of the "GHOST" vulnerability (CVE-2015-0235) and not quite being able to figure out if a crawler might be exploited by a target website. A bit later I added the IPv6 lookup features and shared it, in response to someone else looking for some IPv6 support. I'm still using it until some other solution comes around.

As an aside, gethostbyname is so outdated, it really shouldn't be used anywhere anymore. E.g. the Postfix homepage says about it:

GHOST Attack: Postfix does not call gethostbyname() since 2005. There is no Postfix code that invokes this function unless Postfix is specifically built for operating systems from more than 10 years ago.

nyov commented 8 years ago

@glyph's opinion has serious weight to us though, I think.

The others will probably agree. @kmike, @redapple, @sibiryakov, what's your take?

If this is unlikely to see a merge then it seems I should maybe invest some time to make it available as a scrapy-plugin for other users in the interim?

glyph commented 8 years ago

We just released Twisted 16.0, which includes wrapClientTLS, the relevant API for getting HTTPS on your outbound HostnameEndpoints, so things are even easier now :).

glyph commented 7 years ago

Work is proceeding on Twisted for this issue; I just submitted the last building-block, which will automatically try IPv6 hosts using the Happy Eyeballs algorithm by default with Agent: https://github.com/twisted/twisted/pull/624

glyph commented 7 years ago

Also of interest, in case you missed it - one of the preceding bits of work involved in Twisted here is to allow for plugging in a DNS resolver that can resolve multiple addresses, unlike the previously somewhat limited IPv4-only, one-address-only pluggable resolver on the reactor. You can see that work here: https://twistedmatrix.com/trac/ticket/4362 until it's in the docs for a new release.

glyph commented 7 years ago

That pluggable resolver might be of interest if you have good reason to implement oddball name resolution policies you've outlined above, like 46 :).

glyph commented 7 years ago

The twisted ticket PR has landed, so it should be available in the next release.

w495 commented 7 years ago

So, what about IPv6 in Scrapy?

redapple commented 7 years ago

@w495, I personally cannot say if it works. I would say no. For sure, it is not currently tested. And Scrapy looks incompatible with upcoming Twisted 16.7. So there's work.

nyov commented 4 years ago

Has been superseded/implemented by #4227 Closing this.