scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

[MRG+1] Fix for KeyError in robots.txt middleware #1735

Closed. ArturGaspar closed this 8 years ago

ArturGaspar commented 8 years ago

Fixes #1733
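
For context on the bug: the KeyError came from the middleware's per-netloc cache of robots.txt parsers, where the failure path could drop an entry that later lookups still expected. The snippet below is a minimal, self-contained sketch of that pattern and of the fix direction discussed in this PR (keep a None sentinel meaning "no rules" instead of dropping the entry); it is not the actual Scrapy code, and the class and method names are only illustrative.

```python
# Illustrative sketch only, not the Scrapy middleware itself.
from urllib.parse import urlparse


class RobotsCacheSketch:
    def __init__(self):
        # netloc -> parser object, or None when robots.txt could not be fetched
        self._parsers = {}

    def on_robots_failure(self, netloc):
        # Buggy variant: popping the entry makes later lookups raise KeyError.
        #   self._parsers.pop(netloc)
        # Fix direction: keep a sentinel meaning "no rules, allow everything".
        self._parsers[netloc] = None

    def allowed(self, url, useragent="*"):
        netloc = urlparse(url).netloc
        parser = self._parsers.get(netloc)
        if parser is None:
            # No parser available (fetch failed or not cached): allow the request.
            return True
        return parser.can_fetch(useragent, url)
```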

codecov-io commented 8 years ago

Current coverage is 83.30%

Merging #1735 into master will increase coverage by +0.01% as of 380e480

Powered by Codecov. Updated on successful CI builds.

kmike commented 8 years ago

LGTM.

nramirezuy commented 8 years ago

I feel this component needs a slight rework.

kmike commented 8 years ago

@nramirezuy can refactoring this middleware be a separate issue? It has a few other problems: it doesn't handle Crawl-Delay directives (https://github.com/scrapy/scrapy/issues/892), and it breaks on robots.txt files found in the wild (https://github.com/scrapy/scrapy/issues/754).

kmike commented 8 years ago

Or do you feel refactoring would help with this particular issue?

nramirezuy commented 8 years ago

@kmike Yes, the second point should fix the issue: just avoid every request to a domain whose robots.txt failed to download. For example, what if the domain just blocked the user agent you are using, or the crawler's IP?

Storing the failure might be overkill; just storing None should be enough. You can also collect stats to see the different failures you are getting on robots.txt requests.
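
A hedged sketch of that suggestion, assuming the middleware holds a crawler reference and a `_parsers` cache; `crawler.stats.inc_value()` is a real Scrapy API, but the stat names and this handler are made up for illustration.

```python
# Sketch of an errback along the lines suggested above; stat names are invented.
def _robots_error(self, failure, netloc):
    # Remember only that the fetch failed; None means "allow everything".
    self._parsers[netloc] = None
    # Collect stats so the different robots.txt failures are visible.
    self.crawler.stats.inc_value('robotstxt/exception_count')
    self.crawler.stats.inc_value(
        'robotstxt/exception_count/%s' % failure.type.__name__)
```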

ArturGaspar commented 8 years ago

@nramirezuy I'm not sure why I was removing the netloc in the first place. All tests still pass when leaving it as None, and that is what the middleware already did, so I'm restoring that behaviour.

I don't understand your first suggestion, and I'm not using callback/errback because the previous code didn't. I think blocking every request when robots.txt fails to download is inconsistent with the behaviour on other errors (and incompatible with the previous behaviour on download errors), which is to treat the site as allowing everything.

nramirezuy commented 8 years ago

@ArturGaspar I just looked at the version before the Deferred-based implementation. It looks like the robots.txt request was made only once, and if it failed you could still download from the site; it also let you download while the robots.txt request had not been handled yet.

nramirezuy commented 8 years ago

Can we document the behaviour? With the Deferred-based implementation the behaviour changed: it now waits for the robots.txt request to be downloaded.

kmike commented 8 years ago

@nramirezuy I think that not downloading when the robots.txt request fails is a bug, but waiting for robots.txt to be downloaded before proceeding with the main website is a bug fix.
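
To make the "wait for robots.txt" behaviour concrete, here is a simplified sketch, not the actual implementation: process_request returns a Deferred, so Scrapy does not download the main request until a robots.txt parser (or a None sentinel for a failed fetch) is available.

```python
# Simplified sketch of the Deferred-based flow discussed here; the real
# middleware differs in how it fetches, caches and schedules robots.txt.
from twisted.internet.defer import maybeDeferred
from scrapy.exceptions import IgnoreRequest


class RobotsWaitSketch:
    def process_request(self, request, spider):
        # Returning a Deferred makes the download of `request` wait on it.
        d = maybeDeferred(self._get_parser, request, spider)
        d.addCallback(self._check_allowed, request)
        return d

    def _get_parser(self, request, spider):
        # Placeholder: the real middleware fetches and caches robots.txt here,
        # returning a parser object, or None when the fetch failed.
        return None

    def _check_allowed(self, parser, request):
        # None (failed fetch) is treated as allow-all; otherwise ask the parser.
        if parser is not None and not parser.can_fetch('*', request.url):
            raise IgnoreRequest("Forbidden by robots.txt")
```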

nramirezuy commented 8 years ago

@kmike I agree, that is how I think it should be implemented. But we should also document it, because that will inform users of the middleware and help us review it in the future.

ArturGaspar commented 8 years ago

@nramirezuy The documentation mentioned it as a limitation, and #1473 removed that warning from the documentation.

nramirezuy commented 8 years ago

@ArturGaspar But we should add a warning about what happens when the robots.txt request fails (when it reaches the errback), and also mention that it goes through the downloader middlewares but doesn't use the spider middlewares. Sorry, I'm just dumping everything while I'm at it; otherwise it gets lost in the void.

ArturGaspar commented 8 years ago

@nramirezuy Isn't it better to open a separate issue for improving the documentation of this middleware?

kmike commented 8 years ago

Sorry, I'm falling behind :) What currently happens if the robots.txt request fails?

ArturGaspar commented 8 years ago

@kmike

> Is it retried?

By the default retry middleware. (Without this PR, it would be retried on every new request to a domain, and only remembered for that domain if it succeeded.)

> If the robots.txt request fails, is the main request sent? If so, does it happen before or after retries?

Yes. After the retries.

> Is an error with robots.txt logged?

Yes.

> Is the behaviour different in Scrapy 1.0.x?

With this PR, no, except for the previously documented limitation (that in 1.0 the first request is made before robots.txt, and some requests can still go through before it is downloaded).
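
For reference, the retry behaviour above is just the normal downloader middleware chain at work, so it is controlled by the usual settings; a minimal settings sketch (values are examples only):

```python
# settings.py sketch; the values are illustrative, only the setting names are real.
ROBOTSTXT_OBEY = True   # enables RobotsTxtMiddleware
RETRY_ENABLED = True    # failed robots.txt requests are retried like any other
RETRY_TIMES = 2         # retries before the failure is finally reported
```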

kmike commented 8 years ago

:+1: that's perfect

nramirezuy commented 8 years ago

> Is it retried?
>
> By the default retry middleware. (Without this PR, it would be retried on every new request to a domain, and only remembered for that domain if it succeeded.)

That depends on the position of RobotsTxtMiddleware in the middleware chain; with the default position the answer is valid.

> If the robots.txt request fails, is the main request sent? If so, does it happen before or after retries?
>
> Yes. After the retries.

That also depends on the position of RobotsTxtMiddleware in the chain; with the default position the answer is valid.

> Is the behaviour different in Scrapy 1.0.x?
>
> With this PR, no, except for the previously documented limitation (that in 1.0 the first request is made before robots.txt, and some requests can still go through before it is downloaded).

The 1.0.x behaviour was bogus, so it has to change. Maybe in another PR.

@ArturGaspar can you create a PR changing the behaviour discussed in https://github.com/scrapy/scrapy/pull/1735#issuecomment-176225527, together with its documentation?

ArturGaspar commented 8 years ago

@nramirezuy I think the position of the middleware doesn't matter, because it calls engine.download() for a new request, which goes through the entire middleware chain. Isn't that so?
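
A hedged sketch of that mechanism, assuming a middleware with a crawler reference; engine.download(request, spider) and the dont_obey_robotstxt meta key existed in Scrapy at the time, while the helper names here are only illustrative.

```python
# Illustrative only: fetching robots.txt via the engine sends the request through
# the whole downloader middleware chain (retries, proxies, ...), regardless of
# where RobotsTxtMiddleware itself sits in that chain.
from scrapy.http import Request


def fetch_robots(self, netloc, spider):
    robotsreq = Request(
        'http://%s/robots.txt' % netloc,
        priority=1000,                        # fetch before regular requests
        meta={'dont_obey_robotstxt': True},   # avoid recursing into this middleware
    )
    dfd = self.crawler.engine.download(robotsreq, spider)
    dfd.addCallback(self._parse_robots, netloc)
    dfd.addErrback(self._robots_error, netloc)
    return dfd
```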

> @ArturGaspar can you create a PR changing the behaviour discussed in #1735 (comment), together with its documentation?

Sorry, I'm confused. You mean documenting only what was improved in #1473 and this PR, or the behaviour that was already present (what happens when robots.txt fails etc.)?

nramirezuy commented 8 years ago

@ArturGaspar Yeah, you are right, it will work regardless. What I'm not sure about is whether it gets propagated to the spider middlewares, and whether that is the same for errors and for normal responses.

We should first fix https://github.com/scrapy/scrapy/pull/1735#issuecomment-176225527, then document the change. It would also be useful to document that the robots.txt request indeed passes through the downloader middlewares, and whether it gets propagated to the spider middlewares too, so we don't have to figure it out every time we want to use it.

kmike commented 8 years ago

@nramirezuy sorry, what exactly should we fix?

nramirezuy commented 8 years ago

> I think that not downloading when the robots.txt request fails is a bug

This one should be settable (see https://github.com/scrapy/scrapy/pull/1735#issuecomment-175846419); the site can even forbid you from accessing robots.txt.
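
One way such a switch could look; the setting name below is purely hypothetical and does not exist in Scrapy, it only illustrates the idea of making the failure behaviour configurable.

```python
# Hypothetical sketch: neither the setting name nor this handler exist in Scrapy.
def _robots_error(self, failure, netloc):
    if self.crawler.settings.getbool('ROBOTSTXT_ALLOW_ON_ERROR', True):
        # Current behaviour: a failed robots.txt fetch allows everything.
        self._parsers[netloc] = None
    else:
        # Stricter opt-in: remember the failure and refuse requests to the domain.
        self._parsers[netloc] = failure
```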

eliasdorneles commented 8 years ago

Hey fellows, any chance we can agree and wrap this up so it makes it into the next release?

I see this already has a merge vote and it already improves things; I would have merged it if there weren't this additional discussion (bikeshedding? :bike: :circus_tent:) :)

Should the additional work (refactoring + documenting) be a separate issue?

eliasdorneles commented 8 years ago

Hey @kmike @nramirezuy @ArturGaspar, I'll merge this one now in order to move forward with the 1.1 RC release. @nramirezuy can you please open a ticket for the follow-up?