Closed ArturGaspar closed 8 years ago
Coverage is 83.30%. Merging #1735 into master will increase coverage by +0.01% as of 380e480.
LGTM.
I feel this component needs a slight rework. Making the robots.txt request and appending other requests to the result should be two different methods/callbacks.
Why is `.netloc` removed? Shouldn't it just store some value, maybe the failure? Removing it causes the request to be retried in the future, even though it was probably already retried by the RetryMiddleware.
Why not use `Request.callback` and `Request.errback`? It would make it easier to understand.

@nramirezuy can refactoring this middleware be a separate issue? It has a few other problems: it doesn't handle Crawl-Delay directives (https://github.com/scrapy/scrapy/issues/892), and it breaks on robots.txt files found in the wild (https://github.com/scrapy/scrapy/issues/754). Or do you feel refactoring will help with this particular issue?
@kmike Yes, the second point should fix the issue: just avoid every request to a domain whose robots.txt failed to download. For example, what if the domain just blocked the UA you are using, or the crawler's IP?
Storing the failure might be overkill; just storing None should be enough. You can also collect stats to see the different failures you are getting on robots.txt requests.
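The suggestion above can be sketched with a stdlib-only cache. This is not Scrapy code — `RobotsCache` and its method names are invented for illustration — but it shows the idea: remember `None` for a failed robots.txt download so the domain is not re-fetched, and treat the missing parser as allow-all.

```python
# Hypothetical sketch of the "store None on failure" suggestion.
# Not Scrapy's implementation; names are invented for illustration.
from urllib.robotparser import RobotFileParser

class RobotsCache:
    def __init__(self):
        # netloc -> parsed robots.txt, or None when the download failed
        self._parsers = {}

    def store_success(self, netloc, robots_body):
        parser = RobotFileParser()
        parser.parse(robots_body.splitlines())
        self._parsers[netloc] = parser

    def store_failure(self, netloc):
        # Remember None instead of the Failure object, so the robots.txt
        # request is not re-issued for every new request to the domain.
        self._parsers[netloc] = None

    def allowed(self, netloc, useragent, url):
        parser = self._parsers.get(netloc)
        if parser is None:
            return True  # unknown or failed robots.txt: allow all
        return parser.can_fetch(useragent, url)
```

A stats counter per failure type could be incremented inside `store_failure`, matching the "collect stats" suggestion.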
@nramirezuy I'm not sure why I was removing the netloc in the first place. All tests still pass when leaving it as `None`, and that is also what the middleware already did, so I'm doing it that way again.
I don't understand your first suggestion, and I'm not using callback and errback because the previous code didn't use them. I think blocking every request when robots.txt fails to download is inconsistent with the behaviour on other errors (and incompatible with the previous behaviour on download errors), which is to treat it as allowing all.
@ArturGaspar I just looked at the previous version, before the Deferred. It looks like the robots.txt request was made only once, and if it failed, it let you download from the site; it also let you download while the robots.txt request wasn't yet handled.

Can we document the behaviour? With the Deferred implementation the behaviour changed: it now waits for the robots.txt request to be downloaded.
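The behaviour change described above can be simulated in a few lines of stdlib Python. This is not Scrapy's implementation (`DeferredRobotsGate` and its API are invented for illustration): requests to a netloc are queued until that netloc's robots.txt has been resolved, whether it succeeded or failed.

```python
# Hypothetical simulation of the post-Deferred behaviour: requests wait
# until robots.txt is handled. Not Scrapy code; names are invented.
class DeferredRobotsGate:
    def __init__(self):
        self._pending = {}   # netloc -> list of (url, send) waiting
        self._resolved = {}  # netloc -> parser, or None on failure

    def request(self, netloc, url, send):
        if netloc in self._resolved:
            send(url)  # robots.txt already handled: proceed immediately
        else:
            # robots.txt still in flight: queue instead of sending
            self._pending.setdefault(netloc, []).append((url, send))

    def robots_resolved(self, netloc, parser):
        # Called when robots.txt downloads (parser) or fails (None);
        # either way, release the queued requests for that netloc.
        self._resolved[netloc] = parser
        for url, send in self._pending.pop(netloc, []):
            send(url)
```

Before the Deferred, `request()` would have called `send(url)` unconditionally, which is why requests could slip through before robots.txt arrived.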
@nramirezuy I think that not downloading when robots.txt request fails is a bug, but waiting for robots.txt to be downloaded before proceeding with the main website is a bug fix.
@kmike I agree, that is how it is supposed to be implemented. But we should also document it, because that will inform users of the middleware and will help us review it in the future.
@nramirezuy The documentation mentioned it as a limitation, and #1473 removed that warning from the documentation.
@ArturGaspar But we should add a warning about what happens when the robots.txt request fails (when it reaches the errback), and also mention that it goes through the downloader middlewares but not the spider middlewares.
Sorry, I'm just dumping everything while I'm on it. Otherwise it gets lost in the void.
@nramirezuy Isn't it better to make a separate issue for improving documentation on this middleware?
Sorry, I'm falling behind :) What currently happens if the robots.txt request fails?
@kmike

> Is it retried?

By the default retry middleware. (Without this PR, it would be retried on every new request to a domain, and only remembered for that domain if it succeeded.)

> If the robots.txt request fails, is the main request sent? If so, does it happen before or after retries?

Yes. After the retries.

> Is an error with robots.txt logged?

Yes.

> Is the behaviour different in Scrapy 1.0.x?

With this PR, no, except for the previously documented limitation (that in 1.0 the first request is made before robots.txt, and some requests can still go through before it is downloaded).
:+1: that's perfect
> Is it retried?
>
> By the default retry middleware. (Without this PR, it would be retried on every new request to a domain, and only remembered for that domain if it succeeded.)

Depends on the position of RobotsTxtMiddleware in the chain; with the default order, the answer is valid.

> If the robots.txt request fails, is the main request sent? If so, does it happen before or after retries?
>
> Yes. After the retries.

Depends on the position of RobotsTxtMiddleware in the chain; with the default order, the answer is valid.

> Is the behaviour different in Scrapy 1.0.x?
>
> With this PR, no, except for the previously documented limitation (that in 1.0 the first request is made before robots.txt, and some requests can still go through before it is downloaded).

The 1.0.x behaviour was bogus, so it has to change. Maybe in another PR.
@ArturGaspar can you create a PR changing the behaviour (https://github.com/scrapy/scrapy/pull/1735#issuecomment-176225527), along with its documentation?
@nramirezuy I think the position of the middleware doesn't matter, because it calls engine.download() for a new request, which goes through the entire middleware chain. Isn't that so?
> @ArturGaspar can you create a PR changing the behaviour (#1735 (comment)), with its documentation?

Sorry, I'm confused. Do you mean documenting only what was improved in #1473 and this PR, or the behaviour that was already present (what happens when robots.txt fails, etc.)?
@ArturGaspar Yeah, you are right, it will work regardless. What I'm not sure about is whether it gets propagated to the spider middlewares, and whether it is the same for errors and normal responses.
We should first fix this (https://github.com/scrapy/scrapy/pull/1735#issuecomment-176225527), then document the change. It will also be useful to document that it does pass through the downloader middlewares, and whether it gets propagated to the spider middlewares too, so we don't have to figure it out every time we want to use it.
@nramirezuy sorry, what exactly should we fix?
> I think that not downloading when the robots.txt request fails is a bug

This one should be settable (https://github.com/scrapy/scrapy/pull/1735#issuecomment-175846419); the site can even forbid you from accessing robots.txt.
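The "should be settable" idea amounts to one configurable branch. A hedged sketch, stdlib-only: `allow_on_failure` is an invented parameter name, not a real Scrapy setting, and `is_allowed` is not Scrapy's API.

```python
# Hypothetical sketch of making failure handling configurable.
# allow_on_failure is an invented knob, not a real Scrapy setting.
from urllib.robotparser import RobotFileParser

def is_allowed(parser, useragent, url, allow_on_failure=True):
    """parser is a parsed robots.txt, or None if the download failed."""
    if parser is None:
        # The site blocked or failed the robots.txt request itself:
        # the operator decides whether that means allow-all or deny-all.
        return allow_on_failure
    return parser.can_fetch(useragent, url)
```

With `allow_on_failure=False`, a site that forbids access to robots.txt would be skipped entirely, which is the stricter behaviour argued for above.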
Hey fellows, any chance we can agree and wrap this up to include it in the next release?
I see this has a merge vote already, and it already improves things; I would have merged it if there wasn't this additional discussion (bikeshedding? :bike: :circus_tent:) :)
Should the additional work (refactoring + documenting) be a separate issue?
Hey @kmike @nramirezuy @ArturGaspar, I'll merge this one now in order to move forward with the 1.1 RC release. @nramirezuy can you please open a ticket for the follow-up?
Fixes #1733