403s will only capture some of the bans and will fail to work correctly in many cases. Replicating the Crawlera ban detection logic is a huge amount of work, and exporting the rules from Crawlera would require making them public and/or open-sourcing some of the Crawlera internals.
Another drawback: how does this get enabled? Presumably a user must experience a ban somewhere first, then think to search for a solution or contact support? We can't enable it preemptively since it will incur a cost and may delay or break some existing code. If we want users to enable it, we need to ask them, and it may be irrelevant to them or add a barrier somewhere we don't want one (e.g. at signup).
I'd like to propose that we do not add this functionality here.
How about instead we send the user a notification when a job experiences a ban? Perhaps via Intercom? We could provide instructions on how to sign up for Crawlera, if needed, or just a link to enable it for that spider and reschedule if desired. The ban detection could be implemented using the requests stored in Hubstorage and the Crawlera rules. It won't work for cases where the body of the request is needed, but those are very rare and there are workarounds.
The advantage of this approach is that we're telling the user about Crawlera at the precise moment it is useful to them, and we can reuse our Crawlera rules without worrying about them leaking externally.
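For illustration, mining the stored requests could look roughly like the sketch below, using the scrapinghub Python client. The ban status list and the 5% threshold are made-up placeholders, not the actual Crawlera rules:

```python
# Sketch only: count ban-like responses in a job's stored request log.
# BAN_STATUSES and the threshold are illustrative assumptions.
from scrapinghub import ScrapinghubClient

BAN_STATUSES = {403, 429, 503}  # placeholder, not the real Crawlera rules

def job_looks_banned(apikey, job_key, threshold=0.05):
    client = ScrapinghubClient(apikey)
    job = client.get_job(job_key)  # e.g. "123456/1/2"
    total = banned = 0
    for req in job.requests.iter():
        total += 1
        if req.get('status') in BAN_STATUSES:
            banned += 1
    return total > 0 and banned / total >= threshold
```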
This is not to offer Crawlera to those who don't know about it; it's to make it more convenient for Crawlera users to, well, use Crawlera with their spiders.
+1. The issue is that with this middleware, the decision to use Crawlera is made at the spider level -- either a spider uses Crawlera or it doesn't. If you want more fine-grained control, you have to manually set dont_proxy in request.meta.
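For reference, the per-request opt-out mentioned above looks roughly like this, assuming the middleware is enabled for the spider as a whole:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    # CRAWLERA_ENABLED = True and CRAWLERA_APIKEY are set in settings.py

    def start_requests(self):
        # Bypass Crawlera for this particular request only.
        yield scrapy.Request('https://example.com/',
                             meta={'dont_proxy': True},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info('Fetched %s directly (no Crawlera)', response.url)
```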
It would be nice if it had an option to enable Crawlera automatically when a certain number of requests are failing.
We could also support extending the rule for triggering the activation, so that it is not only triggered by 403s.
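A hypothetical sketch of what such a trigger could look like as a small downloader middleware. The setting names, status list and threshold are invented for illustration; none of this exists in scrapy-crawlera today:

```python
# Hypothetical sketch: flip Crawlera on for a spider once enough
# ban-like responses have been seen. Setting names are invented.
class AutoEnableCrawleraMiddleware:

    def __init__(self, ban_statuses, threshold):
        self.ban_statuses = ban_statuses
        self.threshold = threshold
        self.ban_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            ban_statuses={int(s) for s in crawler.settings.getlist(
                'CRAWLERA_AUTOENABLE_STATUSES', [403])},
            threshold=crawler.settings.getint(
                'CRAWLERA_AUTOENABLE_THRESHOLD', 1),
        )

    def process_response(self, request, response, spider):
        if response.status in self.ban_statuses:
            self.ban_count += 1
            if self.ban_count >= self.threshold:
                # Assumes a cooperating Crawlera middleware that re-checks
                # this flag per request; the current scrapy-crawlera
                # middleware reads it when the spider opens, so this is
                # only an illustration of the idea.
                spider.crawlera_enabled = True
        return response
```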
Why did you choose 403 instead of 503 or 429? Is it the most common ban code? I think it won't take long until someone proposes extending the HTTP status list, or inspecting the body for captcha boxes, or for "Sorry, no results for you" text in 200 responses. Users will surely need different rules for different websites, so let's add that too.
I'm with @shaneaevans on this: we would be recreating ban detection logic and exposing what took years to collect and build as a Crawlera business asset.
This is not to offer Crawlera to those who don't know about it; it's to make it more convenient for Crawlera users to, well, use Crawlera with their spiders.
So the idea is not to help users find out about Crawlera as a solution to their (as yet unknown?) banning or throttling problems, but to optionally route requests through Crawlera on the first ban, after they have already activated Crawlera, to help avoid bans and/or increase the crawl rate.
If users already activated Crawlera because of bans, why would they want this enable-on-first-ban feature? Once enabled, it won't persist across job runs, so each job will go through the delay of triggering a ban before the middleware kicks in.
We enable the AutoThrottle extension by default because we want crawlers running on our platform to be polite, but this extension is not compatible with Crawlera, so there are two options:
I don't think option 2 is something we want.
Option 1 is a problem too because AT is very conservative about throttling: it rarely triggers a ban, but it doesn't provide a good request rate either. This is another area where Crawlera's knowledge base about websites excels; Crawlera knows what req/s limit a website can sustain. Do we want to maintain a list of throttling limits per website in the extension? No, please.
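For context, the tension is roughly between the default polite profile and the profile Crawlera users typically run with. The values below are illustrative only; check the current scrapy-crawlera docs before relying on them:

```python
# settings.py -- illustrative only
# Default "polite" profile on the platform:
AUTOTHROTTLE_ENABLED = True

# Typical profile when Crawlera handles throttling instead
# (the scrapy-crawlera docs generally recommend disabling AutoThrottle
# and letting Crawlera pace the requests):
# AUTOTHROTTLE_ENABLED = False
# DOWNLOAD_DELAY = 0
# CONCURRENT_REQUESTS_PER_DOMAIN = 32  # example value, tune to the plan
```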
In general I prefer @shaneaevans' idea of detecting bans by mining the information stored in Hubstorage and suggesting to our users that they enable/disable Crawlera from the start, per spider.
If users already activated Crawlera because of bans, why would they want this enable-on-first-ban feature?
Well, my understanding of the motivation is that users want to simply enable Crawlera for ALL of their spiders (even ones that currently work without bans), and the middleware would use Crawlera for them automatically, only when needed, so as not to use too many resources.
Yes, it would involve some guessing about what is "needed"; I was imagining something a lot simpler than what's in Crawlera.
The point is easing the trouble of having to enable (or disable) it on a spider-by-spider or job-by-job basis. It's not perfect, but Crawlera isn't perfect either (and never will be). It's not about replicating all the Crawlera ban rules in the middleware either. I'll try to find some time to elaborate on this more.
@pablohoffman did you find the time to elaborate on this a little more 😄?
From my time on Crawlera I'm very much with @dangra and @shaneaevans: whether to use Crawlera or not depends heavily on the website.
The only use case I have seen, and it popped up this week on Slack, was broad crawls where you don't want Crawlera beforehand, but when a specific status code pops up you might want to enable Crawlera for that request.
If scrapy-crawlera is not supposed to be smart about it, and only the user configures the rules for enabling Crawlera, I guess it could be OK.
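A sketch of that broad-crawl pattern: the middleware is enabled for the spider, every request opts out via dont_proxy by default, and a request is re-issued through Crawlera when a ban-like response shows up. The 403 check stands in for whatever "specific code" the user cares about:

```python
import scrapy

class BroadCrawlSpider(scrapy.Spider):
    name = 'broadcrawl'
    start_urls = ['https://example.com/']
    # CRAWLERA_ENABLED = True in settings.py, but every request starts
    # with dont_proxy, so nothing goes through Crawlera by default.
    handle_httpstatus_list = [403]  # let ban responses reach the callback

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'dont_proxy': True})

    def parse(self, response):
        if response.status == 403:  # stand-in for the user's own rule
            # Re-issue just this request through Crawlera by dropping
            # the dont_proxy opt-out.
            meta = {k: v for k, v in response.meta.items()
                    if k != 'dont_proxy'}
            yield response.request.replace(meta=meta, dont_filter=True)
            return
        # ... normal parsing and link extraction ...
```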
It would be great to enable Crawlera on demand once we detect bans from a website.
We could add some logic to the middleware to enable Crawlera once it detects the first 403 from the website, as opposed to it being always on or off.
This would serve two purposes:
The user would just set CRAWLERA_ENABLED = "auto" and let it enable itself when needed. We could also support extending the rule for triggering the activation, so that it is not only triggered by 403s.
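If that direction were pursued, the user-facing side might be nothing more than a couple of settings. The "auto" value and the extra setting below are hypothetical and do not exist in scrapy-crawlera:

```python
# settings.py -- hypothetical sketch; only CRAWLERA_APIKEY and a boolean
# CRAWLERA_ENABLED exist today, the "auto" mode is the proposal here.
CRAWLERA_APIKEY = '<your api key>'
CRAWLERA_ENABLED = "auto"                        # off until a ban is detected
CRAWLERA_AUTOENABLE_STATUSES = [403, 429, 503]   # extendable trigger rule
```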