Open mohmad-null opened 7 years ago
Not against the feature, but one problem with disallowed_domains is that your spider could end up crawling the entire internet. Simple example: there is a link to yahoo.com in some page you crawled and it is not in disallowed_domains.
Result: your spider could end up crawling websites you don't want for days.
I had the same thoughts as @djunzu. What is your use case @mohmad-null?
I'm doing some general crawling down to a depth of 3 or 4 starting at various seed sites but not limiting it to certain domains.
I don't want to crawl certain big sites (like facebook, yahoo, google, etc.) that everyone seems to point at, only "smaller" sites. For this scenario a blacklist is much more useful than a whitelist. I've implemented it in middleware for now, but it'd still be a nice-to-have.
@mohmad-null, would you want to share your middleware by any chance?
@redapple: Sure. I have a global variable BLOCKED_DOMAINS set elsewhere like this:

```python
# Read the deny list: one domain per line
BLOCKED_DOMAINS = open(r'\\path\to\file\blocked_domains.txt').read().splitlines()
```
Which reads from a file that is basically just a list of domains:
yahoo.com
google.com
facebook.com
....
And then the middleware class looks like this:
```python
from scrapy.exceptions import IgnoreRequest


class BlockedDomains(object):
    # Bad domains that are full of stuff we don't want.

    def process_request(self, request, spider):
        url = request.url.lower()
        for domain in BLOCKED_DOMAINS:
            if domain in url:
                raise IgnoreRequest
```
There are probably better (and almost certainly more optimised) ways to do this.
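For what it's worth, here is a sketch of one tighter variant (not from the thread, just an illustration): parsing the hostname instead of substring-matching the whole URL avoids false positives such as notyahoo.com.example, and checking each parent domain handles subdomains. BLOCKED_DOMAINS is assumed to be the same deny list loaded above, ideally stored as a set.

```python
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest


class BlockedDomainsMiddleware(object):
    """Drop requests whose host is a blocked domain or one of its subdomains."""

    def process_request(self, request, spider):
        host = (urlparse(request.url).hostname or '').lower()
        parts = host.split('.')
        # Check the host and every parent domain (a.b.yahoo.com -> b.yahoo.com -> yahoo.com)
        for i in range(len(parts) - 1):
            if '.'.join(parts[i:]) in BLOCKED_DOMAINS:
                raise IgnoreRequest('Blocked domain: %s' % host)
        return None
```

Either way it is a downloader middleware, so it gets enabled through the DOWNLOADER_MIDDLEWARES setting.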
I guess this can be done by adding logic to the get_host_regex method in scrapy.spidermiddlewares.offsite.OffsiteMiddleware. What do you think @Gallaecio?
Yes, that's the middleware where this would ideally be implemented.
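For illustration only, a rough sketch of what a deny list built on that middleware could look like. The OffsiteDenyMiddleware subclass and the denied_domains spider attribute are made-up names, and it assumes the spider_opened / should_follow hooks of the current OffsiteMiddleware; the real change would likely go into get_host_regex as suggested above.

```python
import re

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.utils.httpobj import urlparse_cached


class OffsiteDenyMiddleware(OffsiteMiddleware):
    """Sketch: honour an optional `denied_domains` list on the spider."""

    def spider_opened(self, spider):
        super().spider_opened(spider)
        denied = getattr(spider, 'denied_domains', None) or []
        # Match the denied domains and any of their subdomains
        patterns = [r'(^|\.)%s$' % re.escape(d) for d in denied]
        self.denied_host_regex = re.compile('|'.join(patterns)) if patterns else None

    def should_follow(self, request, spider):
        if self.denied_host_regex is not None:
            host = urlparse_cached(request).hostname or ''
            if self.denied_host_regex.search(host):
                return False
        return super().should_follow(request, spider)
```

A subclass like this would also have to be registered in SPIDER_MIDDLEWARES in place of the stock OffsiteMiddleware.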
Feature request
Currently there's "allowed_domains" to create a whitelist of domains to scrape. It would be good if there were a "disallowed_domains" or "blocked_domains" as well. I appreciate I could probably do this in middleware, but I figure it's something quite a few people would want.
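To make the request concrete, the usage being asked for would presumably look something like this on the spider side (disallowed_domains does not exist in Scrapy today; it is just the hypothetical counterpart to allowed_domains):

```python
import scrapy


class SmallSitesSpider(scrapy.Spider):
    name = 'small_sites'
    start_urls = ['https://example.com/seeds']

    # Existing behaviour: only these domains would be followed
    # allowed_domains = ['example.com']

    # Requested behaviour: follow anything *except* these domains
    disallowed_domains = ['facebook.com', 'yahoo.com', 'google.com']
```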