scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

disallow_domains #2376

Open mohmad-null opened 7 years ago

mohmad-null commented 7 years ago

Feature request: currently there's "allowed_domains" to create a whitelist of domains to scrape.

It would be good if there were a "disallowed_domains" or "blocked_domains" as well. I appreciate I could probably do this in middleware, but I figure it's something quite a few people would want.
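For illustration only, the proposed attribute might sit on a spider the same way allowed_domains does. This is a sketch of the request, not an existing Scrapy feature; disallowed_domains below is hypothetical:

import scrapy

class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]
    # existing behaviour: only follow links within these domains
    # allowed_domains = ["example.com"]
    # proposed (hypothetical): follow anything except these domains
    disallowed_domains = ["facebook.com", "google.com", "yahoo.com"]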

djunzu commented 7 years ago

Not against the feature, but one problem with disallowed_domains is that your spider could end up crawling the entire internet. Simple example: there is a link to yahoo.com on some page you crawled and it is not in disallowed_domains. Result: your spider could spend days crawling websites you don't want.

kmike commented 7 years ago

I had the same thoughts as @djunzu. What is your use case @mohmad-null?

mohmad-null commented 7 years ago

I'm doing some general crawling down to a depth of 3 or 4, starting at various seed sites but not limiting it to certain domains.

I don't want to crawl certain big sites (like Facebook, Yahoo, Google, etc.) that everyone seems to point at, only "smaller" sites. For this scenario a blacklist is much more useful than a whitelist. I've implemented it in middleware for now, but it'd still be a nice-to-have.
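For context, that kind of setup might be wired roughly like this in settings.py. DEPTH_LIMIT and DOWNLOADER_MIDDLEWARES are standard Scrapy settings; the module path is a placeholder for whatever blacklist middleware is used (e.g. the one shared later in this thread):

# settings.py (sketch)
DEPTH_LIMIT = 3  # stop following links beyond depth 3

DOWNLOADER_MIDDLEWARES = {
    # hypothetical module path for a domain-blacklist downloader middleware
    'myproject.middlewares.BlockedDomains': 543,
}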

redapple commented 7 years ago

@mohmad-null, would you want to share your middleware by any chance?

mohmad-null commented 7 years ago

@redapple: Sure. I have a global variable BLOCKED_DOMAINS set elsewhere like this:

BLOCKED_DOMAINS = open('\\path\to\file\blocked_domains.txt').read().splitlines()

This reads from a file that is basically just a list of domains:

yahoo.com
google.com
facebook.com
....

And then the middleware class looks like this:

from scrapy import exceptions


class BlockedDomains(object):

    # Downloader middleware: drop requests to bad domains that are full of stuff we don't want.
    def process_request(self, request, spider):
        url = request.url.lower()
        for domain in BLOCKED_DOMAINS:
            if domain in url:
                raise exceptions.IgnoreRequest

There are probably better (and almost certainly more optimised) ways to do this.
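For reference, a slightly tighter variant of the same idea, as a sketch only (the BLOCKED_DOMAINS setting name is made up): it parses the request's hostname instead of substring-matching the whole URL, so a URL that merely contains "yahoo.com" in its path or query string is not dropped by accident, while subdomains such as www.yahoo.com are still caught.

from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest


class BlockedDomainsDownloaderMiddleware:
    """Drop requests whose hostname is a blocked domain or a subdomain of one."""

    def __init__(self, blocked_domains):
        self.blocked_domains = set(blocked_domains)

    @classmethod
    def from_crawler(cls, crawler):
        # BLOCKED_DOMAINS is a custom (hypothetical) setting holding a list of domains
        return cls(crawler.settings.getlist('BLOCKED_DOMAINS'))

    def process_request(self, request, spider):
        host = urlparse(request.url).hostname or ''
        # block exact matches and subdomains (e.g. www.yahoo.com)
        if any(host == d or host.endswith('.' + d) for d in self.blocked_domains):
            raise IgnoreRequest(f'Blocked domain: {host}')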

felipeboffnunes commented 1 year ago

I guess this can be done by adding logic to the get_host_regex method in scrapy.spidermiddlewares.offsite.OffsiteMiddleware. What do you think, @Gallaecio?

Gallaecio commented 1 year ago

Yes, that's the middleware where this would ideally be implemented.
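A minimal sketch of that approach, assuming a hypothetical disallowed_domains spider attribute (neither the attribute nor the subclass below is part of Scrapy's API; OffsiteMiddleware, should_follow and urlparse_cached are existing Scrapy internals):

import re

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.utils.httpobj import urlparse_cached


class OffsiteWithDisallowMiddleware(OffsiteMiddleware):
    # Extends the stock offsite spider middleware with a blacklist check.

    def spider_opened(self, spider):
        super().spider_opened(spider)
        domains = getattr(spider, 'disallowed_domains', None) or []
        if domains:
            # match the blocked domains and any of their subdomains
            pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in domains)
            self.disallowed_regex = re.compile(pattern)
        else:
            self.disallowed_regex = None

    def should_follow(self, request, spider):
        # keep the existing allowed_domains behaviour
        if not super().should_follow(request, spider):
            return False
        if self.disallowed_regex is None:
            return True
        host = urlparse_cached(request).hostname or ''
        return not self.disallowed_regex.search(host)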