rmax / scrapy-redis

Redis-based components for Scrapy.
http://scrapy-redis.readthedocs.io
MIT License

How does the CrawlSpider work? #284

Open bch80 opened 1 year ago

bch80 commented 1 year ago

Description

Hello,

I'm trying to figure out how this works. So far, I've connected my spider to Redis with three test domains. When I start the spider, I can see the first hit on the websites.

What I don't understand now is: How are the URLs that the LinkExtractor finds fed back into Redis?
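For reference, these are roughly the scrapy-redis settings I have enabled (a minimal sketch; the Redis URL is a placeholder). My assumption so far is that the `redis_key` list only seeds the start URLs, and that follow-up requests go through the scrapy-redis scheduler instead:

```python
# settings.py (sketch) -- the scrapy-redis pieces I have enabled.
# With this scheduler, every request the spider yields (including requests
# built from the CrawlSpider rules) should be queued in Redis rather than
# re-added to the mycrawler:start_urls list.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Request fingerprints (deduplication) are also kept in Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dupefilter between runs instead of clearing them.
SCHEDULER_PERSIST = True

# Placeholder connection URL.
REDIS_URL = "redis://localhost:6379/0"
```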

And I assume my crawler is being "stopped" at `domain = kwargs.pop('domain', '')`: kwargs is always an empty dict. Where does it come from?
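As far as I can tell, `domain` is meant to be supplied as a spider argument, e.g. `scrapy crawl redis_my_crawler -a domain=example.com,example.org`, or programmatically (a minimal sketch; `MyCrawlerSpider` is the spider shown below and the domain values are placeholders):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Import of MyCrawlerSpider omitted; it is the spider class shown below.
# Keyword arguments passed to crawl() are forwarded to the spider's __init__,
# so this is one way kwargs['domain'] would get populated.
process = CrawlerProcess(get_project_settings())
process.crawl(MyCrawlerSpider, domain="example.com,example.org")
process.start()
```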

It seems like I initialize self.allowed_domains with an empty list of domains, so the crawler can't start. How do I do this correctly?

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawlerSpider(RedisCrawlSpider):
    """Spider that reads urls from redis queue (mycrawler:start_urls)."""

    name = "redis_my_crawler"
    redis_key = 'mycrawler:start_urls'

    rules = (
        Rule(LinkExtractor(), follow=True, process_links="filter_links"),
        Rule(LinkExtractor(), callback='parse_page', follow=True, process_links="filter_links"),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        print('Init')
        print(args)
        print(kwargs)
        domain = kwargs.pop('domain', '')
        print(domain)
        self.allowed_domains = filter(None, domain.split(','))
        print(self.allowed_domains)
        super(MyCrawlerSpider, self).__init__(*args, **kwargs)

    def filter_links(self, links):
        # Only keep links whose URL contains one of these substrings and one
        # of the allowed domains.
        allowed_strings = ('news',)
        allowed_links = []
        for link in links:
            if (any(s in link.url.lower() for s in allowed_strings)
                    and any(domain in link.url for domain in self.allowed_domains)):
                print(link)
                allowed_links.append(link)

        return allowed_links

    def parse_page(self, response):
        print(response.url)
        return None
```
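One possible fix would be to materialize `allowed_domains` as a real list and fall back to hard-coded test domains when no `domain` argument is given (a rough, untested sketch; the fallback domains are placeholders):

```python
from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawlerSpider(RedisCrawlSpider):
    # name, redis_key and rules as above ...

    def __init__(self, *args, **kwargs):
        domain = kwargs.pop('domain', '')
        # In Python 3, filter() returns a lazy iterator; build a real list so it
        # can be iterated repeatedly (filter_links runs for every response) and
        # so an empty value is easy to detect.
        self.allowed_domains = [d.strip() for d in domain.split(',') if d.strip()]
        if not self.allowed_domains:
            # Placeholder fallback for when no -a domain=... argument is given.
            self.allowed_domains = ['example.com', 'example.org']
        super().__init__(*args, **kwargs)
```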