I'm trying to figure out how this works.
So far, I've connected my spider to Redis with three test domains.
When I start the spider, I can see the first requests hitting the websites.
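For context, I seed the queue roughly like this (connection details and the three domains below are placeholders, not my real setup):

```python
import redis

# Minimal seeding sketch: push start URLs onto the list the spider reads from.
r = redis.Redis(host='localhost', port=6379)
for url in ('https://news-a.example', 'https://news-b.example', 'https://news-c.example'):
    r.lpush('mycrawler:start_urls', url)
```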
What I don't understand now is:
How are the URLs that the LinkExtractor finds fed back into Redis?
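To make the question concrete: would I have to push extracted links back into Redis myself, something like the purely hypothetical method below, or does the scrapy-redis scheduler queue follow-up requests in Redis on its own?

```python
# Hypothetical callback inside the spider -- NOT what I'm doing now.
# self.server is the redis connection that scrapy-redis spiders hold.
def parse_page(self, response):
    for href in response.css('a::attr(href)').getall():
        self.server.lpush(self.redis_key, response.urljoin(href))
```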
And I assume my crawler is being "stopped" at:
`domain = kwargs.pop('domain', '')`
`kwargs` is always an empty dict. Where does it come from?
It seems like I initialize `self.allowed_domains` with an empty list of domains, so the crawler can't start.
What's the right way to do this?
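To illustrate what I believe happens in `__init__` when no spider arguments are passed (i.e. no `-a domain=...` on the `scrapy crawl` command line):

```python
# Reproduction of the __init__ logic with empty kwargs:
kwargs = {}
domain = kwargs.pop('domain', '')                # -> ''
allowed = list(filter(None, domain.split(',')))  # ''.split(',') == [''], filtered to []
print(allowed)                                   # [] -> allowed_domains ends up empty
```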
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawlerSpider(RedisCrawlSpider):
    """Spider that reads urls from redis queue (mycrawler:start_urls)."""
    name = "redis_my_crawler"
    redis_key = 'mycrawler:start_urls'

    # Only the first rule whose extractor matches a link is applied, so one
    # rule with both a callback and follow=True replaces two identical ones.
    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True,
             process_links='filter_links'),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        print('Init')
        print(args)
        print(kwargs)  # always an empty dict for me
        domain = kwargs.pop('domain', '')
        print(domain)
        # list(...): in Python 3, filter() returns a one-shot iterator.
        self.allowed_domains = list(filter(None, domain.split(',')))
        print(self.allowed_domains)
        super().__init__(*args, **kwargs)

    def filter_links(self, links):
        # Trailing comma makes this a tuple; ('news') would be a plain string.
        allowed_strings = ('news',)
        allowed_links = []
        for link in links:
            if (any(s in link.url.lower() for s in allowed_strings)
                    and any(domain in link.url for domain in self.allowed_domains)):
                print(link)
                allowed_links.append(link)
        return allowed_links

    def parse_page(self, response):
        print(response.url)
```
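Side note: in my first version I had `allowed_strings = ('news')`, without the trailing comma. That is a plain string, not a one-element tuple, and it silently changes what `filter_links` matches:

```python
url = 'https://example.com/sport'
# ('news') is just the string 'news'; any() then iterates its characters,
# so any single character of 'news' appearing in the URL is enough to match:
print(any(s in url for s in ('news')))   # True
# ('news',) is a one-element tuple, so the whole word must appear:
print(any(s in url for s in ('news',)))  # False
```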