scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License
1.3k stars 217 forks

How to filter offsite requests when using seedlist? #203

Closed SeanPollock closed 8 years ago

SeanPollock commented 8 years ago

Hi,

I've activated the offsite middleware in my project.

It works when I add the allowed_domains property to my spider.

But I have transitioned to loading my start urls with a seedlist using the FileSeedLoader middleware. Ideally I would like to remove the hard-coded allowed_domains property from my spider as well.

What is the recommended way of filtering offsite requests when using seedlists? Is there a way I can base the allowed_domains off this seedlist?
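The seedlist-driven approach described above could be sketched roughly like this, assuming the seed file contains one URL per line (the format FileSeedLoader reads); the function name and comment-skipping behavior are illustrative, not part of Frontera:

```python
from urllib.parse import urlparse


def allowed_domains_from_seeds(path):
    """Collect the unique hostnames found in a seed file.

    Assumes one seed URL per line; blank lines and lines starting
    with '#' are skipped (a hypothetical convention, not guaranteed
    to match FileSeedLoader's parsing).
    """
    domains = set()
    with open(path) as f:
        for line in f:
            url = line.strip()
            if url and not url.startswith("#"):
                # netloc is the "hostname[:port]" part of the URL
                domains.add(urlparse(url).netloc)
    return sorted(domains)
```

The resulting list could then be assigned to the spider's `allowed_domains` in its `__init__`, before the offsite middleware reads it.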

sibiryakov commented 8 years ago

Hi Sean,

yes, that's possible. The links extracted in the Scrapy spider are followed later, so if you want to follow only links within the same website, you just need a small custom check that the extracted link's hostname (or netloc) matches the source page's; otherwise, don't yield the extracted link to Frontera.
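The check described above could look something like this; the helper name and the `parse` callback sketch are illustrative, not code from Frontera:

```python
from urllib.parse import urlparse


def is_same_site(source_url, link_url):
    # Compare the netloc ("hostname[:port]") of the source page
    # and the extracted link; only same-netloc links pass.
    return urlparse(source_url).netloc == urlparse(link_url).netloc


# Hypothetical use inside a Scrapy spider's callback
# (selector and request details are a sketch, not from the issue):
#
# def parse(self, response):
#     for href in response.css("a::attr(href)").getall():
#         link = response.urljoin(href)
#         if is_same_site(response.url, link):
#             yield scrapy.Request(link, callback=self.parse)
```

Note that comparing `netloc` treats subdomains (and differing ports) as different sites; a looser policy would compare registered domains instead.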

SeanPollock commented 8 years ago

Hi Alex,

Thanks for the response. I just wanted to make sure there wasn't a more canonical way of doing this.

By the way, great project. Coming from the Apache Nutch world, this is a breath of fresh air. A quickstart tutorial on starting Frontera with Kafka and HBase would be really useful for hitting the ground running, though.

Cheers, Sean