scrapinghub / portia

Visual scraping for Scrapy
BSD 3-Clause "New" or "Revised" License

Crawler regex to exclude links #703

Closed CTPAYC23 closed 7 years ago

CTPAYC23 commented 7 years ago

Hi,

I'm crawling a site with URLs in the format: http://website.com/product/brand/id123/

I'm trying to exclude links like:

http://website.com/product/brand/id123/doc/
http://website.com/product/brand/id123/txt/

so I'm configuring crawling exclusions. I've tried each of these patterns:

doc
/doc/
\/doc\/
.*\/doc\/

but I'm still seeing a lot of crawl attempts for http://website.com/product/brand/id123/doc/ pages in the Scrapinghub request log. The ratio of scraped items to requests is too low, and the job eventually gets stopped.

What is the right way of excluding URLs like these please?

ruairif commented 7 years ago

Can you try using a follow rule like: \d+(/$|$) instead?
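
A quick way to check what that follow rule accepts is to run it against the URLs from the question. This is only a sketch: it assumes the pattern is applied as an unanchored search against the full URL (as Python's re.search does), which this thread does not confirm for Portia. The names FOLLOW_RULE and matches_follow_rule are made up for illustration.

```python
import re

# Follow rule from the comment above: digits at the very end of the URL,
# optionally followed by a single trailing slash.
FOLLOW_RULE = re.compile(r"\d+(/$|$)")


def matches_follow_rule(url):
    """Return True if the follow rule matches anywhere in the URL.

    Assumption: unanchored search semantics, not a full-string match.
    """
    return FOLLOW_RULE.search(url) is not None


if __name__ == "__main__":
    for url in [
        "http://website.com/product/brand/id123/",
        "http://website.com/product/brand/id123/doc/",
        "http://website.com/product/brand/id123/txt/",
    ]:
        print(url, matches_follow_rule(url))
```

Under that assumption, only the product URL matches, because the /doc/ and /txt/ URLs do not end in digits, so they would not be followed even without an exclude rule.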

CTPAYC23 commented 7 years ago

@ruairif, OK, trying that now. Can I still combine follow and exclude rules?

ruairif commented 7 years ago

You can. The crawler follows all links that match the follow rules but do not match the exclude rules.
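
The combination described here can be sketched in plain Python. This is not Portia's actual code: it assumes each rule is a regular expression searched against the full URL, that matching any one follow rule is enough, and that matching any one exclude rule blocks the link. The function name should_follow and the exclude pattern /(doc|txt)/ are hypothetical, chosen to fit the URLs in the question.

```python
import re

# Hypothetical rule sets for the site in the question.
FOLLOW_PATTERNS = [r"\d+(/$|$)"]     # follow rule suggested in this thread
EXCLUDE_PATTERNS = [r"/(doc|txt)/"]  # illustrative exclude rule (assumption)


def should_follow(url, follow=FOLLOW_PATTERNS, exclude=EXCLUDE_PATTERNS):
    """Follow a link only if it matches a follow rule and no exclude rule.

    Assumption: rules are applied with unanchored regex search.
    """
    matches_follow = any(re.search(pattern, url) for pattern in follow)
    matches_exclude = any(re.search(pattern, url) for pattern in exclude)
    return matches_follow and not matches_exclude


if __name__ == "__main__":
    for url in [
        "http://website.com/product/brand/id123/",
        "http://website.com/product/brand/id123/doc/",
    ]:
        print(url, should_follow(url))
```

With these sample rules the exclude pattern is redundant for /doc/ and /txt/ pages, since the follow rule already rejects them. It would start to matter if a looser follow rule such as /product/ were used.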