Crawler regex to exclude links

CTPAYC23 commented 7 years ago

Hi,

I'm crawling a site with URLs in the format: http://website.com/product/brand/id123/

I'm trying to exclude links like http://website.com/product/brand/id123/doc/ and http://website.com/product/brand/id123/txt/

so I'm configuring crawling exclusions. I tried each of these patterns: doc, /doc/, \/doc\/ and .*\/doc\/

but I'm still seeing a lot of crawling attempts for http://website.com/product/brand/id123/doc/ pages in the Scrapinghub request log. The ratio of scraped items to requests is too low, and the job eventually gets stopped.

What is the right way to exclude URLs like these, please?

ruairif commented 7 years ago

Can you try using a follow rule like: \d+(/$|$) instead?

CTPAYC23 commented 7 years ago

@ruairif, OK, trying that now. Can I still combine follow and exclude rules successfully?

ruairif commented 7 years ago

You can. It follows all links that match the follow rules but do not match the exclude rules.
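
For anyone reading later, here is a rough Python sketch of how the follow and exclude rules described above combine. It is not Portia's actual implementation: the function name `should_follow`, the `/txt/` exclude pattern, and the use of `re.search` (matching anywhere in the URL) are assumptions made for illustration only.

```python
import re

# Follow rule suggested in this thread: the URL must end in digits,
# optionally followed by a single trailing slash.
FOLLOW_PATTERNS = [r"\d+(/$|$)"]

# Exclude rules (assumed for illustration) for the unwanted sub-pages.
EXCLUDE_PATTERNS = [r"/doc/", r"/txt/"]


def should_follow(url, follow=FOLLOW_PATTERNS, exclude=EXCLUDE_PATTERNS):
    """Return True if url matches a follow rule and no exclude rule.

    Hypothetical helper: assumes patterns are searched anywhere in the URL.
    """
    matches_follow = any(re.search(p, url) for p in follow)
    matches_exclude = any(re.search(p, url) for p in exclude)
    return matches_follow and not matches_exclude


if __name__ == "__main__":
    for url in (
        "http://website.com/product/brand/id123/",
        "http://website.com/product/brand/id123/doc/",
        "http://website.com/product/brand/id123/txt/",
    ):
        print(url, should_follow(url))
```

Under these assumptions the follow rule alone already rejects the /doc/ and /txt/ pages, because those URLs do not end in digits; the exclude rules act as a second filter on top of it.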

scrapinghub / portia

Crawler regex to exclude links #703