Crawler processes out of scope page after redirect

spekulatius commented 3 years ago

Hello @freekmurze

Something I've noticed while working on my crawler project for Rankletter.com (and tested on it) is:

301 redirects aren't picked up unless the follow-redirects flag (RequestOptions::ALLOW_REDIRECTS) is passed to the crawler (expected).
With the flag activated, the crawler also processes the external page, regardless of whether it is in scope or not.

In my case I redirect rankletter.com/contact to my blog at peterthaleikis.com/blog (as a temporary solution, more or less). There the crawler picks up relative links as if they were on the domain I'm crawling (here rankletter.com).
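To illustrate the symptom, here is a minimal sketch in plain PHP. It does not use the crawler's own code: the helper `resolveRootRelative()` and the link `/about` are made up for the example, and I'm assuming the links in question are root-relative. It only shows how the same link ends up on a different host depending on whether it is resolved against the requested URL or the URL the redirect landed on.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of spatie/crawler): resolves a root-relative
 * link (one starting with "/") against the scheme and host of a page URL.
 */
function resolveRootRelative(string $pageUrl, string $link): string
{
    $scheme = parse_url($pageUrl, PHP_URL_SCHEME);
    $host = parse_url($pageUrl, PHP_URL_HOST);

    // Refuse to guess when the page URL is not absolute.
    if (!is_string($scheme) || !is_string($host)) {
        throw new InvalidArgumentException("Not an absolute URL: {$pageUrl}");
    }

    return $scheme . '://' . $host . '/' . ltrim($link, '/');
}

$requested = 'https://rankletter.com/contact';   // URL the crawler asked for
$landedOn  = 'https://peterthaleikis.com/blog';  // URL after the 301 redirect
$link      = '/about';                           // made-up link found on the blog page

// Resolved against the requested URL: looks like an in-scope page that may not exist.
echo resolveRootRelative($requested, $link), PHP_EOL; // https://rankletter.com/about

// Resolved against the page actually served: clearly external.
echo resolveRootRelative($landedOn, $link), PHP_EOL;  // https://peterthaleikis.com/about
```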

After briefly looking over the code, it looks like the CrawlRequestFulfilled class needs to be extended with a check for this.
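Roughly the kind of check I have in mind, as a standalone sketch rather than a patch: `effectiveUrl()` and `isSameHost()` are hypothetical helpers, not existing crawler methods, and "in scope" is simplified here to "same host". The redirect chain is passed in as a plain array. With Guzzle it could presumably be read from the `X-Guzzle-Redirect-History` response header, which is populated when the `track_redirects` sub-option of `allow_redirects` is enabled.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns the last URL of a redirect chain, or the
 * requested URL when no redirect happened.
 *
 * @param string[] $redirectHistory URLs visited after the initial request
 */
function effectiveUrl(string $requestedUrl, array $redirectHistory): string
{
    return $redirectHistory === [] ? $requestedUrl : end($redirectHistory);
}

/**
 * Hypothetical helper: checks whether the URL a request landed on still
 * belongs to the host being crawled (a simplified notion of "in scope").
 */
function isSameHost(string $crawlBaseUrl, string $effectiveUrl): bool
{
    $baseHost = parse_url($crawlBaseUrl, PHP_URL_HOST);
    $effectiveHost = parse_url($effectiveUrl, PHP_URL_HOST);

    // Treat unparsable URLs as out of scope rather than guessing.
    if (!is_string($baseHost) || !is_string($effectiveHost)) {
        return false;
    }

    return strtolower($baseHost) === strtolower($effectiveHost);
}

// The redirect from this issue: /contact ends up on the blog.
$final = effectiveUrl('https://rankletter.com/contact', ['https://peterthaleikis.com/blog']);

var_dump(isSameHost('https://rankletter.com', $final)); // bool(false)
```

If the check fails, the response handler could skip link extraction for that page instead of treating it as part of the crawled site.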

Just thought I'd let you know and check whether this is known (and maybe the reason for switching redirects off in the first place?).

Cheers, Peter

freekmurze commented 3 years ago

To be honest, it's been a while since I coded that part up, and I don't know anymore if it's intended or not 😬

spekulatius commented 3 years ago

That's fair enough. I totally get the "Ehm, yeah, maybe, maybe not. I can't remember" feeling :smile:

spatie / crawler

Crawler processes out of scope page after redirect #340