spatie / crawler

An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
https://freek.dev/308-building-a-crawler-in-php
MIT License
2.53k stars, 358 forks

Crawler processes out of scope page after redirect #340

Closed: spekulatius closed this 3 years ago

spekulatius commented 3 years ago

Hello @freekmurze

Something I've noticed while working on my crawler project for Rankletter.com (and tested there) is:

In my case I redirect rankletter.com/contact to my blog at peterthaleikis.com/blog (as a temporary solution, more or less). After following that redirect, the crawler picks up the relative links on the blog page as if they were on the domain I'm crawling (here rankletter.com).

Looking over the code briefly, it seems the CrawlRequestFulfilled class needs to be extended with a check for this.
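To illustrate what I mean, here is a minimal sketch of the URL resolution involved. It is plain Python using only the standard library, not the crawler's actual PHP code, and the link path and the helper names `resolve` and `in_scope` are hypothetical. It only shows why the base URL used for relative links matters after a redirect, and what a scope check on the final URL could look like.

```python
from urllib.parse import urljoin, urlparse

# URLs modelled on the report above; the relative link is a made-up example.
requested_url = "https://rankletter.com/contact"  # URL the crawler asked for
final_url = "https://peterthaleikis.com/blog"     # URL the redirect ended on
relative_link = "/articles/some-post"             # hypothetical link on the blog page


def resolve(base_url: str, href: str) -> str:
    """Resolve a (possibly relative) link against a base URL."""
    return urljoin(base_url, href)


def in_scope(url: str, crawl_host: str) -> bool:
    """Hypothetical scope check: is the URL on the host being crawled?"""
    return urlparse(url).hostname == crawl_host


# Resolving against the requested URL makes the link look like it is on rankletter.com.
wrong = resolve(requested_url, relative_link)

# Resolving against the final URL shows where the link really points.
right = resolve(final_url, relative_link)

print(wrong, in_scope(wrong, "rankletter.com"))
print(right, in_scope(right, "rankletter.com"))
print(final_url, in_scope(final_url, "rankletter.com"))
```

Under these assumptions, a check along the lines of `in_scope(final_url, ...)` before extracting links would skip the redirected page entirely. Whether that belongs in CrawlRequestFulfilled or somewhere else is a guess on my part.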

Just thought I'd let you know and check whether this is known (and maybe the reason for switching redirects off in the first place?)

Cheers, Peter

freekmurze commented 3 years ago

To be honest, it's been a while since I coded that part up, and I don't know anymore if it's intended or not 😬

spekulatius commented 3 years ago

That's fair enough. Totally get the "Ehm, yeah, maybe, maybe not. I can't remember" feeling 😄