spatie / crawler

An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.
https://freek.dev/308-building-a-crawler-in-php
MIT License

Fix for crawler stopping issue #451

Closed (buismaarten closed this 9 months ago)

buismaarten commented 9 months ago

In my opinion, no additional hostname-matching code is needed when the CrawlSubdomains profile is not used. Is there a specific reason this code is included?

Fixes issue #450

freekmurze commented 9 months ago

I think the current code is necessary: in normal circumstances we want to stop crawling when we're no longer on our own domain, but if CrawlSubdomains is used, crawling outside the exact host is allowed.
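For readers following along, the behaviour described here can be sketched roughly as follows. This is a standalone illustration, not the library's code: `shouldStopForHost` and its `$subdomainsAllowed` flag are hypothetical names, and it assumes the check simply compares the host of the current URL with the host the crawl started on.

```php
<?php

declare(strict_types=1);

// Hypothetical helper, not part of spatie/crawler. Returns true when
// processing should stop for $currentUrl because its host differs from the
// host the crawl started on, unless a subdomain-style profile is active.
function shouldStopForHost(string $baseUrl, string $currentUrl, bool $subdomainsAllowed): bool
{
    if ($subdomainsAllowed) {
        // Corresponds to the "CrawlSubdomains is used" case: no host check.
        return false;
    }

    $baseHost = parse_url($baseUrl, PHP_URL_HOST);
    $currentHost = parse_url($currentUrl, PHP_URL_HOST);

    // Any other profile: stop as soon as the host no longer matches exactly.
    return $baseHost !== $currentHost;
}

var_dump(shouldStopForHost('https://example.com/', 'https://example.com/about', false)); // bool(false)
var_dump(shouldStopForHost('https://example.com/', 'https://other.test/', false));       // bool(true)
var_dump(shouldStopForHost('https://example.com/', 'https://blog.example.com/', true));  // bool(false)
```

Under this assumption, a profile that accepts every URL would still be cut off at the first foreign host, which is the situation the rest of this thread discusses.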

buismaarten commented 9 months ago

Does the URL still need to be skipped when the CrawlAllUrls profile is used? The crawler normally works fine when we're not on our own domain.

freekmurze commented 9 months ago

Yes, it is needed. Otherwise the crawler would start crawling other domains and never stop.

buismaarten commented 9 months ago

Maybe I don't fully understand it, but isn't that the reason why we can specify a depth or a limit while crawling?

freekmurze commented 9 months ago

No, the idea of the Crawler is that you only crawl one domain and, by default, don't go outside that domain.

buismaarten commented 9 months ago

Would it be an option to make the if-statement configurable? The current crawl profiles cannot solve this problem at the moment.

freekmurze commented 9 months ago

Feel free to send a PR with a concrete proposal and tests showing how to handle this better.