spatie / crawler

An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.
https://freek.dev/308-building-a-crawler-in-php
MIT License

Best way to limit similar urls #376

Closed · aaronbauman closed this 2 years ago

aaronbauman commented 2 years ago

Use case: I'm using a crawler to build a visual regression test battery, and I want to make it efficient.

So, I want to tell the crawler to limit similar URLs:

- I want to crawl all top-level URLs
- For each sub-directory, I only want 3 sub-pages
  - For example, I want to collect /about, /contact, /jobs, /news, and /blog
  - But given a set of job listings /jobs/1, /jobs/2, /jobs/3, /jobs/4, /jobs/5, /jobs/6, I only want the first 3

Not sure where to start with this: would you suggest a crawl profile, a crawl queue, or something else? Thanks

Redominus commented 2 years ago

Have you tried using the maximum crawl depth config?
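
That option is a one-liner on the crawler itself; a minimal sketch (the depth and start URL below are just placeholders):

```php
use Spatie\Crawler\Crawler;

// Follow links at most two levels below the start URL.
Crawler::create()
    ->setMaximumDepth(2)
    ->startCrawling('https://example.com');
```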

aaronbauman commented 2 years ago

Thanks for the follow-up. I did try max crawl depth, but I actually do not want to limit depth, just breadth for any particular subdirectory.

The solution I came up with is a relatively thin implementation of Spatie\Crawler\CrawlQueue\CrawlQueue; here's the code in case it helps someone else: https://gist.github.com/aaronbauman/863c781f48572e644ca6b26d451653a6
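
The general shape of that approach, sketched here rather than copied from the gist: extend the built-in ArrayCrawlQueue and only override add(), assuming its add(CrawlUrl $crawlUrl): CrawlQueue signature (the namespace is Spatie\Crawler\CrawlQueue in older releases and Spatie\Crawler\CrawlQueues in newer ones). The class name, the $maxPerDirectory parameter, and the example.com start URL are placeholders.

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlUrl;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue; // Spatie\Crawler\CrawlQueue\... in older versions
use Spatie\Crawler\CrawlQueues\CrawlQueue;

// Hypothetical queue: every top-level URL is queued, but only the first few
// URLs discovered under each sub-directory make it into the queue.
class BreadthLimitedCrawlQueue extends ArrayCrawlQueue
{
    /** @var array<string, int> number of accepted URLs per first path segment */
    protected array $counts = [];

    protected int $maxPerDirectory;

    public function __construct(int $maxPerDirectory = 3)
    {
        $this->maxPerDirectory = $maxPerDirectory;
    }

    public function add(CrawlUrl $crawlUrl): CrawlQueue
    {
        // Don't count a URL the queue already knows about.
        if ($this->has($crawlUrl)) {
            return $this;
        }

        $segments = array_values(array_filter(explode('/', $crawlUrl->url->getPath())));

        // Top-level pages such as /about or /jobs are always queued.
        if (count($segments) <= 1) {
            return parent::add($crawlUrl);
        }

        // Drop anything past the per-directory budget, e.g. /jobs/4 when the limit is 3.
        $directory = $segments[0];
        $this->counts[$directory] = ($this->counts[$directory] ?? 0) + 1;

        if ($this->counts[$directory] > $this->maxPerDirectory) {
            return $this;
        }

        return parent::add($crawlUrl);
    }
}

// Usage: hand the queue to the crawler.
Crawler::create()
    ->setCrawlQueue(new BreadthLimitedCrawlQueue(3))
    ->startCrawling('https://example.com');
```

Extending ArrayCrawlQueue keeps the pending/processed bookkeeping in the parent class, so only the enqueue decision needs custom code.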