spatie / crawler

An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
https://freek.dev/308-building-a-crawler-in-php
MIT License

Modify URLs before crawl #440

codealfa closed this issue 1 year ago

codealfa commented 1 year ago

Overview

As requested in issue #259, I also needed to modify the URLs to be crawled, but the suggested solution didn't work. The URL passed to an observer's willCrawl method is an instance of Psr\Http\Message\UriInterface, which is immutable. Further, the method doesn't return anything, so there's no easy way to modify the URL. Even if you could modify it there, it wouldn't work, because you'd be changing the id of the URL in the queue, so the crawler could no longer recognize whether the URL had already been crawled. The same URL would then get crawled over and over until the crawl limit was reached. The URL needs to be modified before it is added to the crawl queue.

Use Case

I use the crawler in a plugin I develop to optimize websites, which modifies the HTML of the site on the fly. Sometimes I need the original HTML before the plugin modifies it. I can achieve this by appending a query parameter to the URL, such as nooptimize=1, which tells the plugin not to modify the HTML on that request. So I need to add this parameter to each URL to be crawled so that I can access the unoptimized HTML while the plugin keeps optimizing it for other requests.

Solution

I implemented this by adding a new method, filterCrawlUrl, to the abstract CrawlObserver, which custom child observers can override to make modifications if required; otherwise it simply passes the same URL back to the crawler. A new method, Crawler::createCrawlUrl, takes over responsibility for creating the CrawlUrl and passes the created instance to each observer for filtering. All instances of CrawlUrl::create are replaced by Crawler::createCrawlUrl. This way, the URL is modified before it is added to the crawl queue.

Caveat

Since the links extracted from the HTML are added to the depth tree before they can be modified, you can't set a maximum depth when using the crawler this way; setting a maximum depth to any value effectively limits it to 1. You'd have to limit the number of links crawled by some other means, such as setCurrentCrawlLimit or setTotalCrawlLimit.

Conclusion

With this implementation, the crawler now allows URLs to be modified before they are crawled. It also doesn't affect existing code, as custom observers don't need to implement filterCrawlUrl unless they require it. I've added a unit test that shows links being modified as expected, and the existing tests pass without any changes.

I know it may seem a bit much, as the crawler is very popular and this doesn't seem like a much-requested feature, but it was essential for my use case, so I'm hoping it can be implemented. I'm currently using a modified fork of a previous version because I need to support PHP 7.4, but hopefully, in time, I can switch back to the main repo to benefit from added features and improvements.

freekmurze commented 1 year ago

As adding a method to an abstract class is a breaking change, I'm not going to merge this in right now.

I'll revisit this when creating a new major version of the crawler.

In the meantime, you can keep using the code from your PR.

codealfa commented 1 year ago

Understood. Thanks for reviewing, and I look forward to this feature being added sometime in the future.

Redominus commented 1 year ago

In the meantime, may I suggest pushing a URL rewrite middleware onto the crawler's Guzzle HTTP configuration?
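As a minimal sketch: Crawler::create() accepts an array of Guzzle client options, so a handler stack with a request-mapping middleware can be passed there. The nooptimize parameter below is the one from the opening post, and YourCrawlObserver is a placeholder for your own observer.

use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\Psr7\Uri;
use Psr\Http\Message\RequestInterface;
use Spatie\Crawler\Crawler;

$stack = HandlerStack::create();

// Rewrite the URI of every outgoing request just before it is sent.
$stack->push(Middleware::mapRequest(function (RequestInterface $request) {
    return $request->withUri(
        Uri::withQueryValue($request->getUri(), 'nooptimize', '1')
    );
}));

Crawler::create(['handler' => $stack])
    ->setCrawlObserver(new YourCrawlObserver()) // placeholder for your own observer
    ->startCrawling('https://example.com');

Note that with this approach the queue ids are still based on the original URLs, since the rewrite happens at the HTTP layer rather than in the crawl queue.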

freekmurze commented 1 year ago

Feel free to PR that; make sure you don't introduce breaking changes, and update the readme and the tests.

codealfa commented 1 year ago

I found a better way to modify URLs before crawling that doesn't require any changes to the core code. You can create a custom CrawlQueue, or extend an existing one, and modify the URLs there as they are added to or retrieved from the queue. For example:

// Namespaces and signatures below match recent versions of the crawler;
// older releases may differ slightly.
use GuzzleHttp\Psr7\Uri;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;
use Spatie\Crawler\CrawlQueues\CrawlQueue;
use Spatie\Crawler\CrawlUrl;

class ModifyUrlCrawlQueue extends ArrayCrawlQueue
{
    public function add(CrawlUrl $crawlUrl): CrawlQueue
    {
        // Modify the URL before it is stored, so the queue id is based on the modified URL.
        return parent::add($this->modifyUrl($crawlUrl));
    }

    public function has($crawlUrl): bool
    {
        // Apply the same modification here so lookups match what add() stored.
        return parent::has($this->modifyUrl($crawlUrl));
    }

    /**
     * @param CrawlUrl|UriInterface $crawlUrl
     * @return CrawlUrl|UriInterface
     */
    private function modifyUrl($crawlUrl)
    {
        // Your implementation. As an example, append the query parameter from the opening post.
        if ($crawlUrl instanceof CrawlUrl) {
            return CrawlUrl::create(
                Uri::withQueryValue($crawlUrl->url, 'nooptimize', '1'),
                $crawlUrl->foundOnUrl
            );
        }

        return Uri::withQueryValue($crawlUrl, 'nooptimize', '1');
    }
}

Then you just use your custom CrawlQueue in the crawler (a short wiring sketch follows below). This way, the caveat above doesn't apply, since the ids in the queue are based on the modified URLs. There's no need to revisit this: I've deleted my branch, I'm using the main repo with a similar implementation, and everything works well.
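For reference, wiring the custom queue into the crawler could look like this (YourCrawlObserver is again a placeholder for your own observer):

use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlQueue(new ModifyUrlCrawlQueue())
    ->setCrawlObserver(new YourCrawlObserver()) // placeholder for your own observer
    ->startCrawling('https://example.com');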

Thanks for considering, though, and for such a wonderful and useful product.