spatie / crawler

An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
https://freek.dev/308-building-a-crawler-in-php
MIT License
2.51k stars 357 forks source link

Depth tree structure #447

Closed Xaelp closed 10 months ago

Xaelp commented 10 months ago

Currently the $depthTree property at vendor/spatie/crawler/src/Crawler.php has nodes inserted using depth first search algorithm and I find it very inconvenient:

Let's assume we crawl a website that has 4 paths: /, /index, /terms and /terms1. We can also assume that the crawling order is the same as the order of the paths described above. Then, the $depthTree at the end of the crawl will look like:

/ 
 /index
   /index
   /terms
     /terms-1
 /terms

So, if I ask: what is the depth() of the node with value /terms-1? The answer, with the structure above, is 3. However, I think (arguably) that it should be 2. The method to compute the depth seems correct.

In my point of view, the problem is with the structure that should insert nodes using breadth first search algorithm:

/ 
 /index
   /index
   /terms
 /terms
   /terms-1

Because we start from the root /, and, on this way, we always have the shortest path till our node and the $maximumDepth setting can easily be explained as the minimum amount of clicks to get to the link.

This is an example of an implementation using the BFS:

public function addToDepthTree(UriInterface $url, UriInterface $parentUrl, Node $node = null): ?Node
    {
        if (is_null($this->maximumDepth)) {
            return new Node((string) $url);
        }

        $queue = new SplQueue(); // Use a queue for BFS
        $queue->enqueue($this->depthTree);

        while (!$queue->isEmpty()) {
            $node = $queue->dequeue();

            if ($node->getValue() === (string) $parentUrl) {
                $newNode = new Node((string) $url);
                $node->addChild($newNode);
                return $newNode;
            }

            foreach ($node->getChildren() as $currentNode) {
                $queue->enqueue($currentNode); // Enqueue children for BFS
            }
        }

        return null;
    }
Xaelp commented 10 months ago

This was moved into a discussion https://github.com/spatie/crawler/discussions/448 as it is not an issue (yet)