spatie / crawler

An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.
https://freek.dev/308-building-a-crawler-in-php
MIT License

Using a maximum depth prevents redirected links from being crawled #466

Closed · superpenguin612 closed this 1 month ago

superpenguin612 commented 2 months ago

If a maximum depth is specified while redirects are enabled and tracked, any links found under a redirect will never be visited: each link's parent URL fails to match any node in the depth tree, so the link is discarded. This happens both when the initial crawl URL redirects and when a subsequently crawled URL redirects.
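
For concreteness, here is a minimal setup that hits this (a sketch only: the URLs are hypothetical, the explicit Guzzle redirect-tracking option is an assumption, and observers are omitted for brevity):

```php
use Spatie\Crawler\Crawler;

// Suppose https://example.com/ answers with a 301 to https://www.example.com/.
// With a maximum depth set, none of the links found on the redirect target
// are ever queued: their parent URL (https://www.example.com/) is not a node
// in the depth tree.
Crawler::create(['allow_redirects' => ['track_redirects' => true]])
    ->setMaximumDepth(2)
    ->startCrawling('https://example.com/');
```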

After a redirect is followed, CrawlRequestFulfilled changes the base URL to the final URL in the redirect history (presumably so that $foundOnUrl is as accurate as possible).

```php
protected function getBaseUrl(ResponseInterface $response, CrawlUrl $crawlUrl): UriInterface
{
    $redirectHistory = $response->getHeader(RedirectMiddleware::HISTORY_HEADER);

    // no redirect happened: the base URL is simply the URL that was requested
    if (empty($redirectHistory)) {
        return $crawlUrl->url;
    }

    // a redirect happened: use the last entry in the history, i.e. the final URL
    return new Uri(end($redirectHistory));
}

// call site in __invoke():
$baseUrl = $this->getBaseUrl($response, $crawlUrl);
$this->urlParser->addFromHtml($body, $baseUrl);
```
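
For example (hypothetical URLs), suppose https://example.com/a was queued and Guzzle followed two redirects while fetching it. The header recorded by Guzzle's RedirectMiddleware contains the chain, and getBaseUrl() picks the last entry:

```php
use GuzzleHttp\Psr7\Uri;

// hypothetical redirect chain recorded for a request to https://example.com/a
$redirectHistory = ['https://example.com/b', 'https://example.com/c'];

// getBaseUrl() returns the final URL of the chain ...
$baseUrl = new Uri(end($redirectHistory));

echo $baseUrl; // https://example.com/c
// ... while the depth tree still only knows https://example.com/a
```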

This base URL is then passed as $foundOnUrl to the LinkUrlParser, which adds each extracted URL to the depth tree.

```php
public function addFromHtml(string $html, UriInterface $foundOnUrl): void
{
    $allLinks = $this->extractLinksFromHtml($html, $foundOnUrl);

    collect($allLinks)
        ->filter(fn (Url $url) => $this->hasCrawlableScheme($url))
        ->map(fn (Url $url) => $this->normalizeUrl($url))
        ->filter(function (Url $url) use ($foundOnUrl) {
            // always fails for links found behind a redirect: $foundOnUrl is
            // the redirect target, which is not a node on the depth tree
            if (! $node = $this->crawler->addToDepthTree($url, $foundOnUrl)) {
                return false;
            }

            return $this->shouldCrawl($node);
        })
        ->filter(fn (Url $url) => ! str_contains($url->getPath(), '/tel:'))
        ->each(function (Url $url) use ($foundOnUrl) {
            $crawlUrl = CrawlUrl::create($url, $foundOnUrl, linkText: $url->linkText());

            $this->crawler->addToCrawlQueue($crawlUrl);
        });
}
```

The parent (base) node on the tree, however, still holds the original, pre-redirect URL, i.e. the URL that was found on a page or supplied by the user as the initial crawl URL. Because the $foundOnUrl passed in differs from every node on the depth tree, no children are ever added.

```php
public function addToDepthTree(UriInterface $url, UriInterface $parentUrl, ?Node $node = null): ?Node
{
    if (is_null($this->maximumDepth)) {
        return new Node((string) $url);
    }

    $node = $node ?? $this->depthTree;

    $returnNode = null;

    // never true for a redirected link: $parentUrl is the redirect target,
    // which was never added to the tree
    if ($node->getValue() === (string) $parentUrl) {
        $newNode = new Node((string) $url);

        $node->addChild($newNode);

        return $newNode;
    }

    foreach ($node->getChildren() as $currentNode) {
        $returnNode = $this->addToDepthTree($url, $parentUrl, $currentNode);

        if (! is_null($returnNode)) {
            break;
        }
    }

    // so the whole walk returns null
    return $returnNode;
}
```
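
To make that concrete, here is a tiny standalone sketch (hypothetical URLs) using the same nicmart/tree nodes the crawler uses for its depth tree:

```php
use Tree\Node\Node;

// the depth tree only contains the URL that was originally queued ...
$depthTree = new Node('https://example.com/a');

// ... but after a redirect, links are reported as found on the final URL
// of the redirect chain
$parentUrl = 'https://example.com/c';

// addToDepthTree() walks the tree comparing node values to $parentUrl;
// the root does not match and has no children, so the walk finds nothing
var_dump($depthTree->getValue() === $parentUrl); // bool(false)
// => addToDepthTree() returns null and the link is discarded
```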

A fix would be to also pass through the original (pre-redirect) URL, and if that URL is on the tree, add $url as a child of its node.
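
One possible shape for that fix, as a sketch only (the extra $originalUrl parameter is an assumption, not the actual change that landed in #467):

```php
public function addToDepthTree(
    UriInterface $url,
    UriInterface $parentUrl,
    ?Node $node = null,
    ?UriInterface $originalUrl = null
): ?Node {
    if (is_null($this->maximumDepth)) {
        return new Node((string) $url);
    }

    $node = $node ?? $this->depthTree;

    // accept either the redirect target or the URL that was originally queued
    if ($node->getValue() === (string) $parentUrl
        || (! is_null($originalUrl) && $node->getValue() === (string) $originalUrl)) {
        $newNode = new Node((string) $url);

        $node->addChild($newNode);

        return $newNode;
    }

    foreach ($node->getChildren() as $currentNode) {
        $returnNode = $this->addToDepthTree($url, $parentUrl, $currentNode, $originalUrl);

        if (! is_null($returnNode)) {
            return $returnNode;
        }
    }

    return null;
}
```

CrawlRequestFulfilled would then need to hand the pre-redirect $crawlUrl->url down alongside the resolved base URL so the tree lookup can succeed.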

freekmurze commented 1 month ago

Will be fixed by #467