spatie / crawler

An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.
https://freek.dev/308-building-a-crawler-in-php
MIT License

Maximum depth not working in subsequent request #397

Closed dacdam closed 2 years ago

dacdam commented 2 years ago

Hi, I'm trying to split a crawling job into multiple subsequent requests while limiting the crawl depth. So I have a Laravel job that takes care of executing the crawl and, if the queue isn't empty once the currentCrawlLimit is reached, dispatches itself again, passing along the same instance of the queue.

The queue is my own implementation and stores its data in a database.

The problem is that the crawl never goes deeper than the root of the link tree.

Looking deeper into the Crawler.php class, I found that the startCrawling method, which is invoked on every job execution, creates a new instance of Node that gets stored in the depthTree attribute.

I think this is the problem, since that attribute should be preserved between subsequent requests. Am I wrong?
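To illustrate, here is a simplified sketch of what I understand startCrawling to be doing (trimmed and paraphrased, not the exact library source):

// Paraphrased sketch of Spatie\Crawler\Crawler::startCrawling(), trimmed to the relevant part
public function startCrawling($baseUrl)
{
    if (is_string($baseUrl)) {
        $baseUrl = new Uri($baseUrl);
    }

    $this->baseUrl = $baseUrl;

    // A fresh root Node is created on every call, so any depth tree built
    // during a previous job run is thrown away.
    $this->depthTree = new Node((string) $this->baseUrl);

    // ... the base url is pushed onto the crawl queue and crawling starts
}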

This is the job's source code:

<?php

namespace App\Jobs;

use App\Crawler\CrawlObserver;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

use App\Crawler\DatabaseCrawlQueue;
use App\Models\Analisi;
use Illuminate\Support\Facades\Log;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlInternalUrls;

class AnalisiPagine implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    protected $analisi;
    protected $codaCrawler;

    /**
     * Create a new job instance.
     *
     * @return void
     */
    public function __construct(Analisi $analisi, DatabaseCrawlQueue $codaCrawler)
    {
        $this->analisi = $analisi;
        $this->codaCrawler = $codaCrawler;
    }

    /**
     * Execute the job.
     *
     * @return void
     */
    public function handle()
    {

        // Run the crawler
        Crawler::create([
            "cookies" => true,
            "connect_timeout" => 60,
            "read_timeout" => 60,
            "timeout" => 60,
            "allow_redirects" => false,
            "headers" => [
                "User-Agent" => "*"
            ]
        ])
        ->setCrawlObserver(new CrawlObserver($this->analisi))
        ->executeJavaScript()
        ->setCrawlQueue($this->codaCrawler)
        ->setTotalCrawlLimit($this->analisi->max_pagine)
        ->setMaximumDepth($this->analisi->max_profondita)
        ->setCurrentCrawlLimit(1)
        ->setCrawlProfile(new CrawlInternalUrls($this->analisi->indirizzo))
        ->setParseableMimeTypes(['text/html', 'text/plain'])
        ->setMaximumResponseSize(1024 * 1024 * 5)
        ->setDelayBetweenRequests(500)
        ->ignoreRobots()
        ->startCrawling($this->analisi->indirizzo);

        // If crawling isn't finished yet, dispatch another instance of this job
        if ($this->codaCrawler->hasPendingUrls()) {
            self::dispatch($this->analisi, $this->codaCrawler);
        }

    }
}
spatie-bot commented 2 years ago

Dear contributor,

because this issue seems to be inactive for quite some time now, I've automatically closed it. If you feel this issue deserves some attention from my human colleagues feel free to reopen it.

hettiger commented 6 months ago

I'm having the same issue. The depth tree is rebuilt on each subsequent request via the URL parser. However, the URL parser only takes into account the currently crawled HTML document; it does not add previously crawled URLs to the depth tree, as far as I can tell. So I think @dacdam is right here: maximum depth is not working in subsequent requests. The docs should be updated or support should be added.

hettiger commented 6 months ago

I found a workaround. I've extended the LinkUrlParser. Here's the overridden constructor that rebuilds the depth tree on each subsequent request:

public function __construct(Crawler $crawler)
{
    parent::__construct($crawler);

    // Replay every URL discovered in previous runs into the depth tree,
    // so the depth information survives across subsequent requests.
    CrawlUrl::each(function (CrawlUrl $crawlUrl) {
        if (! $crawlUrl->found_on_url) {
            return;
        }

        $this->crawler->addToDepthTree(
            new Uri($crawlUrl->url),
            new Uri($crawlUrl->found_on_url)
        );
    });
}

CrawlUrl is an Eloquent model; I'm using a custom-built database crawl queue.
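
In case it helps: this is roughly how the extended parser can be wired into the crawler. RestoringLinkUrlParser is just a placeholder name for the subclass with the constructor above, and $databaseCrawlQueue / $url stand in for your own queue instance and start URL; setUrlParserClass() should be available in recent versions of the crawler.

use App\Crawler\RestoringLinkUrlParser; // placeholder name for the parser subclass shown above
use Spatie\Crawler\Crawler;

Crawler::create()
    // Swap in the parser whose constructor replays the stored URLs into the depth tree.
    ->setUrlParserClass(RestoringLinkUrlParser::class)
    ->setCrawlQueue($databaseCrawlQueue) // your database-backed crawl queue instance
    ->setMaximumDepth(3)
    ->setCurrentCrawlLimit(1)
    ->startCrawling($url);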