spatie / laravel-link-checker

Check all links in a Laravel application
https://murze.be/2015/11/a-package-to-check-all-links-in-a-laravel-app/
MIT License

ShouldCrawl goes into a loop if false #27

Closed crivion closed 6 years ago

crivion commented 7 years ago

Hi there

I'm building a broken link checker and trying to implement my own CrawlProfile.

If you return false from shouldCrawl(), the crawler goes into a loop:

```php
// how often to check again, in hours
$howOftenToCheckAgainInHours = 72;

// hours since the last check
$hoursSinceLastCheck = $link->updated_at->diffInHours(\Carbon\Carbon::now());

// hours until the next check
$hoursTillNextCheck = $howOftenToCheckAgainInHours - $hoursSinceLastCheck;

// if the URL's last check is older than the recheck interval
if ($hoursSinceLastCheck >= $howOftenToCheckAgainInHours) {
    // update last checked
    $link->touch();

    echo PHP_EOL . "\033[36m [RECHECKING] \033[0m" . $url->__toString() . PHP_EOL;
    return true;
} else {
    echo PHP_EOL . "\033[35m [SKIP: $hoursTillNextCheck hrs till next check] \033[0m " . $url->__toString() . PHP_EOL;
    return false;
}
```

freekmurze commented 7 years ago

Could you PR a failing test?

crivion commented 7 years ago

Hi @freekmurze

Thanks a lot for getting back.

Sorry, I have no knowledge of testing with PHPUnit, if that's what you meant.

The thing is, when it reaches a URL that should be skipped until a later date, it tries to parse that URL again and again rather than skipping it and continuing to the next URL.
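The stuck-on-one-URL behavior described above matches a classic crawl-loop bug: a rejected URL that is never removed from (or is re-added to) the work queue gets examined forever. Here is a minimal, language-agnostic sketch in Python of a crawl loop that handles rejection correctly; this is an illustration of the suspected bug class, not the actual spatie/crawler internals, and `should_crawl`/`fetch_links` are hypothetical callbacks standing in for the CrawlProfile and the HTML parser.

```python
from collections import deque

def crawl(start_urls, should_crawl, fetch_links):
    """Minimal crawl loop. The key point: a URL is dequeued and marked
    seen *before* the profile is consulted, so a rejected URL can never
    be examined twice. A buggy loop that leaves the rejected URL in the
    queue (or re-appends it) would spin on it forever."""
    queue = deque(start_urls)
    seen = set()
    crawled = []
    while queue:
        url = queue.popleft()       # always dequeue, even if rejected
        if url in seen:
            continue
        seen.add(url)               # mark seen before the profile check
        if not should_crawl(url):
            continue                # skip and move on to the next URL
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:
                queue.append(link)
    return crawled
```

With an in-memory link graph where `should_crawl` rejects "b", the loop skips "b" once and terminates instead of looping on it.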

Any ideas?

```php
namespace App;

use Spatie\Crawler\CrawlProfile;
use Spatie\Crawler\Url;

class CrawlByLastUpdatedAt implements CrawlProfile
{
    public function shouldCrawl(Url $url, Url $foundOnUrl = null): bool
    {
        $link = \App\Link::where('source', $url->__toString())->first();

        $isNew = "\033[33m - NEW";

        if ($url->__toString() == 'http://geoip.crivion.com/') {
            $isNew = "";
        }

        // if the URL already exists in the database
        if ($link && $url->__toString() != 'http://geoip.crivion.com/') {
            // how often to check again, in hours
            $howOftenToCheckAgainInHours = 72;

            // hours since the last check
            $hoursSinceLastCheck = $link->updated_at->diffInHours(\Carbon\Carbon::now());

            // hours until the next check
            $hoursTillNextCheck = $howOftenToCheckAgainInHours - $hoursSinceLastCheck;

            // if the URL's last check is older than the recheck interval
            if ($hoursSinceLastCheck >= $howOftenToCheckAgainInHours) {
                // update last checked
                $link->touch();

                echo PHP_EOL . "\033[36m [RECHECKING] \033[0m" . $url->__toString() . PHP_EOL;
                return true;
            } else {
                echo PHP_EOL . "\033[35m [SKIP: $hoursTillNextCheck hrs till next check] \033[0m " . $url->__toString() . PHP_EOL;
                return false;
            }
        }

        echo PHP_EOL . "\033[32m [CRAWLING] $isNew \033[0m " . $url->__toString() . PHP_EOL;

        return true;
    }
}
```
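The recheck-interval decision in the profile above boils down to one comparison: recheck when at least 72 hours have elapsed since `updated_at`, otherwise skip. A minimal re-expression of that timing logic in Python (the constant mirrors `$howOftenToCheckAgainInHours`; the function name is illustrative, not part of any package API):

```python
from datetime import datetime

RECHECK_INTERVAL_HOURS = 72  # mirrors $howOftenToCheckAgainInHours above

def should_recheck(last_checked: datetime, now: datetime) -> bool:
    """Return True when at least the recheck interval has elapsed
    since the last check, False otherwise (i.e. skip the URL)."""
    hours_since = (now - last_checked).total_seconds() / 3600
    return hours_since >= RECHECK_INTERVAL_HOURS
```

Note the comparison is `>=`, so a link checked exactly 72 hours ago is rechecked, matching the PHP condition.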

introwit commented 6 years ago

@crivion Hey :) A few things I would suggest:

crivion commented 6 years ago

Hey @introwit

I'm on Linux.

The escape characters are used for coloring PHP CLI output (they work in all CLIs).

The issue is package-specific, as it goes into a loop when trying to skip a URL. My code has already been shared.

Anyway, I've dropped the package and am writing my own crawler.

So when shouldCrawl() returns false, the crawler goes into a loop and gets stuck on the same URL instead of continuing to the next collection item.

Thanks.