Closed crivion closed 6 years ago
Could you PR a failing test?
Hi @freekmurze
Thanks a lot for getting back.
Sorry, I have no experience with PHPUnit testing, if that's what you meant.
The thing is, when the crawler reaches a URL that should be skipped until a later date, it tries to parse that URL again and again rather than skipping it and continuing on to the next URL.
Any ideas?
```php
namespace App;

use Spatie\Crawler\CrawlProfile;
use Spatie\Crawler\Url;

class CrawlByLastUpdatedAt implements CrawlProfile
{
    public function shouldCrawl(Url $url, Url $foundOnUrl = null): bool
    {
        $link = \App\Link::where('source', $url->__toString())->first();

        $isNew = "\033[33m - NEW";

        if ($url->__toString() == 'http://geoip.crivion.com/') {
            $isNew = "";
        }

        // if the URL already exists in the database
        if ($link && $url->__toString() != 'http://geoip.crivion.com/') {
            // how often to check again, in hours
            $howOftenToCheckAgainInHours = 72;

            // hours since the last check
            $hoursSinceLastCheck = $link->updated_at->diffInHours(\Carbon\Carbon::now());

            // hours until the next check is due
            $hoursTillNextCheck = $howOftenToCheckAgainInHours - $hoursSinceLastCheck;

            // recheck only if the last check is older than the interval
            if ($hoursSinceLastCheck >= $howOftenToCheckAgainInHours) {
                // update the last-checked timestamp
                $link->touch();
                echo PHP_EOL . "\033[36m [RECHECKING] \033[0m" . $url->__toString() . PHP_EOL;
                return true;
            } else {
                echo PHP_EOL . "\033[35m [SKIP: $hoursTillNextCheck hrs till next check] \033[0m " . $url->__toString() . PHP_EOL;
                return false;
            }
        }

        echo PHP_EOL . "\033[32m [CRAWLING] $isNew \033[0m " . $url->__toString() . PHP_EOL;
        return true;
    }
}
```
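The 72-hour interval arithmetic in the profile can be checked in isolation. Below is a standalone sketch of that logic using plain `DateTimeImmutable` instead of Carbon and Eloquent; the function names and timestamps are made up for illustration, only the comparison mirrors the profile above.

```php
<?php
// Standalone sketch of the 72-hour recheck arithmetic, using plain
// DateTimeImmutable instead of Carbon/Eloquent. Timestamps are illustrative.

function hoursSinceLastCheck(DateTimeImmutable $updatedAt, DateTimeImmutable $now): int
{
    // whole hours elapsed between the last check and now
    return intdiv($now->getTimestamp() - $updatedAt->getTimestamp(), 3600);
}

function shouldRecheck(DateTimeImmutable $updatedAt, DateTimeImmutable $now, int $intervalHours = 72): bool
{
    // recheck only when the last check is older than the interval
    return hoursSinceLastCheck($updatedAt, $now) >= $intervalHours;
}

$now = new DateTimeImmutable('2018-06-01 12:00:00');

// checked 80 hours ago -> due for a recheck
var_dump(shouldRecheck($now->sub(new DateInterval('PT80H')), $now)); // bool(true)

// checked 10 hours ago -> skip; 62 hours till the next check
var_dump(shouldRecheck($now->sub(new DateInterval('PT10H')), $now)); // bool(false)
```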
@crivion Hey :) A few things I would suggest: why the `\033[32m` escape sequences in the code snippet? Are you on Windows? These characters usually appear in the Windows terminal.
Hey @introwit
I'm on Linux
Those escape sequences are used for coloring PHP CLI output (they work in all terminals).
The issue is package-specific: the crawler goes into a loop when trying to skip a URL. My code is shared above.
Anyway, I've given up on the package and am writing my own crawler.
In short: when shouldCrawl() returns false, the crawler goes into a loop and gets stuck on the same URL instead of continuing to the next item in the collection.
Thanks.
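For reference, those sequences are standard ANSI SGR color codes, not Windows artifacts. A minimal demo (works in any ANSI-capable terminal; the helper name `colorize` is my own, not from the package):

```php
<?php
// Minimal demo of the ANSI color escape sequences used in the profile above.
// "\033" (ESC) starts the sequence, the number selects a color, 0 resets.

const GREEN  = "\033[32m";
const YELLOW = "\033[33m";
const RESET  = "\033[0m";

// Wrap text in a color code plus a reset so later output is unaffected.
function colorize(string $text, string $color): string
{
    return $color . $text . RESET;
}

echo colorize('[CRAWLING]', GREEN) . ' http://example.com' . PHP_EOL;
echo colorize('[SKIP]', YELLOW) . ' http://example.com' . PHP_EOL;
```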
Hi there
I'm building a broken-link checker and trying to implement my own CrawlProfile.
If you return false from shouldCrawl(), the crawler goes into a loop.
```php
// how often to check again, in hours
$howOftenToCheckAgainInHours = 72;
```
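The reported symptom can be modeled with a toy crawl loop. This is NOT spatie/crawler's actual internals, just an illustration of the mechanics: if a queue only removes URLs that pass shouldCrawl(), a rejected URL is re-examined forever; always dequeuing, whatever the profile answers, avoids the loop.

```php
<?php
// Toy model of the reported symptom -- not the package's real internals.
// The fix modeled here: always dequeue, even when the profile says "skip".

function crawlQueue(array $urls, callable $shouldCrawl): array
{
    $crawled = [];

    while ($urls !== []) {
        $url = array_shift($urls); // always dequeue, even when skipping

        if (!$shouldCrawl($url)) {
            continue; // skip and move on to the next URL
        }

        $crawled[] = $url;
    }

    return $crawled;
}

$shouldCrawl = fn (string $url) => $url !== 'http://example.com/skip-me';

$result = crawlQueue(
    ['http://example.com/a', 'http://example.com/skip-me', 'http://example.com/b'],
    $shouldCrawl
);

print_r($result); // contains /a and /b; /skip-me was skipped without looping
```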