subscribe 👍️
As you can see here: https://github.com/mvdbos/php-spider/blob/master/example/example_complex.php#L50, you can set your own persistence handler on the Downloader. This handler receives a Resource for every URL fetched, which includes all HTTP headers. With a persistence handler, you can get a long way toward a link checker: just save everything, then check the return code for each resource fetched.
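For illustration, here is a rough sketch of what such a handler could look like. The class and property names are made up, and it assumes the persist(Resource $resource) method and the Resource getters used elsewhere in the examples, so check the interfaces in your version:

use VDB\Spider\PersistenceHandler\MemoryPersistenceHandler;
use VDB\Spider\Resource;

// Hypothetical handler that records the HTTP status code of every fetched URL
// while still persisting the resource in memory.
class StatusRecordingPersistenceHandler extends MemoryPersistenceHandler
{
    /** @var array<string, int> map of URI string => HTTP status code */
    public $statusCodes = array();

    public function persist(Resource $resource)
    {
        // The Resource wraps the HTTP response, so the status code (and all headers) are available here.
        $this->statusCodes[$resource->getUri()->toString()] = $resource->getResponse()->getStatusCode();
        parent::persist($resource);
    }
}

// $spider->getDownloader()->setPersistenceHandler(new StatusRecordingPersistenceHandler());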
It has some limitations though. IIRC, a Resource is only returned for successful requests. I don't remember exactly when an exception is thrown here, which would prevent a resource from being returned. This depends on http://docs.guzzlephp.org/en/stable/quickstart.html#exceptions.
If you want to see all response types, including failed ones that are currently caught too greedily by that try/catch, then you probably want to change Downloader.fetchResource() (https://github.com/mvdbos/php-spider/blob/master/src/VDB/Spider/Downloader/Downloader.php#L124) so that it is smarter about these exceptions and always returns a resource. Fortunately, you can create your own Downloader by extending this one, and tell the Spider to use it with Spider.setDownloader(); a sketch follows.
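Just as a sketch, such a subclass could look roughly like this. It assumes fetchResource() is overridable with this signature in your version, that the Downloader exposes getRequestHandler(), and that Guzzle's RequestException carries the failed response:

use GuzzleHttp\Exception\RequestException;
use VDB\Spider\Downloader\Downloader;
use VDB\Spider\Resource;
use VDB\Spider\Uri\DiscoveredUri;

// Hypothetical Downloader that still returns a Resource for 4xx/5xx responses.
class LenientDownloader extends Downloader
{
    protected function fetchResource(DiscoveredUri $uri)
    {
        try {
            return $this->getRequestHandler()->request($uri);
        } catch (RequestException $e) {
            // Guzzle attaches the response for HTTP error codes; wrap it so it still gets persisted.
            if ($e->getResponse() !== null) {
                return new Resource($uri, $e->getResponse());
            }
            // No response at all (e.g. a connection failure): fall back to returning false.
            return false;
        }
    }
}

// $spider->setDownloader(new LenientDownloader());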
Alternatively, feel free to create a PR that improves the core Downloader.
I just had another quick look at it. You actually only have to make one simple change to get a saved Resource for any link, even if it is broken.
First, extend GuzzleRequestHandler and override the request() method. Instead of this:
public function request(DiscoveredUri $uri)
{
    $response = $this->getClient()->get($uri->toString());
    return new Resource($uri, $response);
}
It should look like this. Note the added option on the Guzzle request: it prevents Guzzle from throwing an exception on failed requests and makes it return the response with its status code instead:
public function request(DiscoveredUri $uri)
{
    $response = $this->getClient()->get($uri->toString(), ['http_errors' => false]);
    return new Resource($uri, $response);
}
Then set your RequestHandler on the Downloader like so: $spider->getDownloader()->setRequestHandler($yourRequestHandler).
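For reference, the whole thing wired together could look something like this (the class name is just an example, and it assumes the setRequestHandler() method discussed below is available):

use VDB\Spider\RequestHandler\GuzzleRequestHandler;
use VDB\Spider\Resource;
use VDB\Spider\Spider;
use VDB\Spider\Uri\DiscoveredUri;

// Example handler that never throws on HTTP error status codes.
class LinkCheckRequestHandler extends GuzzleRequestHandler
{
    public function request(DiscoveredUri $uri)
    {
        // 'http_errors' => false makes Guzzle return 4xx/5xx responses instead of throwing.
        $response = $this->getClient()->get($uri->toString(), ['http_errors' => false]);
        return new Resource($uri, $response);
    }
}

$spider = new Spider('http://example.com');
$spider->getDownloader()->setRequestHandler(new LinkCheckRequestHandler());
$spider->crawl();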
Unfortunately, RequestHandlerInterface currently has no setRequestHandler method. I will add that soon.
Done. Also added example/example_link_check.php so you can see how to use it.
Is this suitable to use as a base for developing a link checker?
I've given it a quick go, but can't find an easy way to get the response code for each link found?
I'd be wanting to create a report of 404s, 500s, etc. that need attention; 200s for the sitemap; 301s and 302s that maybe need fixing...
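One possible approach (just a sketch, assuming the downloader exposes getPersistenceHandler() and that the handler is iterable and yields Resource objects, as MemoryPersistenceHandler does) is to group the persisted resources by status code after the crawl:

// Run the crawl first, e.g. with the non-throwing request handler shown above.
$spider->crawl();

$report = array();
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    $status = $resource->getResponse()->getStatusCode();
    $report[$status][] = $resource->getUri()->toString();
}

// 200s could feed the sitemap, 301/302 may need fixing, 404/500 need attention.
ksort($report);
foreach ($report as $status => $uris) {
    echo $status . ': ' . count($uris) . ' URL(s)' . PHP_EOL;
}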