mvdbos / php-spider

A configurable and extensible PHP web spider
MIT License

suitable as link checker? #63

Closed bobemoe closed 4 years ago

bobemoe commented 4 years ago

Is this suitable to use as a base for developing a link checker?

I've given it a quick go, but can't find an easy way to get the response code for each link found.

I'd want to create a report of 404s, 500s etc. that need attention, 200s for the sitemap, and 301s and 302s that may need fixing.

spekulatius commented 4 years ago

subscribe 👍️

mvdbos commented 4 years ago

As you can see here: https://github.com/mvdbos/php-spider/blob/master/example/example_complex.php#L50, you can set your own persistence handler on the Downloader. This handler receives a Resource for every URL fetched, which includes all HTTP headers. A persistence handler gets you a long way toward a link checker: just save everything, and then check the return code of each resource fetched.
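As a rough sketch of the "check return codes" step, the function below groups saved status codes into report buckets. It is plain PHP and does not touch php-spider: the `buildLinkReport` name, the bucket names, and the example URLs are all made up for illustration, and it assumes you have already collected a URL-to-status-code map from your persistence handler.

```php
<?php
declare(strict_types=1);

/**
 * Group crawled URLs by the class of their HTTP status code.
 * Hypothetical helper, not part of php-spider.
 *
 * @param array<string,int> $statusByUrl URL => status code, as saved during the crawl
 * @return array<string,array<string,int>> bucket name => (URL => status code)
 */
function buildLinkReport(array $statusByUrl): array
{
    // Bucket names are an assumption; rename them to suit your report.
    $report = [
        'ok' => [],
        'redirect' => [],
        'client_error' => [],
        'server_error' => [],
        'other' => [],
    ];

    foreach ($statusByUrl as $url => $status) {
        if ($status >= 200 && $status < 300) {
            $bucket = 'ok';            // candidates for the sitemap
        } elseif ($status >= 300 && $status < 400) {
            $bucket = 'redirect';      // e.g. 301/302 that may need fixing
        } elseif ($status >= 400 && $status < 500) {
            $bucket = 'client_error';  // e.g. 404
        } elseif ($status >= 500 && $status < 600) {
            $bucket = 'server_error';  // e.g. 500
        } else {
            $bucket = 'other';         // anything outside 2xx-5xx
        }
        $report[$bucket][$url] = $status;
    }

    return $report;
}

// Example input with placeholder URLs.
$report = buildLinkReport([
    'https://example.test/' => 200,
    'https://example.test/old' => 301,
    'https://example.test/missing' => 404,
    'https://example.test/broken' => 500,
]);

foreach ($report as $bucket => $links) {
    printf("%s: %d\n", $bucket, count($links));
}
```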

This approach has some limitations though. IIRC, a Resource is only returned for successful requests. I don't remember exactly when an exception is thrown here, which would prevent a resource from being returned. This depends on Guzzle's exception handling: http://docs.guzzlephp.org/en/stable/quickstart.html#exceptions.

If you want to see all response types, including the failed ones that are currently caught too greedily by that try/catch, then you probably want to change Downloader.fetchResource() (https://github.com/mvdbos/php-spider/blob/master/src/VDB/Spider/Downloader/Downloader.php#L124) so that it is smarter about these exceptions and always returns a resource. Fortunately, you can create your own Downloader by extending this one, and tell the Spider to use it with Spider.setDownloader().

Alternatively, feel free to create a PR that improves the core Downloader.

mvdbos commented 4 years ago

I just had another quick look at it. You actually only have to make one simple change to get a saved Resource for any link, even if it is broken.

First, extend GuzzleRequestHandler and override the request() method. The original looks like this:

    public function request(DiscoveredUri $uri)
    {
        $response = $this->getClient()->get($uri->toString());
        return new Resource($uri, $response);
    }

Your override should look like this. Note the option added to the Guzzle request. It prevents Guzzle from throwing an exception on 4xx and 5xx responses; the response with that status code is returned instead:

    public function request(DiscoveredUri $uri)
    {
        $response = $this->getClient()->get($uri->toString(), ['http_errors' => false]);
        return new Resource($uri, $response);
    }

Then set your RequestHandler on the Downloader like so: $spider->getDownloader()->setRequestHandler($yourRequestHandler).

Unfortunately, RequestHandlerInterface currently has no setRequestHandler method. I will add that soon.
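To see the effect of the `http_errors` option in isolation, here is a minimal sketch that uses Guzzle's MockHandler, so no real request is made. It assumes Guzzle 6 or 7 is installed through Composer; the URL and variable names are placeholders and not part of php-spider.

```php
<?php
// Assumes `composer require guzzlehttp/guzzle` has been run in this directory.
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\ClientException;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;

// Queue two canned 404 responses, one per request below.
$mock = new MockHandler([new Response(404), new Response(404)]);
$client = new Client(['handler' => HandlerStack::create($mock)]);

// Default behaviour (http_errors => true): a 4xx response raises ClientException.
$thrownStatus = null;
try {
    $client->get('http://example.test/missing');
} catch (ClientException $e) {
    $thrownStatus = $e->getResponse()->getStatusCode();
}

// With http_errors => false the same response is returned instead of thrown.
$response = $client->get('http://example.test/missing', ['http_errors' => false]);
$returnedStatus = $response->getStatusCode();

printf("thrown: %s, returned: %s\n", var_export($thrownStatus, true), $returnedStatus);
```

Two caveats for a link checker: `http_errors` only covers 4xx and 5xx responses, so connection-level failures such as DNS errors or timeouts still throw. Guzzle also follows redirects by default, so to record 301s and 302s themselves you would probably need to adjust the `allow_redirects` request option as well.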

mvdbos commented 4 years ago

> Unfortunately, RequestHandlerInterface currently has no setRequestHandler method. I will add that soon.

Done. Also added example/example_link_check.php so you can see how to use it.