mvdbos / php-spider

A configurable and extensible PHP web spider
MIT License

Is it possible to skip creation of the results files and just report if the links are valid? #103

Open dingo-d opened 5 months ago

dingo-d commented 5 months ago

Hi.

I'm wondering if it's possible to use the link checker example to just check for valid links, and maybe store the results in a JSON or CSV file instead of creating binary files and index.html files inside the results folder?

Should I try to create my own persistence handler for this?

Basically, I'd just like to crawl my site to check whether there are any 404 pages on it. I'm not necessarily interested in whether any of the links on a page return 404; I just need to check that all my pages are healthy.
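
Roughly the shape of the script I have in mind (just a sketch based on the README example; I'm assuming the default in-memory persistence handler can be iterated the way the README shows, and example.com is a placeholder):

<?php

require __DIR__ . '/vendor/autoload.php';

use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;

// Crawl the site, following every <a> link up to a limited depth.
$spider = new Spider('https://example.com');
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer('//a'));
$spider->getDiscovererSet()->maxDepth = 2;

$spider->crawl();

// Collect URL => HTTP status code for every downloaded resource
// and dump the result as a simple JSON report.
$report = [];
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    $report[$resource->getUri()->toString()] = $resource->getResponse()->getStatusCode();
}

file_put_contents('report.json', json_encode($report, JSON_PRETTY_PRINT));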

dingo-d commented 5 months ago

I created a JsonPersistenceHandler.php

<?php

use VDB\Spider\PersistenceHandler\FilePersistenceHandler;
use VDB\Spider\PersistenceHandler\PersistenceHandlerInterface;
use VDB\Spider\Resource;

class JsonPersistenceHandler extends FilePersistenceHandler implements PersistenceHandlerInterface
{
    protected string $defaultFilename = 'data.json';

    #[\Override]
    public function persist(Resource $resource)
    {
        $file = $this->getResultPath() . $this->defaultFilename;

        // Check if file exists.
        if (!file_exists($file)) {
            // Create file if it doesn't exist.
            $fileHandler = fopen($file, 'w');
            $results = [];
        } else {
            // Open file if it exists.
            $fileHandler = fopen($file, 'c+');

            // Check if file is not empty before reading.
            if (filesize($file) > 0) {
                // Read the file and decode the JSON; fall back to an
                // empty array if the contents can't be decoded.
                $results = json_decode(fread($fileHandler, filesize($file)), true) ?? [];
            } else {
                $results = [];
            }
        }

        $url = $resource->getUri()->toString();
        $statusCode = $resource->getResponse()->getStatusCode();

        $results[$url] = $statusCode;

        // Move the pointer to the beginning of the file and truncate it, so no
        // stale bytes are left behind if the new JSON is shorter than the old contents.
        rewind($fileHandler);
        ftruncate($fileHandler, 0);

        // Write to file.
        fwrite($fileHandler, json_encode($results));

        // Close the file handler.
        fclose($fileHandler);
    }

    #[\Override]
    public function current(): Resource
    {
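        // Note: persist() above stores plain JSON rather than serialized Resource
        // objects, so this unserialize()-based current() won't return a usable
        // Resource if the handler is iterated; it's only here to satisfy the
        // iterator contract of the parent class.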
        return unserialize($this->getIterator()->current()->getContents());
    }
}
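
For reference, this is how I hook it into the spider (assuming the parent FilePersistenceHandler takes the results directory in its constructor, like the serialized-file handler in the README does; the path is a placeholder):

// Replace the default persistence handler with the JSON one.
$spider->getDownloader()->setPersistenceHandler(
    new JsonPersistenceHandler(__DIR__ . '/results')
);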

And this kinda works. The only thing is that I don't get all the links from the site, only 125 of them.

Can the crawler get the sitemap.xml and try to parse that to get all the links?

mvdbos commented 4 months ago

@dingo-d Currently the spider does not support parsing the sitemap.xml.
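
If the immediate goal is just "are all my pages healthy", one workaround outside of php-spider would be to fetch the sitemap yourself and check each URL directly with Guzzle (which php-spider already depends on). A sketch, assuming a flat sitemap.xml with <url><loc> entries; a sitemap index would need one more level of parsing:

<?php

use GuzzleHttp\Client;

$ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$client = new Client(['http_errors' => false]); // don't throw on 4xx/5xx responses

// Fetch and parse the sitemap, then HEAD every listed URL and record its status.
$xml = simplexml_load_string((string) $client->get('https://example.com/sitemap.xml')->getBody());

$statuses = [];
foreach ($xml->children($ns)->url as $entry) {
    $url = (string) $entry->children($ns)->loc;
    $statuses[$url] = $client->head($url)->getStatusCode();
}

file_put_contents('sitemap-report.json', json_encode($statuses, JSON_PRETTY_PRINT));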

Your approach with a custom persistence handler in combination with the link checker seems correct.

Are you sure there are more than 125 links on the page/website? If so, a few things to check:

- Are you sure your XPathExpressionDiscoverer is configured correctly to find all links? Are all links in the DOM on page render, or are some added later with JavaScript? Those won't be found, since PHP-spider does not use a headless browser.
- Did you set a downloadLimit on the Downloader?
- Did you leave some of the filters in place, such as UriWithHashFragmentFilter or UriWithQueryStringFilter? With those in place, URLs with fragments or query strings are skipped.
- Did you set the maxDepth on the discovererSet? If so, and it is 1, the spider will limit itself to the current page and siblings, and not descend further.

Interested to hear what you find.

dingo-d commented 4 months ago

> Are you sure your XPathExpressionDiscoverer is configured correctly to find all links? Are all links in the DOM on page render, or are some added later with JavaScript? Those won't be found, since PHP-spider does not use a headless browser.

The site is a WordPress site, so all links should be present in the server-rendered HTML. But it's good to know about JS-added ones 👍🏼

> Did you set a downloadLimit on the Downloader?

Yup, added $spider->getDownloader()->setDownloadLimit(1000000);

> Did you leave some of the filters in place, such as UriWithHashFragmentFilter or UriWithQueryStringFilter? With those in place, URLs with fragments or query strings are skipped.

These are the filters I added:

$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(array('https')));
$spider->getDiscovererSet()->addFilter(new AllowedHostsFilter(array($seed), $allowSubDomains));
$spider->getDiscovererSet()->addFilter(new UriWithHashFragmentFilter());
$spider->getDiscovererSet()->addFilter(new UriWithQueryStringFilter());
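
If it turns out the query-string URLs matter (e.g. ?p=123 style permalinks), I guess the fix is simply not to add that filter:

// Sketch: same setup as above, but without UriWithQueryStringFilter, so URLs
// with query strings are crawled too. The hash-fragment filter can stay, since
// #fragments point into the same document anyway.
$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(array('https')));
$spider->getDiscovererSet()->addFilter(new AllowedHostsFilter(array($seed), $allowSubDomains));
$spider->getDiscovererSet()->addFilter(new UriWithHashFragmentFilter());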

> Did you set the maxDepth on the discovererSet? If so, and it is 1, the spider will limit itself to the current page and siblings, and not descend further.

I had $spider->getDiscovererSet()->maxDepth = 2;. I tried setting it to something like 10, but that took too long; even with 2, the crawler ran for over an hour and still didn't finish 😅

All in all, I did get the JSON file, and it contained some 503 statuses.

The idea was to use it as a site-health checker.
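
And for completeness, this is roughly how I plan to turn the data.json written by the handler above into that health report (plain PHP; the path is a placeholder for wherever the handler writes its results):

<?php

// Load the URL => status map written by JsonPersistenceHandler.
$results = json_decode(file_get_contents('path/to/results/data.json'), true);

// Keep only URLs that did not come back with a 2xx status.
$broken = array_filter($results, fn ($status) => $status < 200 || $status >= 300);

foreach ($broken as $url => $status) {
    echo "$status  $url\n";
}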