roach-php / core

The complete web scraping toolkit for PHP.
https://roach-php.dev

parse method is called even on response status 404 #233

Closed rabol closed 8 months ago

rabol commented 8 months ago

Describe the bug The parse() method of a spider is called even if an invalid URL is passed in and the response status code is 404.

Expected behavior Parsing should be skipped when the response status is not 200.

Package versions (please complete the following information):

roach-php/laravel v3.1.0

Additional context Spider class:

<?php

namespace App\Spiders;

use Generator;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Extensions\LoggerExtension;
use RoachPHP\Extensions\StatsCollectorExtension;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Spider\ParseResult;
use Symfony\Component\DomCrawler\Crawler;

class BBCGoodFood extends BasicSpider
{
    public array $startUrls = [
        //
    ];

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
    ];

    public array $spiderMiddleware = [
        //
    ];

    public array $itemProcessors = [
        //
    ];

    public array $extensions = [
        LoggerExtension::class,
        StatsCollectorExtension::class,
    ];

    public int $concurrency = 2;

    public int $requestDelay = 1;

    /**
     * @return Generator<ParseResult>
     */
    public function parse(Response $response): Generator
    {
        yield $this->item([
            'title' => $response->filterXPath('//div[@class="headline post-header__title post-header__title--masthead-layout"]/h1[@class="heading-1"]')->text(),
            'ingredients' => $response->filterXPath('//section[contains(@class, "recipe__ingredients" )]/section/ul[@class="list"]/li')->each(
                fn (Crawler $entry) => $entry->text()),
            'instructions' => $response->filterXPath('//section[contains(@class, "recipe__method-steps" )]/*/*/li')->each(
                fn (Crawler $entry) => $entry->text()),

        ]);
    }
}

started like this:

        $result = Roach::collectSpider(
            BBCGoodFood::class,
            new Overrides(startUrls: ['https://example.com/does-not-exists']),
        );

this will give you a "The current node list is empty." error, because the XPath filters match nothing in the 404 response body.
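As a workaround until the framework drops error responses itself, the spider can bail out of parse() early for non-200 responses so the XPath filters never run against an error page. A minimal sketch, assuming Roach's Response exposes the status code via a getStatus() method (verify against your installed roach-php version):

```php
/**
 * @return Generator<ParseResult>
 */
public function parse(Response $response): Generator
{
    // Skip parsing entirely for non-successful responses; yielding
    // nothing just produces no items for this request.
    if ($response->getStatus() !== 200) {
        return;
    }

    yield $this->item([
        // ... extraction as before ...
    ]);
}
```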

ksassnowski commented 8 months ago

Thanks for pointing this out, this is something I've been meaning to get around to since forever. I just released a new version which adds a HttpErrorMiddleware that automatically drops responses outside the 200-300 range. This middleware is enabled by default if your spider extends BasicSpider (provided you're not overriding the $downloaderMiddleware property).

Here are the relevant docs which also explain the available configuration options: https://roach-php.dev/docs/downloader-middleware#handling-http-errors

rabol commented 8 months ago

Thanks!

Small note: the stub that is used when one creates a spider should not "override" the BasicSpider properties by default.

e.g.

In my case, I now have to double-check all my spiders, because BasicSpider now has this:

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        HttpErrorMiddleware::class,
    ];

and my spider created by Roach has this:

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
    ];