roach-php / core

The complete web scraping toolkit for PHP.
https://roach-php.dev

Duplicate requests being dispatched even with RequestDeduplicationMiddleware in place #36

Open awebartisan opened 2 years ago

awebartisan commented 2 years ago

I have a list of URLs in the database and I'm scraping specific information from these URLs. I have split them into batches of 50 and dispatch a job for each batch, passing the database offset to start from.

Each job fetches its 50 URLs from the database and the spider starts sending requests: 2 concurrent requests with a 1-second delay. At some point it starts sending duplicate requests, as can be seen below, and the deduplication middleware doesn't report or drop them. Not sure what's going on here. Any thoughts?

[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}
awebartisan commented 2 years ago

Is it possible that multiple instances of the same spider are sharing the same requests?

ksassnowski commented 2 years ago

Are these logs from multiple spider runs or are they all from the same run? The RequestDeduplicationMiddleware only looks at requests that have been sent during the current run. So if you start multiple spiders with the same URLs, they will all scrape the same site.

My first guess would be that you are dispatching multiple jobs at the same time and they all query the same records from the database. Can you maybe show what the code that dispatches your jobs looks like?

awebartisan commented 2 years ago

This is how I am dispatching jobs from a console command.


    public function handle(): int
    {
        for ($offset = 1; $offset <= 1000; $offset = $offset + 50) {
            dispatch(new ScrapeStoreSocialLinksJob($offset));
        }

        return 0;
    }

Below is what my job looks like:

    public $timeout = 300;

    public function __construct(public int $offset)
    {}

    public function handle()
    {
        Roach::startSpider(StoreSocialLinksSpider::class, context: ['offset' => $this->offset]);
    }

These logs are from different runs, but I can see from the logs that these runs start at the same time and end at the same time.

I have even tried chaining these jobs so that the next job is only dispatched after the previous one completes, but I still get duplicate runs.

ksassnowski commented 2 years ago

Can you show what the initialRequests method of your spider looks like?

awebartisan commented 2 years ago
    protected function initialRequests(): array
    {
        return ShopifyStore::query()
            ->offset($this->context['offset'])
            ->limit(50)
            ->get()
            ->map(function (ShopifyStore $shopifyStore) {
                $request = new Request(
                    'GET',
                    "https://" . $shopifyStore->url,
                    [$this, 'parse']
                );
                return $request->withMeta('store_id', $shopifyStore->id);
            })->toArray();
    }

Some behaviour I noticed in the logs. Here are the run statistics:

[2022-04-25 05:31:36] local.INFO: Run statistics {"duration":"00:00:57","requests.sent":150,"requests.dropped":0,"items.scraped":146,"items.dropped":0}
[2022-04-25 05:31:36] local.INFO: Run statistics {"duration":"00:00:57","requests.sent":100,"requests.dropped":0,"items.scraped":98,"items.dropped":0}
[2022-04-25 05:31:36] local.INFO: Run statistics {"duration":"00:00:57","requests.sent":50,"requests.dropped":0,"items.scraped":48,"items.dropped":0}
[2022-04-25 05:31:36] local.INFO: Run finished
[2022-04-25 05:31:36] local.INFO: Run finished
[2022-04-25 05:31:36] local.INFO: Run finished
ksassnowski commented 2 years ago

This may be a silly question, but does your ShopifyStore model contain any duplicates? I can't really see what could be going wrong otherwise. It's also a little strange how the requests.sent and items.scraped both change by exactly 50 (which is also your limit). Does your parse method dispatch additional requests for certain responses?

awebartisan commented 2 years ago

After your comment I went ahead and checked the table for duplicates. There were indeed some, and I removed them.

But the problem is still happening.

Below is my Spider's full source code:

<?php

namespace App\Spiders;

use App\Extractors\Stores\AssignCategory;
use App\Extractors\Stores\ExtractContactUsPageLink;
use App\Extractors\Stores\ExtractDescription;
use App\Extractors\Stores\ExtractFacebookProfileLink;
use App\Extractors\Stores\ExtractInstagramProfileLink;
use App\Extractors\Stores\ExtractLinkedInProfileLink;
use App\Extractors\Stores\ExtractTikTokProfileLink;
use App\Extractors\Stores\ExtractTitle;
use App\Extractors\Stores\ExtractTwitterProfileLink;
use App\Models\ShopifyStore;
use App\Processors\SocialLinksDatabaseProcessor;
use Generator;
use Illuminate\Pipeline\Pipeline;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Extensions\LoggerExtension;
use RoachPHP\Extensions\StatsCollectorExtension;
use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Spider\ParseResult;

class StoreSocialLinksSpider extends BasicSpider
{
    public array $startUrls = [
        //
    ];

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
    ];

    public array $spiderMiddleware = [
        //
    ];

    public array $itemProcessors = [
        //SocialLinksDatabaseProcessor::class,
    ];

    public array $extensions = [
        LoggerExtension::class,
        StatsCollectorExtension::class,
    ];

    public int $concurrency = 2;

    public int $requestDelay = 1;

    /**
     * @return Generator<ParseResult>
     */
    public function parse(Response $response): Generator
    {
        $storeData = [
            'store_id' => $response->getRequest()->getMeta('store_id')
        ];

        [, $storeData] = app(Pipeline::class)
            ->send([$response, $storeData])
            ->through([
                ExtractTitle::class,
                ExtractDescription::class,
                ExtractTwitterProfileLink::class,
                ExtractFacebookProfileLink::class,
                ExtractInstagramProfileLink::class,
                ExtractTikTokProfileLink::class,
                ExtractLinkedInProfileLink::class,
                ExtractContactUsPageLink::class
            ])
            ->thenReturn();

        yield $this->item($storeData);
    }

    protected function initialRequests(): array
    {
        return ShopifyStore::query()
            ->offset($this->context['offset'])
            ->limit(50)
            ->get()
            ->map(function (ShopifyStore $shopifyStore) {
                $request = new Request(
                    'GET',
                    "https://" . $shopifyStore->url,
                    [$this, 'parse']
                );
                return $request->withMeta('store_id', $shopifyStore->id);
            })->toArray();
    }
}

The parse() method is not dispatching any additional requests.

My thinking here is that something is going on with the spider's instance and the container.
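For context on how the pipeline in parse() works: each extractor class listed in through() is presumably a Laravel Pipeline stage that receives the [$response, $storeData] pair, adds its field, and passes the pair along. A minimal, runnable sketch of one such stage (ExtractTitleStage and StubResponse are hypothetical names; the real extractors would query the DOM via Roach's Response):

```php
<?php

// Hypothetical sketch of one extractor stage such as ExtractTitle.
// A stub response stands in for Roach's Response so the sketch is
// self-contained; the real stage would query the DOM instead.

final class StubResponse
{
    public function __construct(private string $title) {}

    public function title(): string
    {
        return $this->title;
    }
}

final class ExtractTitleStage
{
    public function handle(array $passable, \Closure $next): array
    {
        [$response, $storeData] = $passable;
        $storeData['name'] = $response->title(); // real code would parse the page
        return $next([$response, $storeData]);
    }
}

// Chain stages by hand, the way Illuminate\Pipeline does under the hood.
$stage = new ExtractTitleStage();
[, $storeData] = $stage->handle(
    [new StubResponse('Brooklinen'), ['store_id' => 260]],
    fn (array $p): array => $p // terminal "then" callback
);
// $storeData is now ['store_id' => 260, 'name' => 'Brooklinen']
```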

ksassnowski commented 2 years ago

So my thinking is that the spiders aren't actually sending duplicate requests, but that the extensions (the Logger and StatsCollector, specifically) are reacting to events from different spiders. A couple more questions:

awebartisan commented 2 years ago

Hey @ksassnowski , you are right about the second part. In my SocialLinksDatabaseProcessor I am not getting duplicate items for the duplicate URLs.

So your thinking about the extensions like Logger and StatsCollector sounds right to me.

code-poel commented 2 years ago

Just wanted to chime in that I'm experiencing something similar. I have two spiders being executed from a single Laravel Command. Executing one (or the other) results in the StatsCollector outputting expected results. However, if I have both spiders executed, I get a third output of the StatsCollector output that looks like a combination of both. Even if I put a sleep(5) between their execution in the Command, the third, cumulative StatsCollector output occurs...

ksassnowski commented 2 years ago

I understand why this happens in your case, @code-poel. Assuming your handle method looks something like this

    public function handle()
    {
        Roach::startSpider(MySpider1::class);
        Roach::startSpider(MySpider2::class);
    }

This is because the EventDispatcher that all extensions rely on gets registered as a singleton. So every spider you run in the same PHP "process" will essentially register its extensions as event listeners again. That's why I was wondering if @awebartisan used Laravel Octane or something similar. It sounded like his commands only spawn a single spider per command so that shouldn't happen.
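The effect of the shared dispatcher can be reproduced with a minimal, self-contained sketch (these are not Roach's actual classes, just an illustration of the mechanism): every run registers its listeners again on the same singleton, so later runs trigger all previously registered listeners too.

```php
<?php

// Minimal illustration of why a singleton event dispatcher duplicates
// extension output: each run re-registers its listeners on the shared
// instance, so one event fans out to every run's listeners.

class EventDispatcher
{
    /** @var array<string, list<callable>> */
    private array $listeners = [];

    public function listen(string $event, callable $listener): void
    {
        $this->listeners[$event][] = $listener;
    }

    public function dispatch(string $event, mixed $payload = null): void
    {
        foreach ($this->listeners[$event] ?? [] as $listener) {
            $listener($payload);
        }
    }
}

// Shared "singleton", as if resolved from the container.
$dispatcher = new EventDispatcher();

$log = [];
$startSpider = function (string $name) use ($dispatcher, &$log): void {
    // Each run registers its LoggerExtension-like listener again.
    $dispatcher->listen('request.dispatched', function ($uri) use ($name, &$log) {
        $log[] = "[$name] Dispatching request $uri";
    });
    $dispatcher->dispatch('request.dispatched', 'https://example.com');
};

$startSpider('spider-1'); // 1 listener -> 1 log line
$startSpider('spider-2'); // 2 listeners -> 2 more log lines

// Only 2 requests were "sent", but 3 log lines were produced.
```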

ksassnowski commented 2 years ago

The solution might be to assign every run a unique id and include that as part of the event payload. Then I could scope the events and all corresponding handlers to just that id, even if multiple spiders get started in the same process. I have to check if this can be done without a BC break.
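One way the proposed fix could look, sketched as plain PHP (a hypothetical dispatcher, not Roach's actual API): every listener is registered against a run id, and dispatch only invokes the listeners whose id matches the emitting run.

```php
<?php

// Hypothetical sketch of run-id scoping: tag every event with the id of the
// run that emitted it, and have each handler ignore events from other runs.

final class RunScopedDispatcher
{
    /** @var list<array{runId: string, listener: callable}> */
    private array $listeners = [];

    public function listen(string $runId, callable $listener): void
    {
        $this->listeners[] = ['runId' => $runId, 'listener' => $listener];
    }

    public function dispatch(string $runId, mixed $payload): void
    {
        foreach ($this->listeners as $entry) {
            // Only handlers registered for this run react to the event.
            if ($entry['runId'] === $runId) {
                ($entry['listener'])($payload);
            }
        }
    }
}

$dispatcher = new RunScopedDispatcher();
$seen = [];

// Two runs in the same process, each with its own unique id.
$runA = bin2hex(random_bytes(8));
$runB = bin2hex(random_bytes(8));

$dispatcher->listen($runA, function ($uri) use (&$seen) { $seen['A'][] = $uri; });
$dispatcher->listen($runB, function ($uri) use (&$seen) { $seen['B'][] = $uri; });

$dispatcher->dispatch($runA, 'https://brooklinen.com');
$dispatcher->dispatch($runB, 'https://taotronics.com');

// Each run only sees its own events: no cross-talk in logs or stats.
```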

code-poel commented 2 years ago

> I understand why this happens in your case, @code-poel. [...] This is because the EventDispatcher that all extensions rely on gets registered as a singleton. So every spider you run in the same PHP "process" will essentially register its extensions as event listeners again.

Yup, that's exactly right. Thanks for the clarification on the root cause!

wengooooo commented 1 year ago

This bug has existed for more than a year. Why hasn't it been fixed by now?

ksassnowski commented 1 year ago

Because no one has opened a PR yet to fix it.