teamtnt / laravel-scout-tntsearch-driver

Driver for Laravel Scout search package based on https://github.com/teamtnt/tntsearch
MIT License

Import gets progressively slower #317

Closed: MikePriceAero closed this issue 3 months ago

MikePriceAero commented 3 years ago

I am importing two models. Each extends a base model from a package, and the extension lives elsewhere in my application because I cannot add the Searchable trait to the package models directly. One has 600,000 rows and the other 800,000. Each has a single relation that is eager-loaded via protected $with and then referenced in toSearchableArray():

public function toSearchableArray()
{
    $addressArray = [];

    $this->addresses->each(function ($address) use (&$addressArray) {
        $addressArray[] = $address->line_1 . ' ' . $address->city . ' ' . $address->postcode;
    });

    return [
        'id' => $this->id,
        'name' => $this->name,
        'email' => $this->email,
        'emailPlain' => str_replace(['.', '@'], ' ', $this->email),
        'account_id' => $this->account_id,
        'addresses' => implode(' ', $addressArray),
    ];
}
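
For context, this is roughly how the Searchable trait ends up on these models; the class and namespace names below are simplified stand-ins for the real ones:

use Laravel\Scout\Searchable;
use Vendor\Package\Models\Contact as PackageContact; // placeholder for the real package model

// App-level model that extends the package model so the Searchable trait
// can be attached without modifying the package itself.
class Contact extends PackageContact
{
    use Searchable;

    // Eager-load the relation referenced in toSearchableArray().
    protected $with = ['addresses'];

    // toSearchableArray() as shown above lives on this class.
}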

I have already reduced the Scout chunk size to 100 in the hope of avoiding this ticket, but unfortunately it did not help.
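
For reference, the chunk sizes live in config/scout.php; this is roughly what I have set at the moment:

// config/scout.php -- chunk sizes used by scout:import and scout:flush
'chunk' => [
    'searchable' => 100,   // reduced from the default of 500
    'unsearchable' => 100,
],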

When I run php artisan scout:import <model> it starts really well, flying through 100 records at a time at around 3,000 records a minute, and it looks like it will be finished in no time at all. Unfortunately, the longer the import runs, the longer each chunk takes to process. As I write this, the last 100 records on one of the imports took around 12 seconds, a drop from 6,000 records/minute to around 500, and we have not even reached 100,000 records on the larger import yet. If it continues to slow at this rate it will still be running when I get up in the morning.

What can I do to ensure a consistent speed is maintained? I've read about MySQL struggling with large OFFSET values, but the Scout version we are on already uses chunkById, so that does not appear to be the problem.
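
As I understand it, the difference is roughly this (the Contact model and the queries below are simplified stand-ins):

// Offset pagination degrades as the offset grows, because MySQL still has to
// scan and discard every skipped row:
//   SELECT * FROM contacts ORDER BY id LIMIT 100 OFFSET 600000;
//
// chunkById() seeks on the primary key instead, so every page costs about the same:
//   SELECT * FROM contacts WHERE id > 600000 ORDER BY id LIMIT 100;
Contact::query()->chunkById(100, function ($contacts) {
    // process a chunk of 100 records
});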

nticaric commented 3 years ago

Off the top of my head, I would say the problem might be low-cardinality data, meaning you have a lot of identical cities, postcodes, and so on.

The slowdown in this case wouldn't continue linearly but logarithmically, so the insertion rate will probably settle at around 500.

To further clarify what's going on: when a term is inserted into the index, TNTSearch simply does a regular INSERT of the term together with its number of occurrences and the documents in which it occurred. If you then try to insert the same term again, the following happens:

The INSERT is attempted inside a try/catch block. Since the term already exists, it throws an exception, and we check whether it was a unique-constraint violation. We then do a SELECT to fetch the existing row, followed by an UPDATE to increment the term counts.
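
Sketched in PHP/PDO terms, it looks roughly like this ($db, $term and $hits are assumed to be in scope, and the wordlist table and its columns are an approximation of the index schema, not the exact code):

try {
    // Optimistic path: assume the term is new and insert it.
    $insert = $db->prepare(
        'INSERT INTO wordlist (term, num_hits, num_docs) VALUES (:term, :hits, 1)'
    );
    $insert->execute(['term' => $term, 'hits' => $hits]);
} catch (PDOException $e) {
    // A unique-constraint failure means the term already exists,
    // so fetch the existing row and bump its counters instead.
    $select = $db->prepare('SELECT num_hits, num_docs FROM wordlist WHERE term = :term');
    $select->execute(['term' => $term]);
    $row = $select->fetch(PDO::FETCH_ASSOC);

    $update = $db->prepare(
        'UPDATE wordlist SET num_hits = :hits, num_docs = :docs WHERE term = :term'
    );
    $update->execute([
        'term' => $term,
        'hits' => $row['num_hits'] + $hits,
        'docs' => $row['num_docs'] + 1,
    ]);
}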

This is basically an upsert operation, which at the time this code was written wasn't supported in SQLite. That is why it is wrapped in a try/catch block, and why it is slow. In the meantime, upsert support has been added to SQLite and will be used in the next releases of TNTSearch, which will drastically improve performance.
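
For comparison, with SQLite's ON CONFLICT clause (available since SQLite 3.24) the same step collapses into a single statement, roughly:

// One round trip: insert a new term, or bump the counters if it already exists.
$upsert = $db->prepare(
    'INSERT INTO wordlist (term, num_hits, num_docs) VALUES (:term, :hits, 1)
     ON CONFLICT(term) DO UPDATE SET
         num_hits = num_hits + excluded.num_hits,
         num_docs = num_docs + 1'
);
$upsert->execute(['term' => $term, 'hits' => $hits]);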

The TNTSearch package itself has some further improvements for this case over the Scout driver, since those additions can all be done in memory. To benefit from them, though, you have to use the package directly for index creation. That should speed things up significantly; the TNTSearch documentation has examples of how to build an index this way.
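
A rough sketch of building an index with TNTSearch directly, following the pattern from its README (the connection details, storage path and query below are placeholders):

use TeamTNT\TNTSearch\TNTSearch;

$tnt = new TNTSearch;

// Connection to the database being indexed, plus where to store the
// generated SQLite index file.
$tnt->loadConfig([
    'driver'   => 'mysql',
    'host'     => 'localhost',
    'database' => 'my_database',
    'username' => 'user',
    'password' => 'secret',
    'storage'  => storage_path('indexes'),
]);

$indexer = $tnt->createIndex('contacts.index');
$indexer->query('SELECT id, name, email FROM contacts;');
$indexer->run();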

You might also want to check out our new analytics package, but you don't have to.

MikePriceAero commented 3 years ago

I'll check out whether it's possible to build the index natively and let you know how I get on - thanks for your reply. 🙂