teamtnt / laravel-scout-tntsearch-driver

Driver for Laravel Scout search package based on https://github.com/teamtnt/tntsearch

Fantastic package but having issues importing large data (Memory exhausted) #81

Closed BirdyUK closed 3 months ago

BirdyUK commented 7 years ago

Hi, I am having issues importing a large database of 30+ million records.

I am constantly getting this issue when I reach approximately 1 million+ rows:

Processed 1730000 rows
Commited
Processed 1731000 rows
Processed 1732000 rows
Processed 1733000 rows
Processed 1734000 rows
Processed 1735000 rows
PHP Fatal error:  Allowed memory size of 268435456 bytes exhausted (tried to allocate 44 bytes) in /www/default/address/vendor/teamtnt/tntsearch/src/Indexer/TNTIndexer.php line 

[Symfony\Component\Debug\Exception\FatalErrorException] Allowed memory size of 268435456 bytes exhausted (tried to allocate 150994944 bytes)

The server has 128 GB of RAM and dual Xeon E5s, so memory should not be an issue.

Is there anything you can suggest? The command I run is: php artisan tntsearch:import App\\Addresses

Once I get this resolved and have the data / search running, I will certainly become a sponsor of this project, as it has/will save me a lot of headaches with the project I am trying to get this working on :)

UPDATE: Using the following command to increase the memory limit on the fly helped get the import to 8 million rows, but the same issue occurred:

php -d memory_limit=4048M artisan tntsearch:import App\\Addresses

The error always points back to TNTIndexer.php. Is there a way we can make this more memory friendly?

Thanks

nticaric commented 7 years ago

@BirdyUK you have a pretty good server :) In order to speed things up, the indexer does everything in memory. However, on a very large dataset, sooner or later, you'll run out of memory.

That's why there is an option to turn this off. Take a look here

https://github.com/teamtnt/tntsearch/blob/master/src/Indexer/TNTIndexer.php#L34

The flag is called inMemory and it should be set to false in your case. This should prevent the system from running out of memory. Since it's not possible to tweak this in the current version of Laravel Scout, my suggestion is to modify the source of the TNTSearch package directly and see if this helps resolve your issue.
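
As a rough sketch of that temporary workaround (the property and its default come from the linked line; the exact line number and visibility may differ in your installed version), the edit in vendor/teamtnt/tntsearch/src/Indexer/TNTIndexer.php would look something like:

// TNTIndexer.php -- switch the indexer from in-memory to on-disk indexing
public $inMemory = false; // default is true; false is slower but keeps memory usage bounded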

If you successfully resolve it, we'll add it to the configuration. The only thing I am currently not sure about is if we should add it to the global configuration or to the model itself as a property.

The asYouType property makes more sense in the model. It would look something like

<?php

namespace App;

use Illuminate\Database\Eloquent\Model;
use Laravel\Scout\Searchable;

class Post extends Model
{
    use Searchable;
    public $asYouType = true;
    protected $inMemory = true; // not sure if this should be set globally or not
}

Let us know what you think.

A 30-million-record dataset is a really good showcase to see how well TNTSearch performs. Maybe we could give you some pointers if you share your Address table structure with us.

For example, the command php artisan tntsearch:import App\\Addresses will also import the timestamps, which aren't necessary and slow down the indexing.
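
One way to keep the timestamps (and any other unused columns) out of the index is to override Scout's toSearchableArray() on the model. A minimal sketch, with illustrative column names:

<?php

namespace App;

use Illuminate\Database\Eloquent\Model;
use Laravel\Scout\Searchable;

class Addresses extends Model
{
    use Searchable;

    // Index only the columns that are actually searched;
    // created_at / updated_at and other bookkeeping fields are left out.
    public function toSearchableArray()
    {
        return [
            'id'       => $this->id,
            'postcode' => $this->postcode,
            'address'  => $this->address,
        ];
    }
}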

BirdyUK commented 7 years ago

Hi @nticaric - Just to inform you, I ran php -d memory_limit=-1 artisan ....

And, touch wood, I am now at 20+ million with no issues whatsoever, and fuzzy search is returning results in 40ms :) 👍 -- Very excited to see what the end result will be :)

The database table holds 35 million rows, I think, so I will let you know more when the import finishes :)

My plan is to offer a free address lookup / postcode lookup / address finder API for UK addresses, as all the paid services charge at least 2 pence per lookup. The help of this package has hopefully made that easier, so sponsorship is certain :)

P.S.: Once this import finishes I will do some further testing, import a second table with some advice from you, and see what works out better. I will also modify the source, see what difference it makes, and monitor system stats.

Test table for addresses is: id postcode address city lat lng created_at updated_at

The production table, which will be the next import I do with TNTSearch, will hopefully be: id postcode address

No timestamps are needed. All the data in the table is static and will rarely change; we only add new data every quarter, and it is very minimal: new addresses and postcodes, or removal of deprecated addresses and postcodes.

[screenshot: tntsearch-import]

nticaric commented 7 years ago

Glad that it's working out for you. Let us know the results. I'll try to add the asYouType functionality as a runtime option.

BirdyUK commented 7 years ago

@nticaric - Do you have a way I can contact you for private communication? It would be great to get some thoughts on the project I'm working on. I will get some funds over to you via the monthly sponsorship on Monday to help with your overheads and enable you to keep working on TNTSearch :)

BirdyUK commented 7 years ago

@nticaric - Unfortunately it crashed at 29 million. After doing some investigation, it seems that when importing the large database the CPU instantly shoots to 100% the minute you start the import and sits there.

I'm using an optimized nginx PHP 7 setup and that's the first time I have ever seen the CPU at 100%. Memory usage in the stats seemed to remain very low. That said, I did not fancy watching it import another 30 million records to see what the CPU / memory usage is like at the point of the crash, though I could write a script to dump the log. I also think it hits 100% the second you perform a query as well, and drops back to normal when the results are returned.

The second concern I now have is that when I query the 29 million records that were imported, it can take a few seconds to return results, depending on whereabouts in the index the search term is stored.

For example, postcodes are stored alphabetically, so if I search AB10 the query is very fast and returns the result set within 20ms...

If I query ZE10AQ it takes around 5 seconds, and you can see the delay increase as you dig further through the queries... An example would be:

AB10 (Returns in 20ms)

DU19 (Returns in 1 second)

S64 (Returns in 2 seconds)

ZE10 (Returns in 3+ seconds)

It is the same for keywords such as (Main Street) or (London). Is there anything we could do to improve this? It would be great to get some support to help achieve the end result, and it will be an excellent demo / showcase for TNTSearch to show what is possible :)

After all, doing full-text and fuzzy searching on 30+ million addresses, with a response time fast enough to make address autocompletion possible, is a great achievement :)

nticaric commented 7 years ago

@BirdyUK drop us an email to info@tntstudio.hr and we'll help you to optimize this.

BirdyUK commented 7 years ago

@nticaric - Will do, mate. I will send you an email later tonight and will post a comment here to let you know when I have sent it. Look forward to speaking soon, mate. Cheers :)

nticaric commented 7 years ago

Just released a new version that allows you to set the asYouType option at runtime. What you should try is to turn off fuzziness in the config and enable asYouType:

<?php

namespace App;

use Illuminate\Database\Eloquent\Model;
use Laravel\Scout\Searchable;

class Address extends Model
{
    use Searchable;
    public $asYouType = true;
}
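
For reference, fuzziness is toggled in the published Scout configuration; a sketch of the relevant block, assuming key names along these lines (check your own config/scout.php, as the exact keys may differ between driver versions):

<?php
// config/scout.php (sketch)

return [
    // ...
    'tntsearch' => [
        'storage'   => storage_path(), // where the index files live
        'fuzziness' => false,          // turn fuzzy matching off, as suggested above
    ],
];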

When you make the changes, what is the execution time?

nticaric commented 7 years ago

Forgot to ask, what was the indexing time for those 20+ million records?

BirdyUK commented 7 years ago

@nticaric - Sorry for the delay in getting back to you; I have been really busy working on some projects. I will send an email over today to info@tntstudio.hr and hopefully I can get your help with getting the search functionality into a production state, and also help you out with some funds to support you with TNTSearch. I will drop you an email and hopefully we can talk via a chat platform. -- Look forward to speaking with you soon. Thanks.

PseudoAj commented 7 years ago

Issue

I am facing the exact same issue. I have about 1.1 million rows that I am trying to index with tntsearch:import:

php artisan tntsearch:import App\\Correspondence

After 300K records I got the following error:

Commited
Processed 291000 rows
Processed 292000 rows
Processed 293000 rows
Processed 294000 rows
Processed 295000 rows
Processed 296000 rows
Processed 297000 rows
Processed 298000 rows
Processed 299000 rows
Processed 300000 rows
Commited
Processed 301000 rows
Processed 302000 rows
Processed 303000 rows
Processed 304000 rows
Processed 305000 rows
Processed 306000 rows
Processed 307000 rows 
PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 75497472 bytes) in /var/www/html/project-css/vendor/teamtnt/tntsearch/src/Indexer/TNTIndexer.php on line 393

  [Symfony\Component\Debug\Exception\FatalErrorException]
  Allowed memory size of 134217728 bytes exhausted (tried to allocate 75497472 bytes)

Environment

Following is my project development environment:

  1. Vagrant CentOS
  2. Laravel 5.3
  3. Scout 2.0.*
  4. teamtnt/laravel-scout-tntsearch-driver 2.0.*

Question

For my application I don't want to increase the memory limit in php.ini as suggested here, or run artisan with the memory option like @BirdyUK did in his case. Can I go ahead and set protected $inMemory in the model yet, or do I have to go and manipulate the source?

jackie-d commented 7 years ago

I get the same error with Laravel 5.4 and Scout with TNTSearch, trying to import 3M rows.

I would really like to be able to set $inMemory = false; on my model; that way it would be really modular.

For the moment I'm running php -d memory_limit=-1 artisan tntsearch:import App\\MyModel on my dev environment, which is a MacBook Air with 8 GB of RAM, and it's completing successfully.