meilisearch / meilisearch

A lightning-fast search API that fits effortlessly into your apps, websites, and workflow
https://www.meilisearch.com
MIT License

Migrating 40M records from Postgres to Meilisearch takes too long #1256

Closed aaqibjavith closed 2 years ago

aaqibjavith commented 3 years ago

I am trying to migrate 40M Postgres records (user comments) to Meilisearch. On each iteration, I query 10K records from Postgres and write them to Meilisearch. The migration script has finished pushing documents, but it has been more than 3 days and only 4.2 million records have been migrated; the count is increasing very slowly.

I also see the following error messages in the error log:

[2021-02-18T12:48:44Z INFO  ureq::unit] sending request POST https://api.amplitude.com/httpapi
[2021-02-18T13:14:58Z ERROR meilisearch_core::database] commit nested transaction failed: Input/output error (os error 5)
[2021-02-18T13:43:50Z ERROR meilisearch_core::database] commit nested transaction failed: Input/output error (os error 5)
[2021-02-18T13:48:45Z INFO  ureq::unit] sending request POST https://api.amplitude.com/httpapi
[2021-02-18T14:13:03Z ERROR meilisearch_core::database] commit nested transaction failed: Input/output error (os error 5)
[2021-02-18T14:42:05Z ERROR meilisearch_core::database] commit nested transaction failed: Input/output error (os error 5)
[2021-02-18T14:48:45Z INFO  ureq::unit] sending request POST https://api.amplitude.com/httpapi
[2021-02-18T15:12:06Z ERROR meilisearch_core::database] commit nested transaction failed: Input/output error (os error 5)
[2021-02-18T15:41:40Z ERROR meilisearch_core::database] commit nested transaction failed: Input/output error (os error 5)
[2021-02-18T15:48:46Z INFO  ureq::unit] sending request POST https://api.amplitude.com/httpapi

Also, I tried to fetch the updates list using the following curl command:

curl 'http://localhost:7700/indexes/comments/updates' | jq

It runs forever and never returns a result from the server.

  1. What does "commit nested transaction failed" mean, and how do I fix it?
  2. Is there a best practice for writing large amounts of data to Meilisearch?
  3. Is there a limit on the amount of data Meilisearch can handle?
  4. Is there an option to disable the Amplitude analytics calls?

curquiza commented 3 years ago

Hello @aaqibjavith!

Thanks for your feedback and for trying MeiliSearch.

The current MeiliSearch search engine is not able to handle such a high number of documents. This is a known issue, and the core team is working hard on a new search engine that might be able to handle this quantity of documents! πŸš€

For the moment, we recommend pushing your documents in larger batches; don't forget to increase the payload limit size accordingly.
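
For example, something like this (a minimal sketch: the index name and file name are placeholders, and the payload limit option is the one described in the configuration docs linked below):

# start MeiliSearch with a larger payload limit (value in bytes, here ~200MB)
./meilisearch --http-payload-size-limit=209715200

# then push a whole batch of documents in a single call
curl -X POST 'http://localhost:7700/indexes/comments/documents' \
  -H 'Content-Type: application/json' \
  --data-binary @comments_batch.json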

Be sure we will keep you informed when the new search engine is available 😁
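
Side note: listing all updates on an index that has a huge number of enqueued updates can take a very long time, which might be why your curl call above hangs. You can poll a single update by its id instead (update id 0 below is just an example value):

curl -s 'http://localhost:7700/indexes/comments/updates/0' | jq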

tmikaeld commented 3 years ago

@curquiza While we wait, how many documents are recommended for the current engine?

curquiza commented 3 years ago

It depends on many parameters, like the size of your documents. The best thing is to try: add your documents in batches (you can increase the default payload size). But in any case, I would say fewer than 1M records.

tmikaeld commented 3 years ago

@curquiza Alright, with that in mind we will wait until the new engine has been released. Do you have an estimate?

curquiza commented 3 years ago

We hope we can release v0.21.0 (containing the new version of the search engine) by the end of June. At the very least, we should be able to provide a release-candidate version of v0.21.0 by the end of June πŸ™‚

tmikaeld commented 3 years ago

That's much earlier than I expected; I'll wait until then and try it out. I'm guessing it will be able to handle many millions of documents?

irevoire commented 3 years ago

If I remember correctly the biggest dataset we are currently using to test the new engine is ~120M documents.

curquiza commented 3 years ago

I'm guessing it will be able to handle many millions of documents?

Yes πŸ™‚

tmikaeld commented 3 years ago

That's amazing, thanks for the feedback!

setop commented 3 years ago

Same here:

I injected 220k documents in 2k seconds. That is around 100 doc/sec, which is cool.

But indexing is currently running at 1.5 doc/sec, hogging 70% of one core of my machine.

This means I'll get my documents indexed after two and a half days.

curquiza commented 3 years ago

Hello @setop! Thanks for the feedback!

Even though the current MeiliSearch has performance troubles, the indexing time you are seeing for your dataset is not expected. I have just indexed 160k documents in less than 5 min on my local machine.

The huge indexing time might come down to the way you push your documents. Do you push them in batches or one by one? Pushing documents in batches drastically increases performance. I pushed my documents in batches of 1800, which means I was able to index them with fewer than 100 enqueued updates in MeiliSearch. You can also increase the default payload size of MeiliSearch: https://docs.meilisearch.com/reference/features/configuration.html#payload-limit-size

However, my documents are small (6 small attributes per document). Maybe yours are really big, which could explain the huge indexing time.

setop commented 3 years ago

I do indeed enqueue them one by one. It is more convenient for me, first because... it is more convenient :), and also because an error can be raised for a single document (during data preparation or when ingesting) and I don't want that error to propagate to its siblings.

Honestly, without digging into the implementation, it is not obvious to me why batching improves things. To me, the indexing itself should cost much, much more than iterating over the input. But it is true that the API suggests ingestion should be done in batches, since it only accepts a list of documents and not a single document.

I have currently managed to ingest 19k documents (though ingestion is getting slower, less than 0.5 doc/sec). I also ran some tests on search, and that part is terrific! Blazing fast, accurate, typo tolerant. Love it.

I'll try modifying my ingestion pipeline to work in batches.
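
Probably something like this (a rough sketch, assuming a documents.ndjson file with one JSON document per line; the file names, batch size, and index name are arbitrary):

# split the input into chunks of 1000 documents
split -l 1000 documents.ndjson chunk_

# wrap each chunk into a JSON array and push it as one update
for f in chunk_*; do
  jq -s '.' "$f" > "$f.json"
  curl -s -X POST 'http://localhost:7700/indexes/comments/documents' \
    -H 'Content-Type: application/json' \
    --data-binary @"$f.json"
done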

setop commented 3 years ago

I retried with batches of 1k documents. It ingested 25k docs in a few seconds :rocket: 259MB of raw data, 1.8GB index size.

curquiza commented 3 years ago

Awesome @setop! I'm going to open an issue in the docs; pushing in batches is not obvious for everyone. To be honest, that's a tip I give multiple times a week, so you're not the only one πŸ™‚ Glad you succeeded in indexing your data! Edit: I added a comment to this already-existing issue: https://github.com/meilisearch/documentation/issues/875#issuecomment-831115956

curquiza commented 3 years ago

Hello @setop, @tmikaeld, @aaqibjavith and everyone following the issue!

The first RC of MeiliSearch v0.21.0 is out. We did our best to fix the indexation and crash issues; we succeeded in improving them, but not in fixing them completely.

You can test this new release by downloading the binaries available in this release, or you can use it with Docker:

docker run -p 7700:7700 getmeili/meilisearch:v0.21.0rc1 ./meilisearch

We will keep improving this after the release of v0.21.0. We would rather release a not-fully-optimized version than delay it and, at the same time, delay the release of new features. Rest assured we are doing our best to keep improving these indexation issues.

As a reminder:

  β€’ Despite the improvements, we still recommend pushing your documents to MeiliSearch in batches and not one by one. The maximum payload size of MeiliSearch is 100MB and can optionally be increased, which means most datasets can be pushed with a single call to the API.
  β€’ If you still get a memory crash, it means the RAM of your machine might not be adapted to your dataset size. We recommend increasing the RAM of your machine.

Thanks for your patience and your help with this! ❀️

quangtam commented 3 years ago

I'm using v0.21.0 and trying to import 28M records. The import itself is very fast, but indexing is still in progress and only 85K records have been indexed so far. In my case, my server has 8GB of RAM and a 4-core CPU.

Do you have any recommendation for the max payload size? I've used the default value.

Thanks & Best regards,

curquiza commented 3 years ago

Hello @quangtam! Thank you for your feedback πŸ˜„

quangtam commented 3 years ago

Hi @curquiza

I've installed v0.22.0rc1 and I'm testing it. I can see it's faster than v0.21.0: after 167 min (2h47m) it has indexed 342K records (~2000 records/min) and is slowing down. For around the first 100k records, the speed was around 9000 records/min. My hardware:

Also, I faced one crash (the process was killed by the system) while indexing was running, but 6 of the 8GB of RAM were still free.

[screenshot omitted: Screen Shot 2021-09-09 at 21 53 19]

curquiza commented 3 years ago

Thank you for your quick report πŸ™‚

It seems really slow for the number of documents, but maybe each document is big. What is the total size of your dataset?

Also, I faced one crash (the process was killed by the system) while indexing was running, but 6 of the 8GB of RAM were still free.

Weird... We had a similar issue (killed by the OS but the RAM was not fully used) because of a lack of space on the disk. Could it be your situation?
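
To check, you can compare the free space on the volume with the size of the MeiliSearch database (a quick sketch; data.ms is the default database directory, adjust the path to your setup):

df -h ./data.ms     # free space on the volume holding the index
du -sh ./data.ms    # current size of the MeiliSearch database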

quangtam commented 3 years ago

Hi @curquiza,

I have ~28M records in MySQL (5.9GB in the DB). I've read your documentation about virtual memory, but I didn't check the hard disk when MeiliSearch crashed. But currently I have 270GB free on the disk (the total is 300GB); should I increase it?

curquiza commented 3 years ago

I have ~28M records in MySQL (5.9GB in the DB)

Ok, I thought your dataset was 300K documents, not 28M.

But currently I have 270GB free on the disk (the total is 300GB); should I increase it?

Hmm, since I don't have your dataset, it's not easy to guess. Maybe increasing it could solve the problem, but I'm not sure at all.

One of our devs is investigating a similar issue right now.

quangtam commented 3 years ago

Hi @curquiza,

Here is my table information

[screenshot omitted: Screen Shot 2021-09-10 at 08 05 04]

quangtam commented 3 years ago

After 4 days, MeiliSearch is still indexing and has processed 2.8M records (version v0.22.0rc1):

=> the new version is faster, but still too slow.

barryvdh commented 2 years ago

As replied in #1098, I'm having similar issues with 0.23:

I'm having the same issue, using the following setup:

  • Laravel Scout
  • MeiliSearch 0.23
  • Using Laravel Forge, MeiliSearch server, AWS m5.xlarge (4cpu, 16GB)
  • ~50 indexes (1 index per tenant/client)
  • ~1 Million documents in total (with 30 attributes, either integers or small strings (<250 chars))
  • Initial import took around 3-4 hour to complete (in batches of 500 documents)
  • Additional updates took around 15 seconds to complete (checking the status on the API)

This caused the backlog to never catch up. 1 of the 4 CPUs was at 100% the entire time. (I understand that 50% is reserved for search queries, but I expected 2 CPUs to be used for indexing.)

On #1256 it was stated that more than 1M documents wasn't supported pre-0.21, but it should be now? Also, the indexing time should have improved, but according to my tests and a random person on the internet (https://twitter.com/msnegurski/status/1460705253033185285) it hasn't. Are these numbers out of scope for MeiliSearch? Is the indexation time expected, or have I misconfigured something? (I haven't actually configured anything except spinning up the Forge Meili server.) I understand batching is preferred, but that's not currently how Scout works. Also, a delay of 1 minute really stretches the usability (e.g. creating orders should result in the orders being visible in the orders index almost immediately).

I tested the same setup (same documents/indexes/attributes, also using 1-document updates, using Laravel Scout) with AWS OpenSearch (using the ElasticSearch Scout driver) on a similar instance (r6g.large.search, 2 CPU/16GB, same price as m5.xlarge), and the initial batch import took 15 minutes, while separate updates were indexed instantly (<1 sec) without any delays, under the same load as Meili. This makes me wonder whether: A. this workload is unrealistic for MeiliSearch, B. I'm doing something wrong, or C. this is a bug?

fibis commented 2 years ago

I tested the same setup (same documents/indexes/attributes, also using 1-document updates, using Laravel Scout) with AWS OpenSearch (using the ElasticSearch Scout driver) [...]

We have the same problem here. Using Meilisearch with Laravel Scout and a huge amount of data didn't work. @barryvdh, which ElasticSearch Scout driver did you use?

curquiza commented 2 years ago

Hello everyone reading this issue.

For clarity and simplicity, we decided to gather all the indexation issues and feedback in one place -> this discussion

Please leave any feedback in that discussion; it will be really appreciated πŸ™‚