
Refactor of the data extractors used during indexing #656

Open · loiclec opened 2 years ago

loiclec commented 2 years ago

This is still a mostly experimental, unpolished PR. It refactors the data extractors used during indexing.

The main ideas are to:

  1. run all extractors together in the same threads, instead of running each one separately
  2. give the documents to the extractors one by one, instead of in 4MB chunks
  3. get rid of rayon and use std::thread::scope and std::thread::spawn instead (a rough sketch of this design follows the list)
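
A rough sketch of these three ideas, assuming a simple channel-based work queue; the names `index_documents` and `extract_all` are hypothetical placeholders, not the PR's actual code:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical entry point: feed `documents` to `num_threads` workers that
/// each run *every* extractor on each document they receive.
fn index_documents(documents: Vec<String>, num_threads: usize) {
    let (sender, receiver) = mpsc::channel::<String>();
    // `mpsc::Receiver` is not `Sync`, so the workers share it behind a mutex.
    let receiver = Arc::new(Mutex::new(receiver));

    thread::scope(|scope| {
        for _ in 0..num_threads {
            let receiver = Arc::clone(&receiver);
            scope.spawn(move || loop {
                // The guard is a temporary of this `let` statement, so the
                // lock is released as soon as `recv` returns.
                let next = receiver.lock().unwrap().recv();
                match next {
                    Ok(document) => extract_all(&document),
                    Err(_) => break, // channel closed: no documents left
                }
            });
        }

        // Hand documents to the workers one by one instead of in 4MB chunks.
        for document in documents {
            sender.send(document).unwrap();
        }
        // Dropping the sender closes the channel once it is drained, and
        // `thread::scope` joins every worker before returning.
        drop(sender);
    });
}

/// Hypothetical stand-in for running all extractors (word docids,
/// word pair proximity, facets, ...) on a single document.
fn extract_all(_document: &str) {}
```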

This approach should:

  1. Ease the pressure on the file system by writing and reading fewer files. This is because the Sorters used by the extractors will be able to process larger batches of documents before flushing their content to disk.
  2. Be able to fully use the extra memory and CPU cores of performant machines, even on smaller datasets. I also think we could see large benefits for single-threaded machines.
  3. Simplify the code by making its control flow easier to follow
  4. Be able to reuse the work done by one extractor in another one
  5. Remove rayon, which drops a large dependency and makes the indexing code much easier to profile using tools such as flamegraphs and Instruments.app

However, on machines with a very small amount of RAM, indexing documents that contain many words could result in more file system pressure. This is because the extractors now have to share the same pool of memory, so the word_pair_proximity extractor in particular could exhaust its allocated memory faster, which would make it flush its content to disk more often.

Remaining work to be done

Benchmark results

The benchmarks so far are satisfactory. We gain up to ~20% in the movies benchmarks, probably due to the increased parallelism. In the wiki benchmark, we gain ~10%, probably because we read, write, and merge fewer grenad::Reader files. The geo and songs datasets perform slightly worse (3–10% slower), which I think could be due to the unfair memory allocation between extractors.

However, for this PR, it won't be enough to rely on these benchmarks, since performance varies a lot with the OS, the number of threads, and the amount of memory the machine has. For example, on my computer (MacBook Pro M1, 16GB RAM), indexing the wiki dataset is 4 times faster. There are probably also some configurations that cause greater performance losses.

I also expect the performance benefits will grow after https://github.com/meilisearch/milli/pull/639 is merged.

```
group                                                                     indexing_extractor-refactor_f16a8bd0    indexing_main_a18de9b5
-----                                                                     ------------------------------------    ----------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00     36.3±5.71ms        ? ?/sec     1.18     43.0±9.48ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.05      9.7±2.64ms        ? ?/sec     1.00      9.3±3.04ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.00     12.0±2.28ms        ? ?/sec     1.18     14.2±4.57ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00     44.1±6.99ms        ? ?/sec     1.22    53.8±13.74ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.00   266.6±17.41ms        ? ?/sec     1.02   271.0±18.00ms        ? ?/sec
indexing/Indexing geo_point                                               1.03      49.2±1.06s        ? ?/sec     1.00      47.9±0.24s        ? ?/sec
indexing/Indexing movies in three batches                                 1.00      10.6±0.24s        ? ?/sec     1.21      12.8±0.12s        ? ?/sec
indexing/Indexing movies with default settings                            1.00       8.7±0.38s        ? ?/sec     1.19      10.4±0.11s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.00       7.7±0.07s        ? ?/sec     1.04       8.0±0.19s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.00       7.2±0.12s        ? ?/sec     1.03       7.4±0.11s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.11      53.5±1.79s        ? ?/sec     1.00      48.3±0.64s        ? ?/sec
indexing/Indexing songs with default settings                             1.02      46.9±0.53s        ? ?/sec     1.00      45.9±1.05s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      43.4±0.82s        ? ?/sec     1.00      43.5±0.56s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.01      46.0±0.90s        ? ?/sec     1.00      45.6±0.75s        ? ?/sec
indexing/Indexing wiki                                                    1.00     742.9±4.91s        ? ?/sec     1.21     901.2±8.34s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.00     927.3±4.02s        ? ?/sec     1.02     949.8±9.12s        ? ?/sec
indexing/Reindexing geo_point                                             1.02      16.1±0.30s        ? ?/sec     1.00      15.7±0.05s        ? ?/sec
indexing/Reindexing movies with default settings                          1.13   304.8±27.56ms        ? ?/sec     1.00   270.6±20.98ms        ? ?/sec
indexing/Reindexing songs with default settings                           1.08       4.6±0.15s        ? ?/sec     1.00       4.2±0.09s        ? ?/sec
indexing/Reindexing wiki                                                  1.00    1386.5±2.29s        ? ?/sec     1.10   1518.6±20.58s        ? ?/sec
```

Proposed solution to allocate memory fairly between extractors

I'd like to introduce a new data structure, called SorterPool, which manages N grenad::Sorter instances and allocates memory fairly between them.

Initially, they each receive MEM / N memory. When any of the sorters in the pool exceeds its allocated memory, it flushes its content to disk as a chunk. At that point, we tell the other sorters to also flush their content, and we record how much memory each sorter used. Then we reallocate a different amount of memory to each sorter based on its previous memory usage.
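
A minimal sketch of such a pool, assuming a proportional reallocation rule (the text above only says "based on its previous memory usage") and a hypothetical `MemorySorter` standing in for `grenad::Sorter`, whose real API differs:

```rust
/// Hypothetical stand-in for `grenad::Sorter`, reduced to the two
/// operations the pool needs.
struct MemorySorter {
    used: usize, // bytes currently buffered in memory
}

impl MemorySorter {
    fn new() -> Self {
        Self { used: 0 }
    }

    fn insert(&mut self, key: &[u8], value: &[u8]) {
        self.used += key.len() + value.len();
    }

    /// The real sorter would write a sorted chunk to disk here.
    fn flush_to_disk(&mut self) -> usize {
        std::mem::take(&mut self.used)
    }
}

struct SorterPool {
    sorters: Vec<MemorySorter>,
    budgets: Vec<usize>, // per-sorter memory budget, in bytes
    total_memory: usize,
}

impl SorterPool {
    fn new(n: usize, total_memory: usize) -> Self {
        Self {
            sorters: (0..n).map(|_| MemorySorter::new()).collect(),
            budgets: vec![total_memory / n; n], // MEM / N to start
            total_memory,
        }
    }

    fn insert(&mut self, sorter: usize, key: &[u8], value: &[u8]) {
        self.sorters[sorter].insert(key, value);
        if self.sorters[sorter].used > self.budgets[sorter] {
            self.flush_all_and_rebalance();
        }
    }

    /// One sorter exceeded its budget: flush every sorter, then split the
    /// total memory proportionally to what each one actually used.
    fn flush_all_and_rebalance(&mut self) {
        let usages: Vec<usize> = self
            .sorters
            .iter_mut()
            .map(|sorter| sorter.flush_to_disk())
            .collect();
        let total_used = usages.iter().sum::<usize>().max(1);
        for (budget, used) in self.budgets.iter_mut().zip(&usages) {
            *budget = self.total_memory * *used / total_used;
        }
    }
}
```

A real implementation would probably also keep a floor on each budget, so that a momentarily idle sorter is not starved down to zero.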

Example

We have 4 sorters, S1..S4, and access to 4GB of memory. Initially, each sorter has access to 1GB of memory.

We start inserting elements into each sorter. After a while, one of the sorters, say S1, exceeds its 1GB budget.

After S1 flushes its content to disk, we also ask S2..S4 to do the same. Then we reallocate the 4GB of memory in proportion to how much each sorter actually used.
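
For illustration, with hypothetical numbers: suppose that at flush time S1 had used its full 1GB, while S2, S3, and S4 had used 0.5GB, 0.25GB, and 0.25GB, for a total of 2GB. Reallocating the 4GB proportionally gives S1 = 4GB × (1.0/2.0) = 2GB, S2 = 4GB × (0.5/2.0) = 1GB, and S3 = S4 = 4GB × (0.25/2.0) = 0.5GB: the sorter that filled up fastest gets the largest budget for the next round.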

curquiza commented 1 year ago

@loiclec is this PR still relevant?

loiclec commented 1 year ago

@curquiza Yes, it's going to sit here for a very long time probably, but eventually I'll go back to it :)

curquiza commented 1 year ago

I recommend rebasing, otherwise you will never have the strength and motivation to get back to it 😇

loiclec commented 1 year ago

Oh yes, I know haha. I've made my peace with the fact that I'll need to rewrite it from scratch when I get back to it. The idea behind the PR is more important than its details :)