
Refactor of the data extractors used during indexing #656

Open · loiclec opened 2 years ago

loiclec commented 2 years ago

This is still a mostly experimental, unpolished PR. It refactors the data extractors used during indexing.

The main ideas are to:

  1. run all extractors together in the same threads, instead of running each one separately
  2. give the documents to the extractors one by one, instead of in 4MB chunks
  3. get rid of rayon and use std::thread::scope and std::thread::spawn instead (a rough sketch of this design follows the list)
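
A rough sketch of these three ideas, assuming a simple channel-based work queue; the names `index_documents` and `extract_all` are hypothetical placeholders, not the PR's actual code:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical entry point: feed `documents` to `num_threads` workers that
/// each run *every* extractor on each document they receive.
fn index_documents(documents: Vec<String>, num_threads: usize) {
    let (sender, receiver) = mpsc::channel::<String>();
    // `mpsc::Receiver` is not `Sync`, so the workers share it behind a mutex.
    let receiver = Arc::new(Mutex::new(receiver));

    thread::scope(|scope| {
        for _ in 0..num_threads {
            let receiver = Arc::clone(&receiver);
            scope.spawn(move || loop {
                // The guard is a temporary of this `let` statement, so the
                // lock is released as soon as `recv` returns.
                let next = receiver.lock().unwrap().recv();
                match next {
                    Ok(document) => extract_all(&document),
                    Err(_) => break, // channel closed: no documents left
                }
            });
        }

        // Hand documents to the workers one by one instead of in 4MB chunks.
        for document in documents {
            sender.send(document).unwrap();
        }
        // Dropping the sender closes the channel once it is drained, and
        // `thread::scope` joins every worker before returning.
        drop(sender);
    });
}

/// Hypothetical stand-in for running all extractors (word docids,
/// word pair proximity, facets, ...) on a single document.
fn extract_all(_document: &str) {}
```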

This approach should:

  1. Ease the pressure on the file system by writing and reading fewer files. This is because the Sorters used by the extractors will be able to process larger batches of documents before flushing their content to disk.
  2. Be able to fully use the extra memory and CPU cores of performant machines, even on smaller datasets. I also think we could see large benefits for single-threaded machines.
  3. Simplify the code by making its control flow easier to follow
  4. Be able to reuse the work done by one extractor in another one
  5. Remove rayon, which drops a large dependency and makes the indexing code much easier to profile using tools such as flamegraphs and Instruments.app

However, on machines with a very small amount of RAM, indexing documents that contain many words could result in more file system pressure. This is because the extractors now have to share the same pool of memory, so the word_pair_proximity extractor in particular could exhaust its allocated memory faster, which would make it flush its content to disk more often.

Remaining work to be done

Benchmark results

The benchmarks so far are satisfactory. We gain up to ~20% in the movies benchmarks, probably due to the increased parallelism. In the wiki benchmark, we gain ~10%, probably because we read, write, and merge fewer grenad::Reader files. The geo and songs datasets perform slightly worse (3–10% slower), which I think could be due to the unfair memory allocation between extractors.

However, for this PR, it won't be enough to rely on these benchmarks, since performance varies a lot with the OS, the number of threads, and the amount of memory the machine has. For example, on my computer (MacBook Pro M1, 16GB RAM), indexing the wiki dataset is 4 times faster. There are probably also some configurations that cause greater performance losses.

I also expect the performance benefits will grow after https://github.com/meilisearch/milli/pull/639 is merged.

```
group                                                                     indexing_extractor-refactor_f16a8bd0    indexing_main_a18de9b5
-----                                                                     ------------------------------------    ----------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00     36.3±5.71ms        ? ?/sec     1.18     43.0±9.48ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.05      9.7±2.64ms        ? ?/sec     1.00      9.3±3.04ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.00     12.0±2.28ms        ? ?/sec     1.18     14.2±4.57ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00     44.1±6.99ms        ? ?/sec     1.22    53.8±13.74ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.00   266.6±17.41ms        ? ?/sec     1.02   271.0±18.00ms        ? ?/sec
indexing/Indexing geo_point                                               1.03      49.2±1.06s        ? ?/sec     1.00      47.9±0.24s        ? ?/sec
indexing/Indexing movies in three batches                                 1.00      10.6±0.24s        ? ?/sec     1.21      12.8±0.12s        ? ?/sec
indexing/Indexing movies with default settings                            1.00       8.7±0.38s        ? ?/sec     1.19      10.4±0.11s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.00       7.7±0.07s        ? ?/sec     1.04       8.0±0.19s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.00       7.2±0.12s        ? ?/sec     1.03       7.4±0.11s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.11      53.5±1.79s        ? ?/sec     1.00      48.3±0.64s        ? ?/sec
indexing/Indexing songs with default settings                             1.02      46.9±0.53s        ? ?/sec     1.00      45.9±1.05s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      43.4±0.82s        ? ?/sec     1.00      43.5±0.56s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.01      46.0±0.90s        ? ?/sec     1.00      45.6±0.75s        ? ?/sec
indexing/Indexing wiki                                                    1.00     742.9±4.91s        ? ?/sec     1.21     901.2±8.34s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.00     927.3±4.02s        ? ?/sec     1.02     949.8±9.12s        ? ?/sec
indexing/Reindexing geo_point                                             1.02      16.1±0.30s        ? ?/sec     1.00      15.7±0.05s        ? ?/sec
indexing/Reindexing movies with default settings                          1.13   304.8±27.56ms        ? ?/sec     1.00   270.6±20.98ms        ? ?/sec
indexing/Reindexing songs with default settings                           1.08       4.6±0.15s        ? ?/sec     1.00       4.2±0.09s        ? ?/sec
indexing/Reindexing wiki                                                  1.00    1386.5±2.29s        ? ?/sec     1.10   1518.6±20.58s        ? ?/sec
```

Proposed solution to allocate memory fairly between extractors

I'd like to introduce a new data structure, called SorterPool, which manages N grenad::Sorter instances and allocates memory fairly between them.

Initially, they each receive MEM / N memory. When any of the sorters in the pool exceeds its allocated memory, it flushes its content to disk as a chunk. At that point, we tell the other sorters to also flush their content, and we record how much memory each sorter used. Then we reallocate a different amount of memory to each sorter based on its previous memory usage.
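
A minimal sketch of such a pool, assuming a proportional reallocation rule (the text above only says "based on its previous memory usage") and a hypothetical `MemorySorter` standing in for `grenad::Sorter`, whose real API differs:

```rust
/// Hypothetical stand-in for `grenad::Sorter`, reduced to the two
/// operations the pool needs.
struct MemorySorter {
    used: usize, // bytes currently buffered in memory
}

impl MemorySorter {
    fn new() -> Self {
        Self { used: 0 }
    }

    fn insert(&mut self, key: &[u8], value: &[u8]) {
        self.used += key.len() + value.len();
    }

    /// The real sorter would write a sorted chunk to disk here.
    fn flush_to_disk(&mut self) -> usize {
        std::mem::take(&mut self.used)
    }
}

struct SorterPool {
    sorters: Vec<MemorySorter>,
    budgets: Vec<usize>, // per-sorter memory budget, in bytes
    total_memory: usize,
}

impl SorterPool {
    fn new(n: usize, total_memory: usize) -> Self {
        Self {
            sorters: (0..n).map(|_| MemorySorter::new()).collect(),
            budgets: vec![total_memory / n; n], // MEM / N to start
            total_memory,
        }
    }

    fn insert(&mut self, sorter: usize, key: &[u8], value: &[u8]) {
        self.sorters[sorter].insert(key, value);
        if self.sorters[sorter].used > self.budgets[sorter] {
            self.flush_all_and_rebalance();
        }
    }

    /// One sorter exceeded its budget: flush every sorter, then split the
    /// total memory proportionally to what each one actually used.
    fn flush_all_and_rebalance(&mut self) {
        let usages: Vec<usize> = self
            .sorters
            .iter_mut()
            .map(|sorter| sorter.flush_to_disk())
            .collect();
        let total_used = usages.iter().sum::<usize>().max(1);
        for (budget, used) in self.budgets.iter_mut().zip(&usages) {
            *budget = self.total_memory * *used / total_used;
        }
    }
}
```

A real implementation would probably also keep a floor on each budget, so that a momentarily idle sorter is not starved down to zero.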

Example

We have 4 sorters, S1..S4, and access to 4GB of memory. Initially, each sorter has access to 1GB of memory.

We start inserting elements into each sorter. After a while, one of the sorters, say S1, exceeds its 1GB budget.

After S1 flushes its content to disk, we also ask S2..S4 to do the same. Then we reallocate the 4GB of memory in proportion to how much each sorter actually used.
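
For illustration, with hypothetical numbers: suppose that at flush time S1 had used its full 1GB, while S2, S3, and S4 had used 0.5GB, 0.25GB, and 0.25GB, for a total of 2GB. Reallocating the 4GB proportionally gives S1 = 4GB × (1.0/2.0) = 2GB, S2 = 4GB × (0.5/2.0) = 1GB, and S3 = S4 = 4GB × (0.25/2.0) = 0.5GB: the sorter that filled up fastest gets the largest budget for the next round.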

curquiza commented 1 year ago

@loiclec is this PR still relevant?

loiclec commented 1 year ago

@curquiza Yes, it's going to sit here for a very long time probably, but eventually I'll go back to it :)

curquiza commented 1 year ago

I recommend rebasing, otherwise you will never have the strength and motivation to get back to it 😇

loiclec commented 1 year ago

Oh yes, I know haha. I've made my peace with the fact that I'll need to rewrite it from scratch when I get back to it. The idea behind the PR is more important than its details :)