Open loiclec opened 2 years ago
@loiclec is this PR still relevant?
@curquiza Yes, it's going to sit here for a very long time probably, but eventually I'll go back to it :)
I recommend rebasing it, otherwise you will never have the energy and motivation to get back to it 😇
Oh yes I know haha. I've made my peace with the fact I'll need to rewrite it from scratch when I get back to it. The idea behind the PR is more important than the details of it :)
This is mostly still an experimental, unpolished PR. It refactors the data extractors used during indexing.

The main ideas are to:

- stop using `rayon` and use `std::thread::scope` and `std::thread::spawn` instead

The advantages of this approach should be:

- the `Sorter`s used by the extractors will be able to process larger batches of documents before flushing their content to disk
- we no longer depend on `rayon`, which removes a large dependency and makes the indexing code much easier to profile using tools such as flamegraphs and Instruments.app

However, on machines with a very small amount of RAM, indexing documents which have many words in them could result in more file system pressure. This is because the extractors now have to share the same pool of memory between them, and so the `word_pair_proximity` extractor in particular could use up all of its allocated memory faster, which would make it flush its content to disk more often.

Remaining work to be done
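To make the `std::thread::scope` idea concrete, here is a minimal sketch of the pattern: split the documents into one chunk per thread, let each scoped thread run an extractor over its chunk, then merge the per-thread results. The word-counting extractor here is a stand-in for illustration only, not the PR's actual extractors.

```rust
use std::collections::HashMap;
use std::thread;

// Run a toy "extractor" over document chunks using scoped threads.
// Scoped threads (stable since Rust 1.63) may borrow `documents`
// directly, with no 'static bound and no rayon dependency.
fn extract_word_counts(documents: &[&str], num_threads: usize) -> HashMap<String, usize> {
    let num_threads = num_threads.max(1);
    let chunk_size = ((documents.len() + num_threads - 1) / num_threads).max(1);
    let mut merged = HashMap::new();
    thread::scope(|s| {
        // Spawn one worker per chunk; collect the handles first so the
        // workers actually run concurrently before we join them.
        let handles: Vec<_> = documents
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut counts = HashMap::new();
                    for doc in chunk {
                        for word in doc.split_whitespace() {
                            *counts.entry(word.to_string()).or_insert(0) += 1;
                        }
                    }
                    counts
                })
            })
            .collect();
        // Merge the per-thread maps on the calling thread.
        for handle in handles {
            for (word, count) in handle.join().unwrap() {
                *merged.entry(word).or_insert(0) += count;
            }
        }
    });
    merged
}

fn main() {
    let docs = ["the quick brown fox", "the lazy dog", "the end"];
    let counts = extract_word_counts(&docs, 2);
    assert_eq!(counts["the"], 3);
    println!("{} distinct words", counts.len());
}
```

Because the whole fan-out/fan-in lives inside one `thread::scope` call, a profiler sees plain OS threads with readable stacks, which is what makes flamegraph/Instruments.app output easier to interpret than rayon's work-stealing pool.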
For now, the number of threads is hard-coded to 8 and the available memory is hardcoded to 8GB. This should of course be changed to use the actual number of threads and memory available during indexing.
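For the thread count, the standard library can report the machine's available parallelism directly. A possible sketch (the function name and the fallback to the current hard-coded value of 8 are my own choices, not the PR's code):

```rust
use std::thread;

// Replace the hard-coded thread count with the number of logical cores
// reported by the standard library (`available_parallelism`, stable
// since Rust 1.59). Falling back to 8 preserves the current hard-coded
// behaviour if detection fails.
fn indexing_thread_count() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(8)
}

fn main() {
    println!("indexing with {} threads", indexing_thread_count());
}
```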
The amount of memory that each extractor is allowed to use is also hardcoded. For example, we give half of the total memory to the `word_pair_proximity` extractor. But ideally, this number should be determined based on the actual amount that each extractor needs to process the specific dataset. For example, the `songs` and `geo` datasets do not have a lot of word pairs, so it is counterproductive to allocate half of the memory to the `word_pair_proximity` extractor for them.

Proper error handling should be reintroduced.
Benchmark results
The benchmarks so far are satisfactory. We gain up to ~20% in the `movies` benchmarks, probably due to the increased parallelism. In the `wiki` benchmark, we gain ~10%, probably because we read/write/merge fewer `grenad::Reader`s. The `geo` and `songs` datasets perform slightly worse (−3–10%), which I think could be due to the unfair memory allocation between extractors.

However, for this PR, it won't be enough to rely on these benchmarks, since the performance will change a lot based on the OS, the number of threads, and the memory that the machine has. For example, on my computer (MacBook Pro M1, 16GB RAM), indexing the `wiki` dataset is 4 times faster. There are probably some configurations which cause greater performance losses as well.

I also expect the performance benefits will grow after https://github.com/meilisearch/milli/pull/639 is merged.
Proposed solution to allocate memory fairly between extractors
I'd like to introduce a new data structure, called `SorterPool`, which can manage `N` `grenad::Sorter`s and allocate memory fairly between them.

Initially, they each receive `MEM / N` memory. When any of the sorters in the pool exceeds its allocated memory, it flushes its content in a chunk. At that point, we need to tell the other sorters to also flush their content. We record how much memory each sorter used. Then, we reallocate a different amount of memory to each sorter based on its previous memory usage.

Example
We have 4 sorters, S1..S4, and access to 4GB of memory. Initially, each sorter has access to 1GB of memory.
We start inserting elements in each sorter. After a while, we have this situation:
After S1 flushes its content to disk, we also ask S2..S4 to do the same. Then we reallocate the 4GB of memory as follows:
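The reallocation step could be sketched roughly as follows. The `reallocate` function and the purely proportional policy are my own illustrative assumptions, not the PR's final design; budgets here are in MB.

```rust
// Hypothetical sketch of the proposed `SorterPool` reallocation step:
// after every sorter has flushed, each one's next budget is proportional
// to how much of the pool it actually used in the previous round.
fn reallocate(total_mem: u64, previous_usage: &[u64]) -> Vec<u64> {
    let used: u64 = previous_usage.iter().sum();
    if used == 0 {
        // No data yet: fall back to the initial even MEM / N split.
        return vec![total_mem / previous_usage.len() as u64; previous_usage.len()];
    }
    previous_usage
        .iter()
        .map(|&u| total_mem * u / used)
        .collect()
}

fn main() {
    // A 4096 MB pool: S1 used 2048 MB when it forced the flush, while
    // S2..S4 had used far less. S1 now gets the lion's share.
    let budgets = reallocate(4096, &[2048, 512, 256, 256]);
    assert_eq!(budgets, vec![2730, 682, 341, 341]);
    println!("{budgets:?}");
}
```

Integer division means a few MB of the pool can go unassigned after a round; a real implementation would need to decide where that remainder goes, and whether a sorter's budget should have a floor so a quiet extractor is never starved entirely.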