Be more resilient when opening too many files

We use grenad in Meilisearch, and we often hit the "too many open files" error (os error 24), which stops the whole indexing process. I want to propose a change to the way the grenad sorter currently works.

How does it work now?

1. The sorter allocates a big in-memory buffer.
2. If we can insert entries into the in-memory buffer, we do and return to 2. Otherwise, we go to 3.
3. If there are fewer than 25 on-disk chunks, we create a new file, dump the buffer into it, and go to 2. Otherwise, we go to 4.
4. We create a new file and merge the 25 files into the 26th one, then delete the 25 merged files. Return to 2.

In this configuration, if error 24 is raised, we cannot do anything apart from returning the error to the caller. The reason is that the in-memory buffer is full, so we cannot accept any new entry, and we cannot write the buffer's content into an existing chunk file, as that file's content would then be unordered.

https://github.com/meilisearch/grenad/blob/46e5e27a8ff328c22ded55278022dfd22ae32b09/src/sorter.rs#L470-L625
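To make the failure mode concrete, here is a minimal toy model of the flow described above. It is only a sketch and not grenad's actual code: entries are plain `u32` values, chunk "files" are sorted `Vec`s, and the hypothetical `open_files_left` counter stands in for the process's file descriptor limit. The names `Sorter`, `TooManyOpenFiles`, `create_file`, and `flush` are assumptions made for illustration.

```rust
/// Stand-in for os error 24 (hypothetical type, not part of grenad).
#[derive(Debug, PartialEq)]
pub struct TooManyOpenFiles;

/// Toy model of the current sorter. Chunk "files" are sorted `Vec`s.
pub struct Sorter {
    pub buffer: Vec<u32>,
    pub buffer_capacity: usize,
    pub chunks: Vec<Vec<u32>>,
    pub max_chunks: usize,      // 25 in the description above
    pub open_files_left: usize, // simulated file descriptor budget
}

impl Sorter {
    /// Step 1: allocate the in-memory buffer. No file is opened yet.
    pub fn new(buffer_capacity: usize, max_chunks: usize, open_files_left: usize) -> Self {
        Sorter { buffer: Vec::new(), buffer_capacity, chunks: Vec::new(), max_chunks, open_files_left }
    }

    /// Simulates creating a file: fails once the descriptor budget is spent.
    fn create_file(&mut self) -> Result<(), TooManyOpenFiles> {
        if self.open_files_left == 0 {
            return Err(TooManyOpenFiles);
        }
        self.open_files_left -= 1;
        Ok(())
    }

    /// Step 2: insert into the buffer if there is room, otherwise flush first.
    pub fn insert(&mut self, value: u32) -> Result<(), TooManyOpenFiles> {
        if self.buffer.len() >= self.buffer_capacity {
            self.flush()?;
        }
        self.buffer.push(value);
        Ok(())
    }

    fn flush(&mut self) -> Result<(), TooManyOpenFiles> {
        if self.chunks.len() >= self.max_chunks {
            // Step 4: merging needs a brand-new file, so it can fail with error 24.
            self.create_file()?;
            let deleted = self.chunks.len();
            let mut merged: Vec<u32> = self.chunks.drain(..).flatten().collect();
            merged.sort_unstable();
            self.chunks.push(merged);
            self.open_files_left += deleted; // deleted files release their descriptors
        }
        // Step 3: dumping the buffer also needs a new file. If this fails, the
        // buffer stays full and the only option is to propagate the error.
        self.create_file()?;
        self.buffer.sort_unstable();
        let dumped = std::mem::take(&mut self.buffer);
        self.chunks.push(dumped);
        Ok(())
    }
}
```

With a budget of one descriptor, the first flush succeeds and the second one fails: the buffer is still full, the rejected entry is lost to the caller, and nothing in the sorter can make progress.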

How can we improve that?

1. The sorter allocates a big in-memory buffer and one backup file.
2. If we can insert entries into the in-memory buffer, we do and return to 2. Otherwise, we go to 3.
3. If there are fewer than 25 on-disk chunks, we create a new file, dump the buffer into it, and go to 2. Otherwise, we go to 4.
4. We merge the 24 files into the backup file, then delete all of them but one, which becomes the new backup file. Return to 2.

In this configuration, if error 24 is raised, we can still merge the chunk files together into the backup file, dump the buffer's content into one of the emptied chunk files, and keep another of the chunk files as the new backup file, dropping the others. The only moment we are not resilient to error 24 is at step 1, at creation time.
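The proposed flow can be sketched with a small toy model. This is an illustration under stated assumptions, not grenad's implementation: entries are plain `u32` values, chunk "files" are sorted `Vec`s, and the hypothetical `open_files_left` counter simulates the descriptor limit. The names `Sorter`, `TooManyOpenFiles`, `create_file`, and `merge_into_backup` are invented for the example. The recovery path needs at least two existing chunks (one to receive the buffer, one to become the new backup); the proposal does not say what should happen otherwise, so the sketch simply propagates the error in that case.

```rust
/// Stand-in for os error 24 (hypothetical type, not part of grenad).
#[derive(Debug, PartialEq)]
pub struct TooManyOpenFiles;

/// Toy model of the proposed sorter. Chunk "files" are sorted `Vec`s, and one
/// extra descriptor is always held for the reserved backup file.
pub struct Sorter {
    pub buffer: Vec<u32>,
    pub buffer_capacity: usize,
    pub chunks: Vec<Vec<u32>>,
    pub max_chunks: usize,      // assumed to be at least 1
    pub open_files_left: usize, // simulated file descriptor budget
}

impl Sorter {
    /// Step 1: allocate the buffer and reserve the backup file.
    /// This is the only place where error 24 cannot be recovered from.
    pub fn new(buffer_capacity: usize, max_chunks: usize, open_files_left: usize) -> Result<Self, TooManyOpenFiles> {
        let mut sorter = Sorter { buffer: Vec::new(), buffer_capacity, chunks: Vec::new(), max_chunks, open_files_left };
        sorter.create_file()?;
        Ok(sorter)
    }

    /// Simulates creating a file: fails once the descriptor budget is spent.
    fn create_file(&mut self) -> Result<(), TooManyOpenFiles> {
        if self.open_files_left == 0 {
            return Err(TooManyOpenFiles);
        }
        self.open_files_left -= 1;
        Ok(())
    }

    /// Merges every chunk into the backup file without opening anything new.
    /// `reuse` of the emptied chunk files stay open; the rest are deleted.
    /// Callers must ensure `reuse <= self.chunks.len()`.
    fn merge_into_backup(&mut self, reuse: usize) {
        let emptied = self.chunks.len();
        let mut merged: Vec<u32> = self.chunks.drain(..).flatten().collect();
        merged.sort_unstable();
        self.chunks.push(merged); // the old backup file now holds the merged chunk
        self.open_files_left += emptied - reuse;
    }

    /// Step 2: insert into the buffer if there is room, otherwise flush first.
    pub fn insert(&mut self, value: u32) -> Result<(), TooManyOpenFiles> {
        if self.buffer.len() >= self.buffer_capacity {
            self.flush()?;
        }
        self.buffer.push(value);
        Ok(())
    }

    fn flush(&mut self) -> Result<(), TooManyOpenFiles> {
        if self.chunks.len() >= self.max_chunks {
            // Step 4: one emptied chunk file becomes the new backup file.
            self.merge_into_backup(1);
        }
        // Step 3: try to create a new file for the buffer.
        if let Err(err) = self.create_file() {
            if self.chunks.len() < 2 {
                return Err(err); // assumption: nothing to recycle, give up
            }
            // Recovery: merge into the backup, then keep two emptied files,
            // one for the buffer dump and one as the new backup.
            self.merge_into_backup(2);
        }
        self.buffer.sort_unstable();
        let dumped = std::mem::take(&mut self.buffer);
        self.chunks.push(dumped);
        Ok(())
    }
}
```

In this model, a sorter created with three descriptors spends one on the backup and two on chunks. The third flush hits the simulated error 24, merges the two chunks into the backup, recycles the emptied files, and insertion continues.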
