Open dureuill opened 1 year ago
This issue seems directly related to LMDB rather than heed. @hyc, do you think you could look into it? It reproduces 100% of the time on macOS.
After creating the tmp_env directories, worked fine here on Linux. Will try Mac shortly.
Works fine here
hyc@Howards-MacBook-Pro xyz % ./prog
Starting tmp_env_0
Starting tmp_env_5
Starting tmp_env_6
Starting tmp_env_2
Starting tmp_env_7
Starting tmp_env_3
Starting tmp_env_4
Starting tmp_env_1
Starting tmp_env_8
Starting tmp_env_9
ok env
ok env
ok env
ok env
ok env
ok env
ok env
ok env
ok env
ok env
ok!
hyc@Howards-MacBook-Pro xyz % uname -a
Darwin Howards-MacBook-Pro.local 21.1.0 Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:24 PDT 2021; root:xnu-8019.41.5~1/RELEASE_ARM64_T8101 arm64
hyc@Howards-MacBook-Pro xyz % sw_vers
ProductName: macOS
ProductVersion: 12.0.1
BuildVersion: 21A559
Ah, sorry, I used 10 threads in the linked example, but you need at least 11 to see the issue (at least on Mac M1).
I updated the example in the repository and in the issue description above.
It's failing in LOCK_MUTEX, which defaults to using semop() on macOS. The manpage says:
[EINVAL] No semaphore set corresponds to semid, or the process would exceed the system-defined limit for the number of per-process SEM_UNDO structures.
So this appears to be an OS limitation. No idea if/how that's tunable, I leave that up to you.
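To check the per-process SEM_UNDO limit independently of LMDB, a minimal sketch along the following lines may help. It only uses the standard SysV calls semget(), semop() and semctl(); the helper name count_undo_sets and the cap of 64 sets are assumptions made for illustration, not anything taken from mdb.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define MAX_SETS 64 /* arbitrary cap for this sketch */

/* Hypothetical helper: creates up to `want` private semaphore sets and
 * performs one SEM_UNDO operation on each of them.
 * Returns how many sets accepted the operation before the first failure.
 * *err receives the errno of that failure, or 0 if nothing failed. */
int count_undo_sets(int want, int *err)
{
    int ids[MAX_SETS];
    int created = 0, ok = 0;

    assert(err != NULL);
    *err = 0;
    if (want > MAX_SETS)
        want = MAX_SETS;

    for (int i = 0; i < want; i++) {
        ids[i] = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        if (ids[i] == -1) {
            *err = errno;
            break;
        }
        created++;

        /* Incrementing never blocks; SEM_UNDO makes the kernel record
         * an undo entry for this process. */
        struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };
        if (semop(ids[i], &op, 1) == -1) {
            *err = errno;
            break;
        }
        ok++;
    }

    /* Remove every set we created so nothing leaks into the system. */
    for (int i = 0; i < created; i++)
        semctl(ids[i], 0, IPC_RMID);

    return ok;
}
```

Calling count_undo_sets(12, &err) from a small main() and printing the result together with strerror(err) should show whether the operation starts failing with EINVAL after a fixed number of sets. If the diagnosis above is right, that is what a Mac with the default limits would report, while a typical Linux box would accept all of them.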
Instead of using the default SysV semaphores, you can compile mdb.c with -DMDB_USE_POSIX_SEM and this problem goes away. Unfortunately, POSIX semaphores on macOS aren't robust: killing a process that holds a semaphore will leave it locked.
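The robustness caveat can be illustrated without LMDB. The sketch below is an assumption-laden demo rather than LMDB code: the function name and the semaphore name are made up. It opens a named POSIX semaphore, lets a child process take it and exit without releasing it, then checks from the parent whether it is still locked.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo. Returns 1 if the semaphore is still locked after
 * its holder exited, 0 if it was released, -1 if the setup failed. */
int posix_sem_stays_locked(void)
{
    char name[32];
    int n = snprintf(name, sizeof name, "/sem_demo_%ld", (long)getpid());
    assert(n > 0 && (size_t)n < sizeof name);

    /* Binary semaphore, initially unlocked (value 1). */
    sem_t *sem = sem_open(name, O_CREAT | O_EXCL, 0600, 1);
    if (sem == SEM_FAILED)
        return -1;
    sem_unlink(name); /* the open handle stays valid after unlinking */

    pid_t pid = fork();
    if (pid < 0) {
        sem_close(sem);
        return -1;
    }
    if (pid == 0) {
        sem_wait(sem); /* take the lock ... */
        _exit(0);      /* ... and exit without releasing it */
    }
    waitpid(pid, NULL, 0);

    /* With no undo mechanism, the count stays at 0 and trywait fails. */
    int locked = (sem_trywait(sem) == -1 && errno == EAGAIN);
    sem_close(sem);
    return locked;
}
```

SysV semaphores avoid this through SEM_UNDO, which is exactly the resource that runs out here, so the two options trade one limitation for the other. On older glibc versions the program may need to be linked with -pthread.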
And thus, if I understood correctly, limiting our number of threads to 10 would work?
I suppose so. Or just stop creating so many environments. Why are you using one environment per thread?
Meilisearch is not specifically creating one environment per thread, but our tests run in parallel, which produces this behavior. So the easy fix is to reduce the number of tests that run at the same time in CI.
The real issue is that on Arch Linux we hit another problem that returns an os error 22: we can't create any Meilisearch index on this OS. We will see whether the minimal reproducible example above is the one that triggers this bug. We thought it was the same bug.
> the tests are run in parallel
Or run tests in separate processes instead of separate threads.
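A rough sketch of that suggestion, assuming the limit is counted per process as the manpage excerpt indicates: fork one child per environment, so each child only ever holds its own undo entries. The names run_in_processes and sample_job are hypothetical; a real job would open its environment (for instance with mdb_env_open()) where the placeholder comment sits.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_JOBS 64 /* arbitrary cap for this sketch */

typedef int (*env_job)(int index);

/* Placeholder job: a real one would open and use environment `index`
 * here and return 0 on success. */
int sample_job(int index)
{
    return index >= 0 ? 0 : 1;
}

/* Runs job(0..n-1), each in its own child process.
 * Returns how many children exited with status 0. */
int run_in_processes(int n, env_job job)
{
    pid_t pids[MAX_JOBS];
    int started = 0, ok = 0;

    assert(job != NULL);
    if (n > MAX_JOBS)
        n = MAX_JOBS;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break; /* could not start more children */
        if (pid == 0)
            _exit(job(i) == 0 ? 0 : 1); /* child: run the job and leave */
        pids[started++] = pid;
    }

    for (int i = 0; i < started; i++) {
        int status = 0;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

For example, run_in_processes(12, sample_job) starts twelve children where the thread-based reproducer would start twelve threads inside one process.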
Thank you for your insight, hyc. Pinpointing which resource is limited was really helpful.
Some points about this issue:
- lmdb-rs build issue (uses system lmdb instead of vendored lmdb, see this comment for more information).
- ipcs -S (shell command) displays the various IPC-related limits on macOS. The problematic one is semume: 10 (max # of undo entries per process).
- This should be documented in heed as an OS-specific behavior, as heed users are not expected to read the code of lmdb to find out that it uses OS resources that exist in a surprisingly limited number per process (the default value on macOS is 10). I will open a PR on heed later, suggesting an initial wording.
Bug description
When running the following example using heed, the unwrap calls fail with error 22 (invalid argument):
(see the associated repository for more information)
Raw lmdb reproducer
The issue can be further minimized in C, directly using the master branch (not master3) of lmdb instead of heed, with the following:
(see the associated repository for more information)
Likely related to https://github.com/meilisearch/meilisearch/issues/3017