typegoose / mongodb-memory-server

Manage & spin up mongodb server binaries with zero(or slight) configuration for tests.
https://typegoose.github.io/mongodb-memory-server/

Mongod sometimes fails to start due to ETXTBSY #872

Closed: Simon-Hayden-iteratec closed this issue 3 months ago

Simon-Hayden-iteratec commented 3 months ago

Versions

package: mongodb-memory-server (mongodb-memory-server-core@9.1.6, per the stack trace below)

What is the Problem?

We have a project with two integration tests using mongodb-memory-server with a replica set, because we are using Change Streams. When we run the tests on CI, we sometimes get this error (full logs below):

2024-05-15T07:08:36.6267264Z 2024-05-15T07:08:27.279Z MongoMS:MongoMemoryServer Mongo[unknown]: _startUpInstance threw a Error:  Error: spawn ETXTBSY
2024-05-15T07:08:36.6269314Z     at ChildProcess.spawn (node:internal/child_process:421:11)
2024-05-15T07:08:36.6270360Z     at spawn (node:child_process:761:9)
2024-05-15T07:08:36.6273299Z     at MongoInstance._launchMongod (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/util/MongoInstance.ts:504:31)
2024-05-15T07:08:36.6277477Z     at MongoInstance.start (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/util/MongoInstance.ts:392:31)
2024-05-15T07:08:36.6281617Z     at Function.create (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/util/MongoInstance.ts:292:5)
2024-05-15T07:08:36.6286043Z     at MongoMemoryServer._startUpInstance (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/MongoMemoryServer.ts:490:22)
2024-05-15T07:08:36.6290386Z     at MongoMemoryServer.start (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/MongoMemoryServer.ts:297:5)
2024-05-15T07:08:36.6293230Z     at async Promise.all (index 0)
2024-05-15T07:08:36.6296114Z     at MongoMemoryReplSet.initAllServers (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/MongoMemoryReplSet.ts:468:5)
2024-05-15T07:08:36.6300501Z     at MongoMemoryReplSet.start (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/MongoMemoryReplSet.ts:380:5)
2024-05-15T07:08:36.6304725Z     at Function.create (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/MongoMemoryReplSet.ts:193:5)
2024-05-15T07:08:36.6308684Z     at Object.<anonymous> (/home/runner/work/<redacted>/<redacted>/libs/backend/db/src/lib/mongodb/services/user-follow-entity/user-follow-entity.service.spec.ts:337:21) {
2024-05-15T07:08:36.6310935Z   errno: -26,
2024-05-15T07:08:36.6311790Z   code: 'ETXTBSY',
2024-05-15T07:08:36.6312428Z   syscall: 'spawn'
2024-05-15T07:08:36.6313120Z }

After some googling, the error code ETXTBSY is supposed to mean "Text file busy". So I assume there is some race condition that prevents mongod from being called/spawned twice in quick succession. However, I could not find any information about mongod + ETXTBSY.
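For illustration only (not part of the original report), here is a minimal Node/TypeScript sketch of how ETXTBSY can arise on Linux: execve() refuses to run a file that some process still has open for writing, which matches the "binary still being written while another process tries to spawn it" theory discussed below. The path and script are made up for the demo.

import { openSync, writeSync, chmodSync, closeSync } from 'node:fs';
import { spawnSync } from 'node:child_process';

const script = '/tmp/etxtbsy-demo.sh'; // hypothetical throwaway path
const fd = openSync(script, 'w');
writeSync(fd, '#!/bin/sh\necho hello\n');
chmodSync(script, 0o755);

// The write handle is still open, so spawning the file fails with ETXTBSY on Linux.
const busy = spawnSync(script);
console.log(busy.error?.message); // e.g. "spawn ETXTBSY"

// Once the handle is closed, the same spawn succeeds.
closeSync(fd);
const ok = spawnSync(script);
console.log(ok.stdout.toString()); // "hello"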

Because I assume this is a race condition, it is also hard to reproduce. Locally, this error has not happened on any dev machine yet. On CI, however, I'd say it happens in roughly 1/3 to 2/3 of runs. My assumption is that the CI cores are just slower and therefore leave a larger time window for the race condition to occur. Or maybe there is some odd OS issue with the runner, but I couldn't find anything describing an OS-level fix for this error either.

As a workaround, we are running the tests with the --runInBand flag to instruct Jest to run the test files one after another (see the config sketch below). This is okay-ish for us because we use Nx to manage the monorepo, so we still parallelize across Nx projects. However, if we ever add a second project that also needs to run MongoDB integration tests, I'd expect the error to return.
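For reference, a rough jest.config.ts equivalent of that workaround (a sketch, assuming a Jest 29 style TypeScript config): maxWorkers: 1 runs test files serially, much like --runInBand, which additionally runs them in the parent process instead of a worker.

import type { Config } from 'jest';

// Serialize test files; roughly what --runInBand does, minus the
// "run in the parent process instead of a worker" part.
const config: Config = {
  maxWorkers: 1,
};

export default config;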

Code Example

// Wish I could pin it down to a specific code example
// My best bet: Have multiple tests starting an in-memory MongoDB and run them in parallel.
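A hypothetical minimal reproduction along those lines (not a confirmed repro; the file layout and test body are made up): several spec files like this run in parallel Jest workers, each starting its own replica set, so multiple worker processes can race on the binary download and extraction.

import { MongoMemoryReplSet } from 'mongodb-memory-server';

describe('change-stream integration', () => {
  let replSet: MongoMemoryReplSet;

  beforeAll(async () => {
    // Each Jest worker process spins up its own in-memory replica set.
    replSet = await MongoMemoryReplSet.create({ replSet: { count: 1 } });
  });

  afterAll(async () => {
    await replSet.stop();
  });

  it('provides a connection string', () => {
    expect(replSet.getUri()).toContain('mongodb://');
  });
});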

Debug Output

The logs are quite verbose:

https://gist.github.com/Simon-Hayden-iteratec/93a2a822a8a081d987b4b427061adac4

Something I noticed is that there appear to be multiple downloads of the mongod binary. Maybe the issue is also caused by process A still downloading the binary while process B tries to start mongod?
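One way to test that theory (an assumption on my part, not something from this thread) would be a small CI warm-up step that forces the download exactly once before the parallel test run, using only the public API:

import { MongoMemoryServer } from 'mongodb-memory-server';

// Start and immediately stop one instance; this downloads and extracts the
// mongod binary if it is not cached yet, so later parallel workers find it.
async function warmUp(): Promise<void> {
  const server = await MongoMemoryServer.create();
  await server.stop();
}

warmUp().catch((err) => {
  console.error(err);
  process.exit(1);
});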

Do you know why it happens?

Sadly no

hasezoey commented 3 months ago

Because I assume this is a race condition, it is also hard to reproduce

it is, and my best guess is that the following happens (according to the logs):

  1. worker 1 tries to start, does not find the binary, acquires the download lock and downloads
  2. worker 2 tries to start, does not find the binary, tries to acquire the download lock, but it already exists, so it waits until the download is done
  3. worker 1 starts to extract the binary
  4. worker 3 tries to start, does find the binary, and tries to use it (but it is still incomplete, so ETXTBSY)
  5. worker 1 finishes extracting the binary

this is at least my best guess from the logs. currently the checks are:

  1. if file exists, try to use it
  2. if file does not exist, look for the download lock

i guess i need to add a check to see if the download lock still exists before trying to use the binary
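As a rough sketch of that check (purely illustrative, not the library's actual code; the helper names are made up), the idea is to wait on the download lock even when the binary file already exists on disk:

import { promises as fs } from 'node:fs';
import { setTimeout as sleep } from 'node:timers/promises';

const exists = (p: string) => fs.access(p).then(() => true, () => false);

// Illustrative only: even if the binary is already on disk, another worker may
// still be extracting it; the download lock is the tell-tale sign, so wait for
// it to disappear before using the binary.
async function waitForUsableBinary(binaryPath: string, lockPath: string): Promise<string> {
  while (await exists(lockPath)) {
    await sleep(100);
  }
  if (await exists(binaryPath)) {
    return binaryPath;
  }
  throw new Error('binary missing; the real code would fall back to downloading here');
}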

Simon-Hayden-iteratec commented 3 months ago

Would be amazing if this is the issue and you could fix it :smile:

Though I wonder why we are the only ones reporting this. Since the race condition you're describing could have happened to anybody, there should have been at least some reports already?

Anyway, thanks a lot for your fast reply :)

hasezoey commented 3 months ago

Though I wonder why we are the only ones reporting this. Since the race condition you're describing could have happened to anybody, there should have been at least some reports already?

i guess the extraction part is quite quick and so is basically "atomic" on most other systems (the archive is about ~70 MB and the extracted binary around ~130 MB to ~180 MB). also, the instances are mostly run in the same process instead of in separate processes (e.g. via globalSetup instead of in each setupFile / beforeAll, thanks to how jest works), because if they are in the same process they would likely wait for each other (i don't know the specifics at the moment)
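For readers following along, a minimal sketch of the globalSetup pattern mentioned here (file names and the global handle are illustrative): one replica set is started once per Jest run and shared with all test files via an environment variable, so individual worker processes never race on the binary.

// globalSetup.ts
import { MongoMemoryReplSet } from 'mongodb-memory-server';

export default async function globalSetup(): Promise<void> {
  const replSet = await MongoMemoryReplSet.create({ replSet: { count: 1 } });
  // Environment variables set in globalSetup are passed on to the test workers.
  process.env.MONGO_URI = replSet.getUri();
  // Keep a handle so globalTeardown (which runs in the same process) can stop it.
  (globalThis as any).__MONGO_REPLSET__ = replSet;
}

// globalTeardown.ts (sketch)
// export default async function globalTeardown(): Promise<void> {
//   await (globalThis as any).__MONGO_REPLSET__.stop();
// }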

hasezoey commented 3 months ago

put in a fix in c60060977e91cd2c235d86d5ac27d194a7a9e3e3, which will be available as 9.2.1-beta.1. please test it for a while once it is available and report back (there will be a bot message)

github-actions[bot] commented 3 months ago

:tada: This issue has been resolved in version 9.2.1-beta.1 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

Simon-Hayden-iteratec commented 3 months ago

I'll give it a spin and see if we get any errors

Simon-Hayden-iteratec commented 3 months ago

I've run the pipeline five times now with the beta version and it didn't fail. But I also can't find the log message "binary found, but also a download-lock, trying to resolve lock" anywhere, which most likely means I didn't trigger the race condition this time.

I'm not sure whether something else is currently preventing the issue (e.g. lower load on the runner?), but I will keep an eye on it. If there are more failed runs I will let you know, though maybe in a new ticket so this one can be closed.

Thanks for your effort and quick fix implementation :1st_place_medal:

hasezoey commented 3 months ago

i will leave it open for a few more days, since the message you were searching for has not shown up yet. at the latest it will be closed once the stale message comes up (in case i forget to close it beforehand)

Simon-Hayden-iteratec commented 3 months ago

So our test pipeline hasn't been failing for the last few days. After looking at the logs I also found at least one run which had the log line "binary found, but also a download-lock, trying to resolve lock" in it. To me that sounds like the race condition is fixed :tada:

I'll close the bug report now. Thanks again for your super fast help :)

github-actions[bot] commented 3 months ago

:tada: This issue has been resolved in version 9.3.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket: