typegoose / mongodb-memory-server

Manage & spin up mongodb server binaries with zero (or slight) configuration for tests.
https://typegoose.github.io/mongodb-memory-server/

MongoBulkWriteError: Server selection timed out after 30000 ms with replSet: 2 #733

Closed simllll closed 1 year ago

simllll commented 1 year ago

Versions

package: mongo-memory-server-core

What is the Problem?

Our test suite is kind of flaky, because mongodb memory server with replSet: 2 sometimes throws MongoBulkWriteError: Server selection timed out after 30000 ms.

Debug Output

We have a huge debug output, and I cannot tell for sure which part of it is really relevant. The problem here is that it runs a whole test suite, and therefore the output is kinda mixed up with other lines.

Here is the part that I guess is the most relevant one of it: https://gist.github.com/simllll/5aae130760cec9c5c37280e55e4a4b48

Do you know why it happens?

Not sure, but probably related to no primary being elected?

The log starts at 2023-01-12T15:41:52.6771182Z 2023-01-12T15:41:13.825Z MongoMS:MongoMemoryReplSet initAllServers: waiting for all servers to finish starting

but somehow already at

2023-01-12T15:41:52.8240530Z 2023-01-12T15:41:46.479Z MongoMS:MongoMemoryReplSet stop: called by manual
2023-01-12T15:41:52.8241005Z 2023-01-12T15:41:46.479Z MongoMS:MongoMemoryServer Mongo[40771]: stop: Called .stop() method

it tells stop has been called manually? this doesn't sound like the 30 seconds?

hasezoey commented 1 year ago

because mongodb memory server with replSet: 2

it is recommended (by mongodb itself) to run an odd number of replset members, and an even number is discouraged

also, for MMS it is recommended to run either 1 or 3 instances in a replset
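
for example, a 3-member replset can be created like this (a minimal sketch; the connect / teardown part is only illustrative):

import { MongoMemoryReplSet } from 'mongodb-memory-server';
import mongoose from 'mongoose';

// start a replset with an odd number of members (3), as recommended
const replSet = await MongoMemoryReplSet.create({ replSet: { count: 3 } });

// connect to the whole replset via the generated uri
await mongoose.connect(replSet.getUri());

// ... run tests ...

await mongoose.disconnect();
await replSet.stop({ doCleanup: true });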

it tells stop has been called manually?

i know the phrasing could be better, but this basically means that .stop was not called in some process.on('exit') event

https://github.com/nodkz/mongodb-memory-server/blob/3ed0d9737e68c27847c4cda56c99299905787e43/packages/mongodb-memory-server-core/src/MongoMemoryReplSet.ts#L521

this doesn't sound like the 30 seconds?

i am not quite sure what 30 seconds you are referring to, do you mean the server selection timeout?


Debug log dissection:

at Line 460 you can see that the replset has started and elected its first primary, which is an event MMS waits for before declaring the replset finished starting

at Line 864 and Line 873 you can see that the current Primary cannot see the other part of the replset (which in this case is just one instance) and so transitions to no longer be a primary, effectively making the whole replset read-only (if not completely inaccessible) until another primary can be elected

at Line 963 you can see that a Primary has been elected again

at Line 1065 you can see that the .stop method has been called

the whole process between the first primary being elected, stepping down, and another primary being elected took less than 2 seconds


could you provide details on when and how your test-suite fails, and also what your connection setup looks like (mongoose.connect() or mongoose.createConnection())?

simllll commented 1 year ago

Thanks for your response. First: yeah, I was referring to the error message, which states the 30 seconds. That confused me, because in the log this happens within a few seconds (like you said) and never takes anywhere near 30 seconds.

Test output:

 71 passing (50s)
         1 failing

"before all" hook in "...":
            MongoBulkWriteError: Server selection timed out after 30000 ms
             at resultHandler (/home/simon/Dev/hokify/hokify-server/node_modules/mongodb/src/bulk/common.ts:600:9)
             at /home/simon/Dev/hokify/hokify-server/node_modules/mongodb/src/utils.ts:463:14

        ELIFECYCLE  Test failed. See above for more details.

This is already the complete error output of the test suite;

the before() hook calls this.mongo = await mockMongoose({ count: 2 });

and the code of mockMongoose is something like this:

const mongod = await (replSet ? MongoMemoryReplSet : MongoMemoryServer).create({
    replSet,
    instance: {
        // this is needed because the first time it runs it downloads the binary
        launchTimeout: 120000
    },
    binary: {
        version: '6.0.3',
        // set to package directory, so it can be reused
        downloadDir: path.resolve(__dirname, '../../.cache/mongodb-binaries')
    }
});

const uri = mongod.getUri();

const self: IMockMongoose = {
    mongod,
    mongoose: await mongoose.connect(uri),
    async disconnect() {
        await self.mongoose.disconnect();
        await self.mongod.stop({ doCleanup: true });
        mongooseLog('stopped');
    }
};

return self;

does this help? And just to be sure, this is only happening from time to time... not always. If I repeat the test right afterwards, it usually completes fine.

hasezoey commented 1 year ago

71 passing (50s) 1 failing

does this problem only occur when that 1 test is failing?

this problem looks like some connection is not properly closed after the test(s) have ended - could you maybe try the following and see if it fails?

it('test', async function () {
  // note: a regular function (not an arrow function) is needed for "this.timeout" to work in mocha
  this.timeout(1000 * 60 * 2);
  console.log("before");
  // keep the suite alive (and the connection idle) for a minute
  await new Promise(res => setTimeout(res, 1000 * 60));
  console.log("after");
});

and see if it fails while running the test (expecting the mongodb error); maybe even try to do some mongodb operations in-between


side-note:

// this is needed because the first time it runs it downloads the binary launchTimeout: 120000

since 8.10.1 (619d1fab5cece44865e40261bab9313027dbe44f), launchTimeout no longer affects binary downloads (which was a bug before); now it only refers to the time allowed between actually trying to start the binary and getting either an error or a success

simllll commented 1 year ago

there is only one test that uses replSet: 2, that's why only one is failing. It has nothing to do with the other tests; with a lot of retries I can also reproduce it by running just that one test alone.

I tried to break it down, by creating a simple script that just tries to repeat the process till it fails:

import 'reflect-metadata';
import { mockMongoose } from '../mocks';

(async () => {
    for (let i = 0; i < 300; i++) {
        console.log('go', i);
        const m = await mockMongoose({ count: 2 });
        await m.disconnect();
        console.log('done', i);
    }
})();

at least locally, this didn't break... so I was thinking, since I remember jest also doing weird things sometimes, maybe it's related to mocha... so I wrapped the whole thing inside a test:

it('should mock mongodb replset', async function startMongoDb() {
    for (let i = 0; i < 300; i++) {
        console.log('go', i);
        const m = await mockMongoose({ count: 2 });
        await m.disconnect();
        console.log('done', i);
    }
});

but I still couldn't reproduce it... so right now I'm kinda out of ideas to track this down further... any other suggestions? anything in the logs that you can see that tells us why this is happening?

hasezoey commented 1 year ago

but I still couldn't reproduce it... so right now I'm kinda out of ideas to track this down further... any other suggestions? anything in the logs that you can see that tells us why this is happening?

you could maybe provide the full failing test (plus before / after hooks) to see if there is anything wrong with it (if required, you can also private-message me the code if you are not able to share it publicly)

as for the logs, there is nothing that would indicate wrong behavior on mongodb's part (only the mentioned election problems with an even member count)

simllll commented 1 year ago

Thanks @hasezoey for your suggestion. Unfortunately I'm unable to provide the whole test case, but I'm still trying to break it down... something you also suggested before, I guess, is now a bit clearer to me. It seems that if another test fails and the test suite cancels the run (at least that's what I'm guessing right now), it produces the "Server selection timeout" error too... maybe the culprit is therefore not even directly related to the replset, but just how the mongo client handles the situation on termination in this case...

Update 1: It's not about other tests failing; it seems to be related to parallel execution of tests. If I run it with lerna run test --concurrency 1 there seems to be no issue, but as soon as I increase the concurrency to 2 or more, it fails quite a lot...

Update 2: Mongoose actually connects, but the first query fails. await mongoose.connect(uri) succeeds, but then e.g. await DBJobFieldModel.create(...) fails with the server selection timed out error.

hasezoey commented 1 year ago

It's not about other tests failing; it seems to be related to parallel execution of tests

what are you using, jest or mocha (or something else)? also do you open a new replset for every test suite (/ file) or only once in a global-setup / global-teardown (which is recommended)?

simllll commented 1 year ago

we are using mocha. Yes, because the tests run independently and in parallel, we have a separate mongo instance for each test suite. Running it globally would interfere with the test runs because they all seed different data and initial states.

hasezoey commented 1 year ago

we are using mocha.

then i would like to warn that mocha, unlike jest, re-uses test environments, so globals / module / package settings may leak into another test-suite (i learned this the hard way while debugging mongoose tests), though i don't know how it handles parallel tests.

in jest with parallel tests it just works in the testing i have done (the MMS test suite itself, and typegoose for example). if you really need splitting, then consider using different databases on the same instance (like the mongoose connection.useDb('differentDb') call). also, starting just one instance globally and stopping it at the end may make your tests way faster (and probably more reliable, given how often mongodb may just decide not to accept operations)

if you want an example for a mocha global setup, then i would suggest looking at how mongoose does it:
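
in the meantime, a minimal sketch of such a mocha global fixture (file name and the env-variable handoff are illustrative, not mongoose's actual setup; the file is registered via mocha's --require):

// global-fixture.ts (illustrative)
import { MongoMemoryReplSet } from 'mongodb-memory-server';

let replSet: MongoMemoryReplSet;

// mocha global fixture: runs once before the whole test run
export async function mochaGlobalSetup(): Promise<void> {
    replSet = await MongoMemoryReplSet.create({ replSet: { count: 3 } });
    // hand the uri to the individual test suites, e.g. via an env variable
    process.env.MONGO_URI = replSet.getUri();
}

// mocha global fixture: runs once after the whole test run
export async function mochaGlobalTeardown(): Promise<void> {
    await replSet.stop({ doCleanup: true });
}

each suite could then connect to that shared instance and isolate itself with connection.useDb('some-suite-db') instead of spinning up its own mongod.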

simllll commented 1 year ago

Thanks, I will look into this. Just for documentation purposes, I have some insights:

1.) The mongoose connection actually connects.

2.) Running const status = await mongooseConnection.connection.db.admin().command({ replSetGetStatus: 1 }); returns

{
         set: 'testset',
         date: 2023-01-17T15:40:56.205Z,
         myState: 1,
         term: 1,
         syncSourceHost: '',
         syncSourceId: -1,
         heartbeatIntervalMillis: 2000,
         majorityVoteCount: 2,
         writeMajorityCount: 2,
         votingMembersCount: 2,
         writableVotingMembersCount: 2,
         optimes: {
           lastCommittedOpTime: { ts: new Timestamp({ t: 0, i: 0 }), t: -1 },
           lastCommittedWallTime: 1970-01-01T00:00:00.000Z,
           appliedOpTime: { ts: new Timestamp({ t: 1673970056, i: 9 }), t: 1 },
           durableOpTime: { ts: new Timestamp({ t: 1673970055, i: 1 }), t: -1 },
           lastAppliedWallTime: 2023-01-17T15:40:56.171Z,
           lastDurableWallTime: 2023-01-17T15:40:55.583Z
         },
         electionCandidateMetrics: {
           lastElectionReason: 'electionTimeout',
           lastElectionDate: 2023-01-17T15:40:56.134Z,
           electionTerm: 1,
           lastCommittedOpTimeAtElection: { ts: new Timestamp({ t: 0, i: 0 }), t: -1 },
           lastSeenOpTimeAtElection: { ts: new Timestamp({ t: 1673970055, i: 1 }), t: -1 },
           numVotesNeeded: 2,
           priorityAtElection: 1,
           electionTimeoutMillis: 500,
           numCatchUpOps: 0,
           newTermStartDate: 2023-01-17T15:40:56.158Z
         },
         members: [
           {
             _id: 0,
             name: '127.0.0.1:40527',
             health: 1,
             state: 1,
             stateStr: 'PRIMARY',
             uptime: 2,
             optime: [Object],
             optimeDate: 2023-01-17T15:40:56.000Z,
             lastAppliedWallTime: 2023-01-17T15:40:56.171Z,
             lastDurableWallTime: 2023-01-17T15:40:55.583Z,
             syncSourceHost: '',
             syncSourceId: -1,
             infoMessage: '',
             electionTime: new Timestamp({ t: 1673970056, i: 1 }),
             electionDate: 2023-01-17T15:40:56.000Z,
             configVersion: 1,
             configTerm: 1,
             self: true,
             lastHeartbeatMessage: ''
           },
           {
             _id: 1,
             name: '127.0.0.1:42045',
             health: 1,
             state: 5,
             stateStr: 'STARTUP2',
             uptime: 0,
             optime: [Object],
             optimeDurable: [Object],
             optimeDate: 1970-01-01T00:00:00.000Z,
             optimeDurableDate: 1970-01-01T00:00:00.000Z,
             lastAppliedWallTime: 1970-01-01T00:00:00.000Z,
             lastDurableWallTime: 1970-01-01T00:00:00.000Z,
             lastHeartbeat: 2023-01-17T15:40:56.158Z,
             lastHeartbeatRecv: 2023-01-17T15:40:56.169Z,
             pingMs: 3,
             lastHeartbeatMessage: '',
             syncSourceHost: '',
             syncSourceId: -1,
             infoMessage: '',
             configVersion: 1,
             configTerm: 0
           }
         ],
         ok: 1,
         '$clusterTime': {
           clusterTime: new Timestamp({ t: 1673970056, i: 9 }),
           signature: {
             hash: new Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
             keyId: 0
           }
         },
         operationTime: new Timestamp({ t: 1673970056, i: 9 })
       }

3.) It seems like what you were referring to: I then see that mongoose.disconnect() is called, which terminates the server... but right after that, I now get MongoPoolClearedError: Connection pool for 127.0.0.1:35717 was cleared because another operation failed with: "operation was interrupted" or MongoPoolClearedError: Connection pool for 127.0.0.1:42485 was cleared because another operation failed with: "not primary" or the initial timeout error

hasezoey commented 1 year ago

"operation was interrupted" "not primary"

seem to be related; like i said, no operations can be done as long as there is no primary, which, according to your logs, exists at first, then steps down, and later comes back up again. and when the state changes, the operations on it are interrupted.

did you already try changing your replset count to an odd number (in your case the nearest odd number would be 3)?

simllll commented 1 year ago

replSet: 3 didn't change anything, but now I was testing another approach; since your guess was that it could be related to the replset state change, I thought... well, then let's wait until everything is fine by adding the following code to my mockMongoose method:

let allRunning = false;
do {
    const status = await mongooseConnection.connection.db
        .admin()
        .command({ replSetGetStatus: 1 });

    allRunning = status.members.some(
        member => member.stateStr === 'PRIMARY' || member.stateStr === 'SECONDARY'
    );
    console.log('allRunning', allRunning);
    if (!allRunning) {
        // wait a bit before polling the replset status again
        await new Promise(res => setTimeout(res, 100));
    }
} while (!allRunning);

and now it gets a bit weird:

1.) mongo starts up and I get the following connection uri: mongodb://127.0.0.1:41863,127.0.0.1:36727/?replicaSet=testset

2.) mongoose connects successfully

3.) the above script should now verify that everything is up & running, but instead it immediately throws MongoPoolClearedError: Connection pool for 127.0.0.1:41863 was cleared because another operation failed with: "operation was interrupted" or PoolClearedError [MongoPoolClearedError]: Connection pool for 127.0.0.1:46461 was cleared because another operation failed with: "not primary"

as a side note: this happens when there are two parallel tests, one with a replset and one without. it seems killing the single mongod instance (without the replset) also kills the replset?! The single instance is running on mongodb://127.0.0.1:36363/

hasezoey commented 1 year ago

it seems killing the single mongod instance (without the replset) also kills the replset?!

this shouldn't be the case; for further investigation i would need access to the start & stop code and hooks

also verify that the debug messages noting a stop are for the same port (which is the number noted between the [ and ])

assuming you don't try to use one of the replset's instances as a single server (replset.servers[0] for example)

PoolClearedError [MongoPoolClearedError]: Connection pool for 127.0.0.1:46461 was cleared because another operation failed with: "not primary"

probably (according to your initial log) this is because MMS waits until a first PRIMARY is found (via log output), which mongoose then directly connects to, but immediately afterwards in the log it changes to another state and so there is no primary anymore for some time (in the log it was ~2 seconds)

but this loss of majority and failed election should be mitigated by using an odd number of instances

simllll commented 1 year ago

the odd number unfortunately didn't help; I even tried it with more, e.g. 5. Somehow I blame mongoose right now, because we have this "global" connection await mongoose.connect(uri) and not mongoose.createConnection(uri). Let's assume mocha somehow shares the code: if one test succeeds a bit earlier, it calls disconnect in the after hook, but this is before the 2nd test has finished... it seems that killing the first connection somehow also impacts the 2nd connection, maybe because of the global mongoose connection. I will need to investigate more to prove it, but I think the issue is not in mongodb itself, but in the connection handling (either due to the global mongoose connection, or something else that causes this in the mocha test run)

Update 1: removing the content of the disconnect() function leads to something that kind of proves my last guess:

MongooseError: Can't call `openUri()` on an active connection with different connection strings. Make sure you aren't calling `mongoose.connect()` multiple times. See: https://mongoosejs.com/docs/connections.html#multiple_connections

somehow it reuses mongoose apparently...
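
for completeness, a minimal sketch of a per-suite connection via mongoose.createConnection (names are illustrative and this assumes a mongoose version with .asPromise(), i.e. 6+):

import mongoose from 'mongoose';
import { MongoMemoryReplSet } from 'mongodb-memory-server';

// each suite gets its own connection object instead of sharing the global mongoose connection
const replSet = await MongoMemoryReplSet.create({ replSet: { count: 3 } });
const connection = await mongoose.createConnection(replSet.getUri()).asPromise();

// models have to be registered on this connection, not on the global mongoose instance, e.g.:
// const JobFieldModel = connection.model('JobField', jobFieldSchema);

// closing this connection does not touch the connections of other suites
await connection.close();
await replSet.stop({ doCleanup: true });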

simllll commented 1 year ago

Okay, finally some progress, but I still haven't found the real issue.

1.) Somehow mongoose.disconnect() && stop({ doCleanup: true }) takes more than 2 seconds in the replset setup, and this is longer than the 2-second default timeout of the "after" hook in mocha... so why is this a problem? Because the hook is still processed, just very delayed, and then mongoose's global connection is disconnected while another job has already started running... imagine the following timeline:

This was at least one of the problems I hit while trying to find the root cause... I "easily" fixed this for now by increasing the default timeout, roughly like the sketch below.
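
a rough sketch of that bump (the hook body is illustrative, assuming the this.mongo handle set in the before() hook):

after(async function () {
    // mocha's default hook timeout of 2000 ms is too short for the replset teardown
    this.timeout(30000);
    await this.mongo.disconnect();
});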

2.) Now the only error left in my test setup is the random "MongooseServerSelectionError: Server selection timed out after 30000 ms"... Yesterday evening it was popping up almost every run; today it takes more than 5 runs to get it... I will provide a small project shortly that triggers the error... maybe this helps to track it down.

one thing though:

simllll commented 1 year ago

I was trying to extract the issue and find a more reproducible script, but unfortunately I'm not able to do so... it seems to depend somehow on the number of models that are registered... What I was able to do, though, is extract one exact run with all related log files that produces the error, see here: https://gist.github.com/simllll/baefac5d90ceedfc20c6a982132daa92

I believe the relevant log lines are these:

allRunning 2 true <-- this is console.log('allRunning', status.members.length, allRunning); which means 2 instances are up & running and each of them is either primary or secondary.
mongo setup started up
mockMongoose: 3.028s <-- time for spinning it up
beforeAll done
before
2023-01-18T18:51:36.429Z MongoMS:MongoInstance Mongo[43701]: stdoutHandler: ""{"t":{"$date":"2023-01-18T19:51:36.429+01:00"},"s":"I",  "c":"STORAGE",  "id":20320,   "ctx":"conn39","msg":"createCollection","attr":{"namespace":"test.storagefiles","uuidDisposition":"generated","uuid":{"uuid":{"$uuid":"f04db939-babf-49e1-895f-4393fb11a34d"}},"options":{}}}""
2023-01-18T18:51:36.430Z MongoMS:MongoInstance Mongo[43701]: stdoutHandler: ""{"t":{"$date":"2023-01-18T19:51:36.430+01:00"},"s":"I",  "c":"INDEX",    "id":20345,   "ctx":"conn39","msg":"Index build: done building","attr":{"buildUUID":null,"collectionUUID":{"uuid":{"$uuid":"f04db939-babf-49e1-895f-4393fb11a34d"}},"namespace":"test.storagefiles","index":"_id_","ident":"index-84--7882176241526893432","collectionIdent":"collection-83--7882176241526893432","commitTimestamp":null}}""
2023-01-18T18:51:36.486Z MongoMS:MongoInstance Mongo[36235]: stdoutHandler: ""{"t":{"$date":"2023-01-18T19:51:36.486+01:00"},"s":"I",  "c":"-",        "id":4939300, "ctx":"monitoring-keys-for-HMAC","msg":"Failed to refresh key cache","attr":{"error":"InitialSyncActive: Cannot refresh keys collection cache during initial sync","nextWakeupMillis":1000}}""
2023-01-18T18:51:36.721Z MongoMS:MongoInstance Mongo[43701]: stdoutHandler: ""{"t":{"$date":"2023-01-18T19:51:36.721+01:00"},"s":"I",  "c":"REPL",     "id":21216,   "ctx":"ReplCoord-1","msg":"Member is now in state DOWN","attr":{"hostAndPort":"127.0.0.1:36235","heartbeatMessage":"no response within election timeout period"}}
{"t":{"$date":"2023-01-18T19:51:36.721+01:00"},"s":"I",  "c":"REPL",     "id":21809,   "ctx":"ReplCoord-1","msg":"Can't see a majority of the set, relinquishing primary"}
{"t":{"$date":"2023-01-18T19:51:36.721+01:00"},"s":"I",  "c":"REPL",     "id":21475,   "ctx":"ReplCoord-1","msg":"Stepping down from primary in response to heartbeat"}"" <-- stepping down?! nooooo

the heartbeat seems to fail... maybe due to too high load? Do you think it makes sense to increase the heartbeat timeout?

hasezoey commented 1 year ago

the heartbeat seems to fail... maybe due to too high load? Do you think it makes sense to increase the heartbeat timeout?

i have no clue why it fails, but you can try to increase it. still, the step-down only causes a short ~2 second gap before the next primary is elected, which the connection should have been able to find

i cannot find a particular failure point or guess why a timeout happens, unless for some reason the timeout does not actually get cleared on disconnect, or, if the context is reused, the connection somehow tries to connect to a server that is not active anymore

simllll commented 1 year ago

The only relevant config I could think of is this one: https://www.mongodb.com/docs/manual/reference/replica-configuration/#mongodb-rsconf-rsconf.settings.heartbeatTimeoutSecs

but the default value is already 10 seconds... far longer than the total time my instances are running (max 5 seconds)... any other config settings you can suggest tweaking?
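
for reference, a sketch of how those settings could presumably be passed through MMS's replSet.configSettings option (assuming it accepts the heartbeatTimeoutSecs / electionTimeoutMillis fields):

import { MongoMemoryReplSet } from 'mongodb-memory-server';

// sketch: give heartbeats and elections more headroom than the defaults
const replSet = await MongoMemoryReplSet.create({
    replSet: {
        count: 3,
        configSettings: {
            heartbeatTimeoutSecs: 20,
            electionTimeoutMillis: 10000
        }
    }
});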

hasezoey commented 1 year ago

any other config settings you can suggest tweaking?

no, i also don't really have any experience with those config options (never needed to use / overwrite them)

github-actions[bot] commented 1 year ago

Marking the Issue as stale, it will be closed in 7 days if no more activity is seen

github-actions[bot] commented 1 year ago

Closing the Issue because it is marked as stale

simllll commented 1 year ago

Since the issue is closed, for the record: our problem has also been "magically" solved. Not sure why, but it hasn't happened even once in over 4 weeks now.