noi-techpark / infrastructure-v2

Open Data Hub Infrastructure v2 Repository

Notifier fails because resume point is no longer in oplog #12

Closed clezag closed 1 week ago

clezag commented 1 year ago

After a restart the notifier fails with the following error:

> change-stream@1.0.0 start
> node src/main.js

connected to rabbitmq guest:guest@rabbitmq-0.rabbitmq-headless.default.svc.cluster.local:5672
connecting to db mongodb://mongodb-headless.default.svc.cluster.local:27017
{
  _id: new ObjectId("64d4e6259c5245a18d063d65"),
  timestamp: new Timestamp({ t: 1695130337, i: 1 }),
  date: 2023-09-19T13:32:17.000Z
}
Reading events starting from Tue Sep 19 2023 13:32:17 GMT+0000 (Coordinated Universal Time)
Now is Thu Oct 12 2023 13:26:15 GMT+0000 (Coordinated Universal Time)
node:events:491
      throw er; // Unhandled 'error' event
      ^

MongoServerError: PlanExecutor error during aggregation :: caused by :: Resume of change stream was not possible, as the resume point may no longer be in the oplog.
    at Connection.onMessage (/code/node_modules/mongodb/lib/cmap/connection.js:231:30)
    at MessageStream.<anonymous> (/code/node_modules/mongodb/lib/cmap/connection.js:61:60)
    at MessageStream.emit (node:events:513:28)
    at processIncomingData (/code/node_modules/mongodb/lib/cmap/message_stream.js:125:16)
    at MessageStream._write (/code/node_modules/mongodb/lib/cmap/message_stream.js:33:9)
    at writeOrBuffer (node:internal/streams/writable:392:12)
    at _write (node:internal/streams/writable:333:10)
    at Writable.write (node:internal/streams/writable:337:10)
    at Socket.ondata (node:internal/streams/readable:766:22)
    at Socket.emit (node:events:513:28)
Emitted 'error' event on ChangeStream instance at:
    at ChangeStream._closeEmitterModeWithError (/code/node_modules/mongodb/lib/change_stream.js:280:14)
    at ChangeStream._processErrorStreamMode (/code/node_modules/mongodb/lib/change_stream.js:348:18)
    at ReadableCursorStream.<anonymous> (/code/node_modules/mongodb/lib/change_stream.js:300:42)
    at ReadableCursorStream.emit (node:events:513:28)
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  ok: 0,
  code: 286,
  codeName: 'ChangeStreamHistoryLost',
  '$clusterTime': {
    clusterTime: Timestamp { low: 1, high: 1697117161, unsigned: true },
    signature: {
      hash: Binary {
        sub_type: 0,
        buffer: Buffer(20) [Uint8Array] [
          0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0
        ],
        position: 20
      },
      keyId: 0
    }
  },
  operationTime: Timestamp { low: 1, high: 1697117161, unsigned: true },
  [Symbol(errorLabels)]: Set(1) { 'NonResumableChangeStreamError' }
}

Node.js v18.16.1
npm notice
npm notice New major version of npm available! 9.5.1 -> 10.2.0
npm notice Changelog: <https://github.com/npm/cli/releases/tag/v10.2.0>
npm notice Run `npm install -g npm@10.2.0` to update!
npm notice

It looks like the checkpoints are not saved correctly, as 19.09 was probably the last time the notifier had been restarted.
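
For illustration, a minimal sketch of how a notifier like this typically resumes a change stream from a stored checkpoint; the `checkpoints` collection and field names are assumptions based on the document printed above, not the actual code:

```js
const { MongoClient } = require('mongodb');

async function startNotifier(uri) {
  const client = await MongoClient.connect(uri);

  // Load the last saved resume point (hypothetical collection/field names).
  const checkpoint = await client.db('notifier').collection('checkpoints').findOne({});
  console.log('Reading events starting from', checkpoint.date);

  // Resume the cluster-wide change stream at the stored operation time.
  // If that timestamp has already rotated out of the oplog, the server rejects
  // the aggregation with code 286 (ChangeStreamHistoryLost), as in the log above.
  const stream = client.watch([], { startAtOperationTime: checkpoint.timestamp });

  stream.on('change', (event) => {
    // ... publish the event to RabbitMQ, then persist a new checkpoint ...
  });
  // Without an 'error' listener, the failure surfaces as the unhandled
  // 'error' event shown in the stack trace above.
}
```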

clezag commented 1 year ago

Worked around it by manually setting the timestamp to the current epoch in MongoDB.
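
A rough sketch of that workaround in a recent mongosh, assuming the checkpoint lives in a `checkpoints` collection shaped like the document in the log (the names are assumptions):

```js
// Move the resume point forward to "now" so the change stream can start again.
const now = Math.floor(Date.now() / 1000);
db.checkpoints.updateOne(
  {},
  { $set: { timestamp: new Timestamp({ t: now, i: 1 }), date: new Date() } }
);
```

Note that any events between the lost resume point and "now" are skipped by this workaround.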

clezag commented 1 year ago

I've reactivated some logging to find out what happens. It looks like the checkpoint is not being saved at all.

clezag commented 1 year ago

A potential fix is now up. After flushing the checkpoint, the mutex was not reliably reset, which could in theory have prevented future updates.
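
A hypothetical sketch of that kind of fix: release the flush mutex in a `finally` block so an error while writing the checkpoint can no longer leave it locked (names are illustrative, not the actual code):

```js
let flushing = false;

async function flushCheckpoint(collection, timestamp) {
  if (flushing) return;   // a flush is already in progress
  flushing = true;
  try {
    await collection.updateOne(
      {},
      { $set: { timestamp, date: new Date() } },
      { upsert: true }
    );
  } finally {
    flushing = false;     // previously this reset could be skipped, blocking all future flushes
  }
}
```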

clezag commented 1 year ago

It occurred again, but this time it's related to a day-long MongoDB outage; the checkpoints were saved correctly where they could be.

In cases where the last checkpoint is older than the oldest oplog entry, it's probably right to just restart at the oldest available event. Or fail explicitly, but in any case not fail silently like it does now.
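
One way to make that decision explicit, sketched against the Node.js driver (reading `local.oplog.rs` directly; this is an assumption about how it could be done, not existing code):

```js
async function pickResumePoint(client, checkpoint) {
  // Oldest entry still in the oplog, i.e. the earliest point a change stream can resume from.
  const oldest = await client
    .db('local')
    .collection('oplog.rs')
    .find()
    .sort({ $natural: 1 })
    .limit(1)
    .next();

  // The Timestamp seconds live in the `high` field, as visible in the log above.
  if (checkpoint.timestamp.high < oldest.ts.high) {
    // Checkpoint predates the oplog: either resume from the oldest available event
    // (accepting a possible gap) or fail loudly so an operator can intervene.
    console.warn('Checkpoint older than oplog start, resuming from oldest oplog entry');
    return oldest.ts;
  }
  return checkpoint.timestamp;
}
```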

clezag commented 10 months ago

I think this is fine as it is now.

The notifier fails with an error (pod restarts) when the resume point is no longer in the oplog. This happens, for example, if there is a longer outage (multiple hours to days) after which the notifier starts again.

Instead of continuing silently, it's probably better to have this resolved by someone if it ever happens; otherwise, events that have already fallen out of the change stream are never pushed to the message queue. A script should then be run that regenerates the missed events by other means (by looking at the individual databases/collections), after which the timestamp in the resume point can be updated and the notifier started again.
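
Sketch of that failure mode, assuming a `stream` handle like the one in the earlier sketch: treat `ChangeStreamHistoryLost` as a hard error that requires manual intervention instead of resuming silently:

```js
stream.on('error', (err) => {
  if (err.code === 286 || err.codeName === 'ChangeStreamHistoryLost') {
    console.error(
      'Resume point no longer in the oplog. Regenerate the missed events from the ' +
      'source collections, update the checkpoint timestamp, then restart the notifier.'
    );
  } else {
    console.error(err);
  }
  process.exit(1); // fail the pod so the problem stays visible and gets resolved by someone
});
```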

clezag commented 10 months ago

Verify the oplog limits in MongoDB and increase them if necessary.
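
For reference, the oplog window can be inspected and resized from mongosh (run against each replica set member; the numbers below are placeholders, not recommendations):

```js
// Shows the configured oplog size and the time span it currently covers.
rs.printReplicationInfo();

// Increase the oplog size (value in MB)...
db.adminCommand({ replSetResizeOplog: 1, size: 16384 });

// ...and/or enforce a minimum retention window (MongoDB 4.4+),
// so longer notifier outages remain resumable.
db.adminCommand({ replSetResizeOplog: 1, minRetentionHours: 48 });
```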