clezag closed this issue 1 week ago
Workaround: manually set the checkpoint timestamp to the current epoch in MongoDB.
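For reference, a minimal sketch of that manual reset using the Go driver; the database, collection, and field names (`notifier` / `checkpoints` / `ts`) are assumptions about the schema, not the actual one:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/bson/primitive"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Assumed location of the notifier's checkpoint document.
	checkpoints := client.Database("notifier").Collection("checkpoints")

	// Overwrite the stored resume timestamp with "now" so the change
	// stream restarts from the current point in the oplog.
	now := primitive.Timestamp{T: uint32(time.Now().Unix()), I: 1}
	res, err := checkpoints.UpdateOne(ctx,
		bson.M{}, // a single checkpoint document is assumed here
		bson.M{"$set": bson.M{"ts": now}},
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("matched %d, modified %d", res.MatchedCount, res.ModifiedCount)
}
```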
I've re-enabled some logging to find out what is happening. It looks like the checkpoint is not being saved at all.
A potential fix is now up. After flushing the checkpoint, the mutex was not reliably reset, which could in theory have prevented future updates.
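To illustrate the failure mode (this is not the notifier's actual code, just the general pattern): if the flush path releases the mutex manually, an early return leaves it locked and silently blocks every later checkpoint update; deferring the unlock avoids that.

```go
package checkpoint

import (
	"sync"

	"go.mongodb.org/mongo-driver/bson/primitive"
)

type checkpointStore struct {
	mu      sync.Mutex
	pending *primitive.Timestamp // last resume point not yet flushed
}

// Buggy variant: the unlock is skipped when flush returns early.
func (s *checkpointStore) flushBuggy(flush func(primitive.Timestamp) error) error {
	s.mu.Lock()
	if s.pending == nil {
		return nil // BUG: mutex is never unlocked on this path
	}
	if err := flush(*s.pending); err != nil {
		return err // BUG: mutex is never unlocked on this path either
	}
	s.pending = nil
	s.mu.Unlock()
	return nil
}

// Fixed variant: defer guarantees the mutex is released on every path.
func (s *checkpointStore) flushFixed(flush func(primitive.Timestamp) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending == nil {
		return nil
	}
	if err := flush(*s.pending); err != nil {
		return err
	}
	s.pending = nil
	return nil
}
```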
Occurred again, but this time it's related to a day-long MongoDB outage; the checkpoints have been saved correctly where they could be.
In cases where the last checkpoint is older than the oldest oplog entry, it's probably right to just restart at the oldest available event. Or fail, but in any case not fail silently like it does now.
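A rough sketch of that check with the Go driver, assuming the checkpoint timestamp is already loaded; reading `local.oplog.rs` directly needs the corresponding privileges:

```go
package resume

import (
	"context"
	"fmt"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/bson/primitive"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// oldestOplogTimestamp returns the ts of the oldest entry still in the oplog.
func oldestOplogTimestamp(ctx context.Context, client *mongo.Client) (primitive.Timestamp, error) {
	var entry struct {
		Ts primitive.Timestamp `bson:"ts"`
	}
	err := client.Database("local").Collection("oplog.rs").
		FindOne(ctx, bson.D{}, options.FindOne().SetSort(bson.D{{Key: "$natural", Value: 1}})).
		Decode(&entry)
	return entry.Ts, err
}

func tsBefore(a, b primitive.Timestamp) bool {
	return a.T < b.T || (a.T == b.T && a.I < b.I)
}

// resumePoint decides where to resume: it falls back to the oldest available
// event and surfaces an error when the stored checkpoint has already rotated
// out of the oplog, instead of failing silently.
func resumePoint(ctx context.Context, client *mongo.Client, checkpoint primitive.Timestamp) (primitive.Timestamp, error) {
	oldest, err := oldestOplogTimestamp(ctx, client)
	if err != nil {
		return primitive.Timestamp{}, err
	}
	if tsBefore(checkpoint, oldest) {
		// Events between checkpoint and oldest are gone for good; say so loudly.
		return oldest, fmt.Errorf("checkpoint %v predates oldest oplog entry %v: events were lost", checkpoint, oldest)
	}
	return checkpoint, nil
}
```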
I think this is fine how it is now.
The notifier fails with an error (the pod restarts) when the resume point is no longer in the oplog. This happens, for example, if there is a longer outage (multiple hours to days) after which the notifier starts again.
Instead of continuing silently, it's probably better to have this resolved by someone if it ever happens; otherwise, events that have already dropped out of the change stream are never pushed to the message queue. A script should be run that regenerates the missed events by other means (by looking at the individual databases/collections), after which the timestamp in the resume point can be updated and the notifier started again.
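A skeleton of what such a recovery script could look like; the watched collections, the event shape, and `publishToQueue` are all hypothetical placeholders, since they depend on how the notifier actually feeds the queue:

```go
package main

import (
	"context"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// publishToQueue stands in for whatever pushes events to the message queue.
func publishToQueue(db, coll string, doc bson.M) error {
	log.Printf("would publish synthetic event for %s.%s: %v", db, coll, doc["_id"])
	return nil
}

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Collections whose events may have been lost; placeholder list. Whether
	// everything is re-emitted or only documents changed since the last
	// checkpoint depends on the data model.
	watched := map[string][]string{"exampledb": {"examplecollection"}}

	for db, colls := range watched {
		for _, coll := range colls {
			cur, err := client.Database(db).Collection(coll).Find(ctx, bson.D{})
			if err != nil {
				log.Fatal(err)
			}
			for cur.Next(ctx) {
				var doc bson.M
				if err := cur.Decode(&doc); err != nil {
					log.Fatal(err)
				}
				if err := publishToQueue(db, coll, doc); err != nil {
					log.Fatal(err)
				}
			}
			cur.Close(ctx)
		}
	}
	// After this, set the checkpoint timestamp to "now" (see the workaround
	// snippet above) and start the notifier again.
}
```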
Verify the oplog size limits in MongoDB and increase them if necessary.
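One way to check and adjust that, sketched with the Go driver via `collStats` and the `replSetResizeOplog` admin command; the 16 GiB target is only an example value and should be chosen from the expected maximum outage duration and write volume:

```go
package main

import (
	"context"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Current oplog cap in bytes (collStats on local.oplog.rs).
	var stats struct {
		MaxSize int64 `bson:"maxSize"`
	}
	err = client.Database("local").
		RunCommand(ctx, bson.D{{Key: "collStats", Value: "oplog.rs"}}).
		Decode(&stats)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("current oplog maxSize: %d MB", stats.MaxSize/1024/1024)

	// Resize the oplog (size is given in megabytes); requires admin privileges.
	res := client.Database("admin").RunCommand(ctx, bson.D{
		{Key: "replSetResizeOplog", Value: 1},
		{Key: "size", Value: float64(16384)}, // 16 GiB, example only
	})
	if err := res.Err(); err != nil {
		log.Fatal(err)
	}
	log.Println("oplog resized")
}
```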
After a restart the notifier fails with the following error:
Looks like the checkpoints are not being saved correctly, as 19.09 was probably the last time the notifier had been restarted.