Open mdutoo opened 9 years ago
I would like @tbroyer 's feedback regarding "tell IPGarde not to snapshot primary anymore". As discussed by email, it's fine with me.
+cc @jpoittevin
I also think it's a good idea to back up a secondary node. I don't really know the snapshot technology IPGarde uses for the Veeam snapshots, but it probably induces some overhead, even if slight.
Perfect, everyone agrees!
@mdutoo let's say the primary M1 really goes down and the (backed-up) secondary M2 becomes primary. If we're not notified that the primary M1 is down, then during the next M2 backup M3 could be elected primary, but it is in the other data center.
So I would prefer to have the alert on M1 before asking IPGarde to remove the backup.
Could the backup process launch a pre-script that steps down the node to be backed up a few seconds before the backup starts, to ensure that the node supposed to be the primary really is the primary during the backup?
@silently I'll let you put @jpoittevin's idea to IPGarde. But even if it's possible, the backup would fail if the node to be backed up is down. That is solved by refining the pre-script idea as follows: back up all nodes, but abort a node's backup before it starts if the pre-script says that node is primary. So as long as there's at least one secondary, we'll have a backup.
But if that can't be done, merely backing up a given secondary (or both) is not worse than today, where the primary is backed up every time (not only in some error cases). And since most changes of primary are likely caused by network micro-cuts induced by VM snapshotting, that would even be much better.
And all these solutions would be further improved if we were alerted of changes of primary, which would allow us to change it back, in addition to doing error analysis. As I've said, it can be done by MMS.
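To make the "abort if primary" pre-script idea concrete, here is a minimal sketch of the decision it would take. It assumes the pre-script can obtain the document returned by `db.isMaster()` on the node; the helper name `mayBackUp`, the sample documents and the host names are made up for illustration, and whether IPGarde's tooling can actually run such a pre-script is still the open question.

```javascript
// Hypothetical helper: decide whether a node may be backed up, given the
// document returned by db.isMaster() on that node.
function mayBackUp(isMasterDoc) {
  // Only back up a healthy secondary: skip the primary, and also skip
  // nodes that are neither primary nor secondary (e.g. still recovering).
  return isMasterDoc.ismaster !== true && isMasterDoc.secondary === true;
}

// Sample documents shaped like db.isMaster() output (host names are made up).
var onPrimary = { setName: "rs0", ismaster: true, secondary: false, primary: "m1:27017", me: "m1:27017" };
var onSecondary = { setName: "rs0", ismaster: false, secondary: true, primary: "m1:27017", me: "m2:27017" };

console.log(mayBackUp(onPrimary));   // false
console.log(mayBackUp(onSecondary)); // true
```

In the mongo shell this would be called as `mayBackUp(db.isMaster())`, with the pre-script exiting non-zero when it returns `false` (assuming the backup tool treats a failing pre-script as "skip this node").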
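Until MMS (or another monitoring tool) is in place, the "alert on change of primary" check could be as simple as the sketch below. It is not MMS and not an existing script of ours: `checkPrimary`, the expected host name and the sample documents are assumptions, and the only MongoDB-specific part is the `primary` field of the `db.isMaster()` result.

```javascript
// Hypothetical check: compare the primary reported by db.isMaster() with the
// node we expect to be primary. Returns an alert message, or null if all is well.
function checkPrimary(expectedPrimary, isMasterDoc) {
  var actual = isMasterDoc.primary; // absent when no primary is currently elected
  if (!actual) {
    return "ALERT: replica set has no primary";
  }
  if (actual !== expectedPrimary) {
    return "ALERT: primary is " + actual + ", expected " + expectedPrimary;
  }
  return null;
}

// Sample documents shaped like db.isMaster() output (host names are made up).
var normal = { setName: "rs0", ismaster: false, secondary: true, primary: "m1:27017" };
var failedOver = { setName: "rs0", ismaster: false, secondary: true, primary: "m2:27017" };

console.log(checkPrimary("m1:27017", normal));     // null
console.log(checkPrimary("m1:27017", failedOver)); // ALERT: primary is m2:27017, expected m1:27017
```

Run periodically against any member, a non-null result would be what gets mailed or logged, which would let us step the wrong primary down and look into the cause.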
mmmhh, I also suggest to leave nodes up :P
OK, I am going to ask IPGarde about the script idea, but as I've said, I think having the monitoring (MMS or whatever) in place is important before setting up conditional backups.
In light of recent issues on the Kernel, and absence of a dedicated and skilled (no offense intended) ops team, may I bring back the idea of using a PaaS? (more details on the Kernel issues by mail later today)
added to the agenda of next Monday's meeting
As a follow-up: the machine holding the primary is no longer backed up, but we still have monitoring/alerting needs.
This also makes the Puppet deployment on the original primary fail, since Puppet expects that node to still be primary; the deployment itself succeeds and it is only prevented from starting again. So I've just rs.stepDown()'d the erroneous primary so that the Puppet deployment is OK this time.
The best way is probably to always snapshot the same replica and to prevent it from ever becoming master, this way: https://docs.mongodb.com/v2.6/tutorial/configure-secondary-only-replica-set-member/
The snapshotted replica should probably be the Bonneville one: preventing it from becoming master is a good thing anyway, since it is the only Datacore replica in its datacenter and would slow down the whole replica set if it were elected.
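For reference, the linked tutorial boils down to setting the member's `priority` to 0 in the replica set configuration. A minimal sketch of that change, written as a plain function over the `rs.conf()` document so it can be tried outside the shell; `makeSecondaryOnly`, the set name and the host names are made up, only `priority`, `rs.conf()` and `rs.reconfig()` come from MongoDB.

```javascript
// Hypothetical helper: mark the member with the given host as secondary-only
// by setting its priority to 0 in a replica set configuration document.
function makeSecondaryOnly(cfg, host) {
  var found = false;
  for (var i = 0; i < cfg.members.length; i++) {
    if (cfg.members[i].host === host) {
      cfg.members[i].priority = 0; // a priority 0 member can never become primary
      found = true;
    }
  }
  if (!found) {
    throw new Error("no member with host " + host);
  }
  return cfg;
}

// Sample document shaped like rs.conf() output (names are made up).
var cfg = {
  _id: "rs0",
  version: 1,
  members: [
    { _id: 0, host: "m1:27017" },
    { _id: 1, host: "m2:27017" },
    { _id: 2, host: "m3:27017" }
  ]
};

makeSecondaryOnly(cfg, "m3:27017");
console.log(cfg.members[2].priority); // 0
```

In the mongo shell, connected to the current primary, this would be applied with `rs.reconfig(makeSecondaryOnly(rs.conf(), "<host of the snapshotted replica>"))`. Note that a reconfig can itself trigger an election, so it is better done at a quiet time.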
(Brought back by @silently on 20150323)
This can be checked in the mongo client by running rs.conf().
This is not a problem for the Datacore, since its Java mongodb driver will handle failover and use the right new primary: http://stackoverflow.com/questions/21841064/mongodb-java-client-automatic-failover-failing http://docs.mongodb.org/ecosystem/drivers/java-replica-set-semantics/
Though it can cause problems in clients that don't support failover or are badly configured, such as when using robomongo to list collections:
Quick solution if you want to revert it anyway: on the new primary, run rs.stepDown() in the mongo client. Usually the "right" primary will then be elected, probably because the only other node is farther away (BV).
The probable cause is that taking VM snapshots to back up data causes micro-cuts that trigger a new primary election; VM snapshots should therefore not be taken on the primary (done for now, confirmed by @silently).
TODO
(This is a good backup method; in the long term, however, we'll still have to think about a better one, probably added alongside it, e.g. MMS...)