Closed — ongardie closed this issue 9 years ago
I agree: any non-human-controlled recovery that isn't based on perfect knowledge of the state of the infrastructure (AWS/GCE/OpenStack/etc. APIs) is doomed to cause accidental data loss.
Now, etcd will protect you from cluster misconfigurations by blocking RPCs from the old cluster, but re-seeding is still a rather unsafe operation for most use cases.
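To illustrate the kind of fencing being described: a re-seeded cluster gets a fresh cluster ID, and peer traffic carrying the old ID is rejected. A minimal sketch of such a guard, with hypothetical names (this is not etcd's actual API):

```go
package main

import "fmt"

// message is a hypothetical peer RPC carrying the sender's cluster ID.
type message struct {
	clusterID uint64
	payload   string
}

// accept drops any RPC whose cluster ID does not match the local one.
// After re-seeding, surviving members of the old cluster fail this
// check, so their stale writes cannot reach the new cluster.
func accept(localClusterID uint64, m message) bool {
	return m.clusterID == localClusterID
}

func main() {
	const newClusterID = 0xbeef
	fmt.Println(accept(newClusterID, message{clusterID: 0xbeef, payload: "put k v"})) // same cluster: true
	fmt.Println(accept(newClusterID, message{clusterID: 0xdead, payload: "put k v"})) // stale peer: false
}
```

The guard only prevents the old and new clusters from mixing; it does nothing to recover whatever the old cluster had committed.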
Yes, this indeed breaks your safety properties when a majority of the cluster is lost, in exchange for increased availability. The trade-off is documented in the administration guide. I agree that this is undesirable for some production uses of etcd, and I may make this default to off before moving it out of alpha state.
Reasons for it to be on, at least for the time being:
Can you mark it clearly in the docs as a dangerous and destructive operation? I would argue that under many use cases the data is not recomputable, or not easily recomputable. And even if it is, many pieces of software are not regularly tested for "time travel".
I agree that this should be more clearly documented! re: time travel https://github.com/coreos/etcd/issues/3879 :P
@spacejam sure, this is part of the read semantics of etcd. Having writes time travel is way more dangerous.
It's pretty application-specific :) Feel free to submit a PR for the docs!
Hey, I just stumbled across the project, and I'm concerned about the automatic re-seeding. If you can't access a majority of an etcd/raft cluster, you don't know if you've lost committed data. Rolling with it breaks the promise that etcd/raft makes to its clients, so automatically trying to recover seems like it could cause a lot of trouble. Making this the default behavior is even worse.
Do you have evidence that automating this is even necessary in practice? I like automated systems too, but I'd rather get human approval any time I'm admitting possible data loss.
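The quorum arithmetic behind this concern can be sketched quickly. In Raft, an entry is committed once a majority acknowledges it, so when only a minority is reachable, those members alone cannot tell whether some entry was committed on the lost majority. A small illustration (cluster sizes are just examples):

```go
package main

import "fmt"

// quorum returns the minimum number of members that must acknowledge
// an entry before Raft considers it committed in a cluster of size n.
func quorum(n int) int { return n/2 + 1 }

func main() {
	n := 5         // original cluster size
	reachable := 2 // members that survived the outage
	fmt.Printf("cluster of %d needs %d acks to commit\n", n, quorum(n))
	// The survivors are a minority, so an entry committed only on the
	// three unreachable members is invisible to them. Re-seeding from
	// the survivors would silently discard that committed entry.
	fmt.Println("survivors can rule out lost commits:", reachable >= quorum(n))
}
```

This is why re-seeding from a minority is not just risky in practice but unverifiable in principle: the information needed to prove nothing was lost lives on the nodes you can no longer reach.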
cc @philips