Document and implement a snapshot (backup) strategy

andrewjstone commented 7 years ago

The backend used by haret, vertree, supports snapshots. However, haret doesn't use them yet. There are a few strategies that we have discussed and that should probably be implemented.

Snapshot every N writes
Snapshot on admin user command
Snapshot every N seconds
Snapshot as fast as possible. As soon as the last snapshot completes, start another one.
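
For illustration only, the four strategies above could be modelled as a single trigger type. This is a minimal sketch, not haret or vertree code: `SnapshotTrigger`, `SnapshotState`, and `should_snapshot` are hypothetical names, and the rule that a new snapshot never starts while one is in progress is an assumption.

```rust
use std::time::{Duration, Instant};

/// Hypothetical trigger policies mirroring the four strategies listed above.
#[derive(Debug, Clone, PartialEq)]
pub enum SnapshotTrigger {
    /// Snapshot once at least this many writes have accumulated.
    EveryNWrites(u64),
    /// Snapshot only when an admin has asked for one.
    OnAdminCommand,
    /// Snapshot once this much time has passed since the last one completed.
    EveryInterval(Duration),
    /// Start a new snapshot as soon as the previous one completes.
    Continuous,
}

/// Hypothetical bookkeeping a scheduler would need to evaluate a trigger.
#[derive(Debug, Clone)]
pub struct SnapshotState {
    pub writes_since_last: u64,
    pub last_completed: Instant,
    pub in_progress: bool,
    pub admin_requested: bool,
}

impl SnapshotTrigger {
    /// Returns true if a snapshot should start now.
    /// Assumption: at most one snapshot runs at a time.
    pub fn should_snapshot(&self, state: &SnapshotState, now: Instant) -> bool {
        if state.in_progress {
            return false;
        }
        match self {
            SnapshotTrigger::EveryNWrites(n) => state.writes_since_last >= *n,
            SnapshotTrigger::OnAdminCommand => state.admin_requested,
            SnapshotTrigger::EveryInterval(interval) => {
                now.saturating_duration_since(state.last_completed) >= *interval
            }
            SnapshotTrigger::Continuous => true,
        }
    }
}
```

Several triggers could be combined by holding a list of them and snapshotting when any one returns true.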

We also need to decide:

How snapshotting is configured (config file vs. runtime, etc.)
What retention policy we want for old snapshots (keep the last 10, expire after 30 days, etc.)
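
As a rough sketch of the retention question, the two example rules (keep the last N, expire past a maximum age) could be expressed like this. `RetentionPolicy`, `SnapshotMeta`, and `expired` are hypothetical names, and treating a snapshot as expired when it violates either limit is an assumed design choice, not something decided in this issue.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical retention settings; `None` disables that limit.
pub struct RetentionPolicy {
    pub keep_last: Option<usize>,
    pub max_age: Option<Duration>,
}

/// Hypothetical metadata recorded for each snapshot.
pub struct SnapshotMeta {
    pub id: u64,
    pub created: SystemTime,
}

impl RetentionPolicy {
    /// Returns the ids of snapshots to delete, newest first.
    /// Assumption: a snapshot expires if it violates either limit.
    pub fn expired(&self, snapshots: &[SnapshotMeta], now: SystemTime) -> Vec<u64> {
        // Rank snapshots from newest to oldest.
        let mut sorted: Vec<&SnapshotMeta> = snapshots.iter().collect();
        sorted.sort_by(|a, b| b.created.cmp(&a.created));

        let mut doomed = Vec::new();
        for (rank, snap) in sorted.iter().enumerate() {
            let over_count = self.keep_last.map_or(false, |n| rank >= n);
            // A snapshot dated in the future is never treated as too old.
            let too_old = self.max_age.map_or(false, |max| {
                now.duration_since(snap.created).map_or(false, |age| age > max)
            });
            if over_count || too_old {
                doomed.push(snap.id);
            }
        }
        doomed
    }
}
```

A real policy would probably also want a guard so that the most recent snapshot is never deleted, whatever its age.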

jrgarcia commented 7 years ago

I thought about this for a little bit and I think the best way to configure this would be through the configuration file. It wouldn't even look that bad given TOML. You could have a snapshot table or even an array of tables to configure multiple strategies. Being able to pass in configuration from the CLI would be nice, but I'm not sure it's desirable, at least initially.
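
To make the idea concrete, an array of tables might look roughly like the fragment below. Every key and table name here is made up for illustration; haret defines no such schema.

```toml
# Hypothetical schema: none of these keys exist in haret's config file.
[[snapshot]]
strategy = "every_n_writes"
writes = 10000

[[snapshot]]
strategy = "interval"
seconds = 3600

# Hypothetical retention settings, kept in their own table.
[snapshot_retention]
keep_last = 10
max_age_days = 30
```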

I think those strategies take care of what almost everyone would use. Other than conditional snapshotting, I can't think of anything else. Conditional snapshotting brings in a whole other set of problems though.

andrewjstone commented 7 years ago

I'm still not sure I want to add anything to the config file. Originally the goal for that file was only to allow bootstrapping of the node. All other configuration should be settable at runtime. That, however, presents the problem that in order to save that runtime config it still needs to go somewhere. So yes, we could read and write to the config file, but that happens per node. An alternative is to use a root consensus group for configuring these things. That way each node shares the configuration. The root consensus group will probably end up being created and used for creating namespaces as well, to prevent collisions and race conditions if the same namespace is created on different nodes.

I'm actually not sure that we'll even need this much snapshot configuration anymore. We are going to be adding a WAL that will be used to log all operations, so snapshotting only needs to happen when the WAL gets garbage collected. Tuning that is going to be important, but it's tied together with logging, so it's no longer just a 'snapshot' policy. I believe this issue was opened when we weren't thinking about logging to disk and were just using the in-memory log.
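
A minimal sketch of that coupling, assuming the WAL is garbage collected once it exceeds an entry-count threshold: the snapshot is taken first, and only entries it covers are dropped. `Wal`, `Snapshot`, and `compact_if_needed` are hypothetical names, the in-memory `Vec` stands in for an on-disk log, and the threshold rule is an assumption rather than haret's design.

```rust
/// Hypothetical snapshot handle: covers every operation up to this index.
pub struct Snapshot {
    pub last_included_index: u64,
}

/// Hypothetical in-memory stand-in for a write-ahead log.
pub struct Wal {
    entries: Vec<(u64, String)>,
    next_index: u64,
}

impl Wal {
    pub fn new() -> Self {
        Wal { entries: Vec::new(), next_index: 1 }
    }

    /// Appends an operation and returns the index assigned to it.
    pub fn append(&mut self, op: &str) -> u64 {
        let index = self.next_index;
        self.entries.push((index, op.to_string()));
        self.next_index += 1;
        index
    }

    pub fn entry_count(&self) -> usize {
        self.entries.len()
    }

    /// Garbage-collects the log if it holds more than `max_entries`.
    /// The snapshot is taken before anything is dropped, and only entries
    /// at or below the snapshot's index are removed.
    pub fn compact_if_needed<F>(
        &mut self,
        max_entries: usize,
        applied_index: u64,
        take_snapshot: F,
    ) -> Option<Snapshot>
    where
        F: FnOnce(u64) -> Snapshot,
    {
        if self.entries.len() <= max_entries {
            return None;
        }
        let snapshot = take_snapshot(applied_index);
        let covered = snapshot.last_included_index;
        self.entries.retain(|(index, _)| *index > covered);
        Some(snapshot)
    }
}
```

In this shape the only knob is the garbage-collection threshold, which matches the point that tuning becomes a logging concern rather than a standalone snapshot policy.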

jrgarcia commented 7 years ago

👍 makes sense

evanmcc commented 7 years ago

I think the comment about the snapshot config makes sense. I worry about runtime-only config, though. One of the hardest things to explain to customers about Riak was that in addition to the config file, there was also runtime config which could change things quite a bit, which couldn't really be inspected easily and would be propagated to new nodes when they joined.

IMO the gold standard here is something like irssi or weechat. Both programs allow the user to set various settings either in their config file or at runtime, and either reload the config file or write configspace out to the file (ignoring defaults).

I think that I would go a step further, and have the configuration file be the backend for local and global config changes at runtime (using a very conservative model so that a node doesn't change its config, crash, then wedge itself). Ideally all of the config would be 'in one place' to make for easy inspection and learning. Riak was littered with problematic application:get_env(riak_app, variable_youve_never_heard_of, hastily_chosen_default) calls.
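
One way to picture "all of the config in one place" is a single registry of known keys with defaults, where runtime changes are overrides and only the non-default values are written back out. The sketch below is an assumption-laden illustration: `Config`, `set`, and `dump_non_defaults` are hypothetical names, values are plain strings, and the output is a simplified `key = "value"` format with no escaping.

```rust
use std::collections::BTreeMap;

/// Hypothetical config store: every known key has a declared default,
/// and runtime changes are kept as overrides on top of the defaults.
pub struct Config {
    defaults: BTreeMap<String, String>,
    overrides: BTreeMap<String, String>,
}

impl Config {
    pub fn new(defaults: &[(&str, &str)]) -> Self {
        let defaults = defaults
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        Config { defaults, overrides: BTreeMap::new() }
    }

    /// Returns the effective value: the override if set, else the default.
    pub fn get(&self, key: &str) -> Option<&str> {
        self.overrides
            .get(key)
            .or_else(|| self.defaults.get(key))
            .map(|v| v.as_str())
    }

    /// Sets a value at runtime. Unknown keys are rejected, so every
    /// setting stays discoverable through the list of defaults.
    pub fn set(&mut self, key: &str, value: &str) -> Result<(), String> {
        match self.defaults.get(key) {
            None => Err(format!("unknown config key: {}", key)),
            Some(default) if default.as_str() == value => {
                // Setting a key back to its default clears the override.
                self.overrides.remove(key);
                Ok(())
            }
            Some(_) => {
                self.overrides.insert(key.to_string(), value.to_string());
                Ok(())
            }
        }
    }

    /// Renders only the non-default settings, sorted by key.
    /// Simplification: values are not escaped.
    pub fn dump_non_defaults(&self) -> String {
        let mut out = String::new();
        for (key, value) in &self.overrides {
            out.push_str(&format!("{} = \"{}\"\n", key, value));
        }
        out
    }
}
```

Writing the dump to disk safely (for example to a temporary file that is then renamed into place) would be part of the conservative model mentioned above and is left out here.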

Anyway, I have a lot of feelings here without concrete plans, but: making configuration obvious, inspectable, and discoverable seems to me to be crucially important.

vmware-archive / haret

Document and implement a snapshot (backup) strategy #70