There are a number of different classes of services that folks run that need to be highly available (HA). This runs the gamut and generally includes databases (e.g. MySQL, CockroachDB, Postgres, MSSQL) and systems that have built-in consensus (Raft, ZooKeeper, etc.).
Most of these boil down to one request: I would like to ensure that the instances of this thing don't all end up in the same failure domain. For most smaller deployments (e.g. a single rack), this failure domain is probably the individual compute sled. For a multi-rack scenario, the failure domain may be larger, extending to the rack level or even the cell level.
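To make the request concrete, here is a minimal sketch of what a strict "don't share a failure domain" placement check could look like. Everything in it is hypothetical and for illustration only: the `Level`, `Sled`, and `place` names are not existing Nexus types or APIs, and a real design would also need to cover soft (best-effort) constraints, capacity, and rebalancing.

```rust
use std::collections::HashSet;

/// Hypothetical failure-domain granularity (not an existing Nexus type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Level {
    Sled,
    Rack,
    Cell,
}

/// Hypothetical location of a compute sled within a cell and rack.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Sled {
    cell: u32,
    rack: u32,
    sled: u32,
}

impl Sled {
    /// Key identifying the failure domain this sled belongs to at `level`.
    /// Coarser levels ignore the finer-grained fields.
    fn domain(&self, level: Level) -> (u32, Option<u32>, Option<u32>) {
        match level {
            Level::Cell => (self.cell, None, None),
            Level::Rack => (self.cell, Some(self.rack), None),
            Level::Sled => (self.cell, Some(self.rack), Some(self.sled)),
        }
    }
}

/// Return the first candidate sled whose failure domain at `level` is not
/// already occupied by a member of the group, or `None` if every candidate
/// would share a domain with an existing member.
fn place(candidates: &[Sled], group: &[Sled], level: Level) -> Option<Sled> {
    let used: HashSet<_> = group.iter().map(|s| s.domain(level)).collect();
    candidates
        .iter()
        .copied()
        .find(|s| !used.contains(&s.domain(level)))
}
```

With one group member already on a given sled, `place` at `Level::Sled` would accept any other sled in the same rack, while `Level::Rack` would require a sled in a different rack. Whether the constraint should be strict (fail the placement) or best-effort is one of the open questions for the RFD.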
This issue tracks working out what this means for Nexus, taking into consideration what others have done, and eventually writing up an RFD on it that accounts for a multi-rack future.