Write up a summary of AOF, Replica and Multi-AZ failover and recommendations based on what I learn or at least talking points to why we would or wouldn't leverage these mechanisms.
Issues/Questions:
When looking through AWS documentation, I found some points worth mentioning:
Append-only files (AOF) are not supported for cache.t1.micro and cache.t2.* nodes. For nodes of these types, the appendonly parameter value is ignored.
For Multi-AZ replication groups, AOF is disabled.
AOF is not supported on Redis versions 2.8.22 and later.
Note: I have no idea how current this information is. Redis 5.0 is out, so this feels out of date?
When using replicas within a cluster, there is some replication delay, so there may be some data loss on failover. That may be perfectly fine if Redis is being used as a pure caching layer, but not for our purposes.
What are "our purposes"? If we're not using Redis as a pure caching layer, how are we expecting to use it?
I'd rather have TQM detect that the redis node failed and figure out how to repopulate it using the TFE DB as the source of truth.
I would love to hear more about why we believe that a full failure is preferable to a potential partial loss of data. I understand that each of these resiliency measures has different pros and cons, but a full failure seems less ideal, especially if we can use some of these (possibly multiple) to mitigate some portion of that full failure.
If we did rely on TQM to detect a node failure, how would we spin up the new node? Going based off of the note from the backlog grooming meeting, it sounds as if this will start out as a manual process. Do we have any estimate of how often these failures happen and would require human intervention to provision a new node? Whose responsibility is this work?
General question: how did we decide on using redis?
Research:
RDB (Redis database backups) vs. AOF (append-only file)
RDB provides point-in-time snapshots of Redis data, which can be used to recover quickly after a disaster. It allows a quicker restart than AOF. Since a snapshot is a single compact file, it can be sent to other data centers or even to S3.
RDB doesn't provide continuous persistence. If we take a snapshot at an interval of 1 hour, we could still lose up to an hour of data if a failure happened between RDB snapshots.
AOF feels much more flexible than RDB. You can set fsync policies as granular as once per second or on every write. This means we'd lose at most a second of data rather than the interval between RDB snapshots; fsync defaults to 1-second intervals. Redis can rewrite an AOF file if it starts to become too large, creating a new file while appending to the old one until the new one is ready. AOF keeps a log of all the operations one after the other, and can be exported or easily digested.
the AOF persistence logs every write operation received by the server, that will be played again at server startup, reconstructing the original dataset.
AOF can be slower than RDB (performance is still high with an fsync policy of 1 second, and should be comparable to RDB with no fsync policy set).
While both have pros and cons, my understanding is that using these two persistence mechanisms together could give more overall coverage and cut down on data lost or the time it takes to restart Redis.
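As a sketch of what combining the two could look like in Terraform: AOF enabled via a custom parameter group, plus daily RDB-style automatic snapshots on the cluster. All names, node types, and windows below are placeholder assumptions, not settled decisions.

```hcl
# Hypothetical sketch only: enable AOF through a parameter group and keep
# daily automatic snapshots for RDB-style recovery. Names are placeholders.
resource "aws_elasticache_parameter_group" "redis_aof" {
  name   = "tfe-redis-aof"
  family = "redis2.8" # per the AWS notes above, AOF support depends on engine version

  parameter {
    name  = "appendonly"
    value = "yes"
  }

  parameter {
    name  = "appendfsync"
    value = "everysec" # fsync once per second; at most ~1s of writes lost
  }
}

resource "aws_elasticache_cluster" "tfe_redis" {
  cluster_id           = "tfe-redis"
  engine               = "redis"
  node_type            = "cache.m4.large" # AOF is ignored on t1/t2 node types
  num_cache_nodes      = 1
  parameter_group_name = aws_elasticache_parameter_group.redis_aof.name

  # RDB-style automatic backups: keep 7 daily snapshots
  snapshot_retention_limit = 7
  snapshot_window          = "05:00-06:00"
}
```

Note the caveat from the AWS docs below: even with AOF on, a hardware-level failure that moves the node to a new server loses the AOF file anyway.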
AOF cannot protect against all failure scenarios. For example, if a node fails due to a hardware fault in an underlying physical server, ElastiCache will provision a new node on a different server. In this case, the AOF file will no longer be available and cannot be used to recover the data. Thus, Redis will restart with a cold cache.
Replicas
If running the Redis engine, you can group 2 to 6 nodes into a cluster with replicas, where 1 to 5 read-only nodes contain replicated data from the group's single read/write primary node.
If we had multiple nodes, one or more of the other nodes (other than the primary read/write node) would be able to mitigate data loss from the primary node failure since the data has been replicated to another node in the cluster.
Data can still be lost when the primary read/write node fails.
You can use Redis (cluster mode disabled) clusters with replica nodes to scale your Redis solution for ElastiCache to handle applications that are read-intensive or to support large numbers of clients that simultaneously read from the same cluster.
All of the nodes in a Redis (cluster mode disabled) cluster must reside in the same region. To improve fault tolerance, you can provision read replicas in multiple Availability Zones within that region.
A Redis cluster (cluster mode disabled) has a single shard, encompassing multiple Redis nodes: one primary read/write, with the option of adding up to 5 read replicas.
Each read replica maintains a copy of the data from the cluster's primary node.
When you add a read replica to a cluster, all of the data from the primary is copied to the new node. From that point on, whenever data is written to the primary, the changes are asynchronously propagated to all the read replicas.
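As a hedged Terraform sketch of this single-shard, cluster-mode-disabled layout (resource names, node type, and replica count are all assumptions):

```hcl
# Hypothetical single-shard setup: one read/write primary plus two read
# replicas, cluster mode disabled. Names and sizes are placeholders.
resource "aws_elasticache_replication_group" "tfe_redis" {
  replication_group_id          = "tfe-redis"
  replication_group_description = "TFE Redis: 1 primary + 2 read replicas"
  engine                        = "redis"
  engine_version                = "5.0.0"
  node_type                     = "cache.m4.large"
  port                          = 6379

  # Total nodes in the group: the first is the read/write primary, the
  # rest are asynchronously-updated read replicas (6 nodes max in total).
  number_cache_clusters = 3
}
```

Reads can then be spread across the replica endpoints, while all writes go to the primary endpoint.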
Multi-AZ with automatic failover
If the primary node or its Availability Zone fails, ElastiCache for Redis can detect the failure and fail over to a read replica.
Multi-AZ with automatic failover is only available in Redis clusters where replicas are enabled.
From failure to promotion, failover typically completes within sixty seconds. This process is much faster than recreating and provisioning a new primary, which is the process if you don't enable Multi-AZ with automatic failover.
There's a possibility that replication lag becomes worse when Multi-AZ automatic failover is enabled. Lag is already a concern with replicas in general, and this could exacerbate the issue.
ElastiCache for Redis Multi-AZ with Automatic Failover and append-only file (AOF) are mutually exclusive. If you enable one, you cannot enable the other.
For greater reliability and faster recovery, we recommend that you create one or more read replicas in different Availability Zones for your cluster, and enable Multi-AZ on the replication group instead of using AOF.
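Following that AWS recommendation, a hedged Terraform sketch of Multi-AZ with automatic failover might look like the following (names, node type, and AZs are placeholder assumptions):

```hcl
# Hypothetical Multi-AZ setup: replicas spread across Availability Zones
# with automatic failover enabled (which means AOF must stay disabled).
resource "aws_elasticache_replication_group" "tfe_redis_ha" {
  replication_group_id          = "tfe-redis-ha"
  replication_group_description = "TFE Redis with Multi-AZ automatic failover"
  engine                        = "redis"
  node_type                     = "cache.m4.large"
  number_cache_clusters         = 3
  automatic_failover_enabled    = true # requires at least one read replica

  # Spread the primary and replicas across AZs so a single-AZ outage
  # still leaves a replica available to promote.
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
```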
If we do go with Sander's route of having a redis cluster (cluster mode disabled) with 1-5 read replicas, would we keep them all in the same availability zone?
Thoughts:
I can understand the desire to stick to one shard for a variety of reasons:
We have not reached peak capacity for this ElastiCache Redis instance. When tested with a much smaller cache node, we did not encounter any issues.
Sharding seems unnecessary for our current use case, both space-wise and maintenance-wise.
Sharding can introduce more steps and places for information drift to occur. It can be difficult to do sharding efficiently, and I think that this particular issue can be solved much more cleanly with a single shard for those Redis nodes.
I may not have enough context to understand why a full failure detected by TQM is preferable to a partial failure with the potential to lose smaller windows of data.
If we do utilize one/some of these resiliency measures, how do we track what data if any was lost during a primary node failure?
Is introducing partial failure coverage worth the work to keep track of the data that's lost and attempting to remedy that/communicate with the customer/possible support tickets?
If we do elect to allow full failures, what are the consequences of that? If a redis node is down for ~1 hour, what are the effects of that?
How are we planning on monitoring the health of the primary Redis node if we don't use automatic failover or replicas? *Not saying we should use these things, but we'll have to have monitoring or health checks in place to confirm that the node is healthy. I believe Robert noted that TQM would acknowledge a failure. Is this something we'd have to build in addition to the new ElastiCache instance?
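Even if TQM owns failure handling, we could at least page on node trouble from day one. A hypothetical CloudWatch alarm sketch in Terraform (topic name, cluster id, and thresholds are all placeholders):

```hcl
# Hypothetical health alarm: page someone when the node's CPU is pegged,
# and treat a node that stops reporting metrics as unhealthy too.
resource "aws_sns_topic" "redis_alerts" {
  name = "tfe-redis-alerts" # placeholder; wire to PagerDuty/Slack/etc.
}

resource "aws_cloudwatch_metric_alarm" "redis_unhealthy" {
  alarm_name          = "tfe-redis-unhealthy"
  namespace           = "AWS/ElastiCache"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  threshold           = 90
  comparison_operator = "GreaterThanOrEqualToThreshold"

  dimensions = {
    CacheClusterId = "tfe-redis-001" # placeholder node id
  }

  # Missing data (node gone silent) also triggers the alarm.
  treat_missing_data = "breaching"
  alarm_actions      = [aws_sns_topic.redis_alerts.arn]
}
```

This only covers detection; repopulating from the TFE DB would still be a separate (manual?) step.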
Many of these resiliency features do not work together.
You cannot use AOF and multi-AZ automatic failover together. When one is enabled, you cannot use the other.
Replicas and multi-AZ automatic failover can be used together, but we (as a greater team) need to decide whether or not using replicas is the route we want to take. Replicas can cause latency with larger data sets, and adding multi-AZ automatic failover could exacerbate that latency. We can still lose data if the primary read/write node fails, so we'd need a way to keep track of the timeframe/data that we lose.
Depending on what version of redis we use, we could possibly not even be able to use AOF from the get-go.
While using some in tandem could create a more protected situation in case of a failure, I'm not sure how effective partial failure coverage can be.
What are the criteria for success and what is unacceptable (data loss, how much, certain windows of time)?
I think once we answer mainly these questions, the solution we should use will become clearer.
Where I left off yesterday:
Daily To-Do List:
Look at this Asana https://app.asana.com/0/974609005970545/1112250927640064/f
Research and understand:
Redis Docs: https://redis.io/
AWS Docs: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/FaultTolerance.html https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoFailover.html https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Replication.html
Terraform docs: https://www.terraform.io/docs/providers/aws/r/elasticache_cluster.html