linagora / tmail-backend

GNU Affero General Public License v3.0
Chaos testing: break down Redis #1013

Open quantranhong1999 opened 7 months ago

quantranhong1999 commented 7 months ago

Why

Expectation: TMail core services should remain largely unaffected by a Redis outage.

How

Experiment on preprod to see what happens to a TMail deployment when Redis is down.

Some related Redis features:

Identify issues and propose enhancements to help TMail deployment be more fault tolerant and resilient.

quantranhong1999 commented 7 months ago

cc @chibenwa

vttranlina commented 7 months ago

Local test result:

| Feature | Status |
|---|---|
| Rate limiting | :warning: |
| Rspamd | :green_heart: |
| SSO via Apisix | :red_circle: |
| Jmap/ Redis event bus | :red_circle: |

Rate limiting and Rspamd mailet

Note that we have already declared <onMailetException>ignore</onMailetException> (in the mailetcontainer.xml file), so exceptions within a mailet do not disrupt the rest of the mailet pipeline.

Rate limiting

The PerRecipientRateLimit and PerSenderRateLimit mailets were waiting for a response from Redis. I have not yet looked up the default timeout in the ratelimitj library. I observed that it was simply waiting, and after a few minutes, I "un-paused" Redis, then the mailet continued and executed successfully (without any exceptions logged).

I modified the source code to enforce a timeout through Reactor (e.g. 10 seconds). In this case, the mailet threw an exception, but it was ignored, and the next mailet in the pipeline continued execution. The recipient received the email from the sender successfully.

// This feature is flagged with a warning because the timeout for this mailet should be configurable
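
The behaviour described above (a bounded wait on the rate-limit lookup, with the resulting exception ignored so that the next mailet still runs) can be sketched as follows. This is a minimal Python illustration, not TMail or Reactor code: `rate_limit_check`, `run_pipeline`, and the delay values are hypothetical stand-ins.

```python
import asyncio


async def rate_limit_check(redis_delay_s):
    """Hypothetical rate-limit lookup; the sleep simulates a paused Redis."""
    await asyncio.sleep(redis_delay_s)
    return "ACCEPTABLE"


async def run_pipeline(mail, redis_delay_s, timeout_s):
    """Run a rate-limit step, then a delivery step.

    The rate-limit step is bounded by timeout_s. Any exception it raises is
    recorded and ignored, mirroring an 'ignore' policy on mailet exceptions.
    """
    trace = []
    try:
        await asyncio.wait_for(rate_limit_check(redis_delay_s), timeout=timeout_s)
        trace.append("rate-limit: ok")
    except Exception as e:  # assumption: every failure of this step is ignored
        trace.append(f"rate-limit: ignored {type(e).__name__}")
    # The next step runs regardless of the outcome above.
    trace.append(f"delivered: {mail}")
    return trace


if __name__ == "__main__":
    # Redis "paused" for 1 s, but the step only waits 50 ms.
    print(asyncio.run(run_pipeline("m1", redis_delay_s=1.0, timeout_s=0.05)))
    # Redis answers immediately.
    print(asyncio.run(run_pipeline("m2", redis_delay_s=0.0, timeout_s=0.5)))
```

Without the timeout, the first call would simply wait for the full delay, which matches the "just waiting" observation when Redis is paused.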

Rspamd

The RspamdScanner mailet behaves slightly differently. After a few seconds (faster than the rate limiting mailets), the RspamdScanner finishes processing on its own. I checked the Rspamd logs and found the message cannot get ANNs list from redis: timeout while connecting the server, which Rspamd logs at a normal (non-error) level. There were no errors or disruptions.

SSO via Apisix

It is not possible to log in or log out via SSO. The tmail-apisix-plugin-runner plugin is responsible for this issue. Checking whether the "token" has been revoked (by querying Redis) causes the request to hang and eventually time out.

Error occurred at: com.linagora.apisix.plugin.RedisRevokedTokenRepository.exist(RedisRevokedTokenRepository.java:24)

Jmap/ Redis event bus

It is not possible to receive a response from the JMAP methods Email/send and Email/set + EmailSubmission/set (the methods that use the queue). The client waits for a response for several minutes (I do not know how long the server takes to return an error; I waited for more than 5 minutes without a response, then stopped).

Another related exception:

2024-04-24T10:03:58.401397480Z io.netty.handler.timeout.ReadTimeoutException: null
2024-04-24T10:03:58.401663224Z 10:03:58.401 [ERROR] o.a.j.e.GroupConsumerRetry - Exception happens when handling event after 1 retries

Last: we cannot start a new TMail instance (or restart an existing one) while Redis is stopped.

chibenwa commented 7 months ago

Just to be sure, did you rely on a Redis cluster for those tests? Or did you work with a single container?

When Redis is used as a pub-sub component for Apache James, some level of reliance on Redis is IMO acceptable: it would be OK to tolerate failures of, and a dependency on, Redis for that pub-sub component.

Rate limiting and Rspamd mailet

// This feature is flagged with a warning because the timeout for this mailet should be configurable

+1 for the timeout in the RateLimiting mailets configuration, none by default.

I do not understand why the Lettuce driver does not handle the timeout itself. They document a default timeout of 60 seconds. We need to understand why that is not what we observe, IMO.

We could also configure by default this mailet to ignore exceptions...

SSO via Apisix

Tested with a redis cluster?

We might want to add a parameter ignoreRedisErrors defaulting to false. If turned to true, if redis fails we ignore the logout flow, effectively preserving the service at a cost of a bit of security. This seems like a valuable tool to have in the tool box for bad day situations...
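
A minimal sketch of that trade-off, assuming a hypothetical `ignore_redis_errors` flag and a hypothetical `RedisUnavailable` error type (neither is taken from the actual plugin code):

```python
class RedisUnavailable(Exception):
    """Hypothetical error raised when the revoked-token store is unreachable."""


class RevokedTokenChecker:
    """Checks token revocation, optionally failing open when Redis is down."""

    def __init__(self, lookup, ignore_redis_errors=False):
        # lookup: callable(token) -> bool, expected to query Redis.
        self._lookup = lookup
        self._ignore_redis_errors = ignore_redis_errors

    def is_revoked(self, token):
        try:
            return self._lookup(token)
        except RedisUnavailable:
            if self._ignore_redis_errors:
                # Fail open: treat the token as not revoked. This keeps login
                # working, at the cost of honouring tokens that were revoked.
                return False
            # Default (fail closed): surface the error to the caller.
            raise


def broken_lookup(token):
    """Simulates a Redis outage."""
    raise RedisUnavailable("cannot reach Redis")


if __name__ == "__main__":
    lenient = RevokedTokenChecker(broken_lookup, ignore_redis_errors=True)
    print(lenient.is_revoked("some-token"))
```

In practice the lookup would also need a short timeout, otherwise the request hangs before the flag has any effect.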

Jmap/ Redis event bus

If operating on top of a lone redis, then failure at the JMAP level seems ok to me at first glance.

However, failing to time out cleanly IS an issue. If Redis is KO those JMAP requests should fail fast, in the 5s range IMO.

Last: we cannot start a new TMail instance (or restart an existing one) while Redis is stopped.

That's indeed a problem: we should be able to reboot TMail (when not using Redis for pub sub).

If using Redis for pub sub, then failing to start James would be acceptable if Redis is down...

Thoughts?

vttranlina commented 7 months ago

I used a single Redis container for the test.

Is the Redis cluster (master-slave) on staging k8s enough for what we want?

vttranlina commented 7 months ago

I checked staging: the topology is 1 master + 2 replicas. The TMail configuration uses the Redis master's endpoint. => Testing against it is no different from testing against a single-node container.

quantranhong1999 commented 7 months ago

I checked staging: the topology is 1 master + 2 replicas. The TMail configuration uses the Redis master's endpoint.

A bit more explanation on that. Before the Redis event bus key work, we pointed the Redis endpoint at the Redis service endpoint (K8s can load balance to either the master or a slave). After the Redis event bus key work, some related PUBSUB commands need to execute against a writable node such as the master. So I recently changed the Redis endpoint to point directly at the master node.

That does not sound good, actually. Alternatives I can think of:

Arsnael commented 7 months ago

Redis Cluster (multi-master)

Now that I think about it again, wasn't there an issue using the Redis cluster with one of our components? Maybe Apisix?

chibenwa commented 7 months ago

Is the Redis cluster (master-slave) on staging k8s enough for what we want?

No, ideally a Redis Cluster should be used for testing.

IMO redis topology shall be...

vttranlina commented 7 months ago

One redis cluster for all our use cases

How many masters in the Redis cluster?

chibenwa commented 7 months ago

3 node cluster

vttranlina commented 7 months ago

Redis-cluster lab (local)

Docker-compose lab:

1. Redis cluster: 3 master nodes, 0 replica nodes

(building a cluster requires a minimum of 3 master nodes). Example before any node goes down:

When node1 goes down:

2. Redis cluster: 3 master nodes, 3 replica nodes

Scenario sample:

Node1 (master) - Node4 (replica)
Node2 (master) - Node5 (replica)
Node3 (master) - Node6 (replica)

During this time, monitoring the Redis logs, there will be logs like:

Cluster state changed: fail
....
Cluster state changed: ok

Tmail-backend and Redis cluster

Rspamd

Rate limiting, Jmap/ Redis event bus

{
    "sessionState": "2c9f1b12-b35a-43e6-9af2-0106fb53a943",
    "methodResponses": [
        [
            "Email/send",
            {
                "accountId": "b0d9e55c1a2682586469bc2a23dbb2c671e138ee61e0362972fd7c3d265ea9b2",
                "newState": "2c9f1b12-b35a-43e6-9af2-0106fb53a943",
                "notCreated": {
                    "K87": {
                        "type": "serverFail",
                        "description": "CLUSTERDOWN The cluster is down"
                    }
                }
            },
            "c1"
        ]
    ]
}

Related error log regarding the Lettuce library:

io.lettuce.core.RedisCommandExecutionException: CLUSTERDOWN The cluster is down

Warning log when starting TMail:

06:17:33.031 [WARN ] i.l.c.c.t.DefaultClusterTopologyRefresh - Unable to connect to [redis1/<unresolved>:6379]: Connection initialization timed out after 1 minute(s)
2024-04-26T06:17:33.031475228Z io.lettuce.core.RedisCommandTimeoutException: Connection initialization timed out after 1 minute(s)
PeriodicalHealthChecks - DEGRADED: Redis: Can not connect to Redis.

Another note:

quantranhong1999 commented 7 months ago

Interesting experiment.

Within the first 1-60 seconds: If Client-Side Sharding routes data to down-node => waiting. If Client-Side Sharding routes to up-node => response created normally.

So can TMail reconnect to the Redis Cluster once the cluster is back to normal?
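
The client-side sharding observation quoted above follows from how Redis Cluster maps keys to nodes: each key hashes to one of 16384 slots (CRC16 of the key, modulo 16384), and each master owns a range of slots. The sketch below computes the slot for a key and looks up its owner. The slot ranges and node names are assumptions (an even three-way split, as a freshly created 3-master cluster typically has), and the model only covers routing in the window before the cluster reports itself as down.

```python
import binascii

SLOT_COUNT = 16384

# Assumed slot ownership for a 3-master cluster; real ranges come from CLUSTER SLOTS.
SLOT_RANGES = {
    "redis1": range(0, 5461),
    "redis2": range(5461, 10923),
    "redis3": range(10923, 16384),
}


def key_slot(key):
    """Return the Redis Cluster hash slot for a key, honouring {hash tags}."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' is used.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 (XMODEM) variant Redis uses.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


def owner(key):
    """Return the name of the (assumed) master owning the key's slot."""
    slot = key_slot(key)
    return next(node for node, slots in SLOT_RANGES.items() if slot in slots)


def is_served(key, down_nodes):
    """True if the key routes to a master that is still up."""
    return owner(key) not in down_nodes


if __name__ == "__main__":
    for key in ("foo", "bar", "hello"):
        print(key, key_slot(key), owner(key), is_served(key, {"redis1"}))
```

With one master down and no replica to take over, only the keys whose slot belongs to that master hang, which is why some requests succeed and others wait.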

A new error related to using Redis Cluster, even when the Redis Cluster is up and running normally, is the RedisHealthCheck error:

I cannot understand this. The RedisHealthCheck is supposed to create a new connection for every check, so it should notice that the Redis Cluster is healthy again.

The current "," separator between Redis nodes in redis.properties does not work. I don't know why, but parsing only picks up the first URL. I tried replacing it with ";" and updating RedisUris.from accordingly, and then it worked normally.

Don't forget to submit a fix for it ^^
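
For illustration, splitting a multi-node property value with a configurable delimiter could look like the sketch below. This is a hypothetical Python helper, not the actual RedisUris.from implementation, and it does not explain why the "," separator fails in TMail; the property value shown is an assumption.

```python
from typing import List
from urllib.parse import urlparse


def parse_redis_uris(value: str, delimiter: str = ",") -> List[str]:
    """Split a property value into Redis URIs and validate each one."""
    uris = [part.strip() for part in value.split(delimiter) if part.strip()]
    if not uris:
        raise ValueError("No Redis URI configured")
    for uri in uris:
        parsed = urlparse(uri)
        # Accept plain and TLS schemes only, and require a host.
        if parsed.scheme not in ("redis", "rediss") or not parsed.hostname:
            raise ValueError(f"Invalid Redis URI: {uri!r}")
    return uris


if __name__ == "__main__":
    # Hypothetical property value with three cluster nodes.
    print(parse_redis_uris("redis://redis1:6379;redis://redis2:6379;redis://redis3:6379", ";"))
```

A test that asserts on the number of parsed nodes, rather than only on the first one, would catch the "only the first URL is accepted" symptom.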

Arsnael commented 7 months ago

So what I understand is that we can't use Redis Cluster with Rspamd, correct? Same for Sentinel, I would guess, if you can only point to one endpoint?

Or maybe the headless endpoint with k8s that redirects to all redis pods addresses would do the trick?

quantranhong1999 commented 7 months ago

Same for Sentinel, I would guess, if you can only point to one endpoint?

Rspamd does support Redis Sentinel: https://rspamd.com/doc/configuration/redis.html

I am unsure about Redis Cluster, as I do not see it mentioned in the Rspamd documentation.

chibenwa commented 7 months ago

I am unsure about Redis Cluster, as I do not see it mentioned in the Rspamd documentation.

I remember it being unsupported, as it lacked some Redis commands.

chibenwa commented 7 months ago

Some summary:

chibenwa commented 7 months ago

Some questions:

vttranlina commented 7 months ago

What configuration parameter can be used to trigger the fallback?

Open redis-cli and run CLUSTER FAILOVER FORCE on a replica node. Ref: https://redis.io/docs/latest/commands/cluster-failover/

Shall we lower the trigger to say 10 seconds?

My opinion: lower is better. The default configuration is 15 seconds. Ref: https://raw.githubusercontent.com/redis/redis/7.2/redis.conf

I really wonder if we shall not ignore failures upon key dispatch. It would not be that bad. We could make this configurable in the Redis conf? Because losing the ability to send email if Redis is down does not seem like a nice property to me!

+1

The key dispatch by Redis for the notification feature is not critical.

chibenwa commented 7 months ago

Shall we lower the trigger to say 10 seconds?

What is the impact of a false positive, i.e. you fall back when nothing is wrong?

vttranlina commented 7 months ago

What is the impact of a false positive, i.e. you fall back when nothing is wrong?

Whether or not the master node is down, if we run CLUSTER FAILOVER FORCE on a replica node, that replica is promoted to master immediately. The old master becomes a replica.

chibenwa commented 7 months ago

What is the impact of a false positive, i.e. you fall back when nothing is wrong?

That was not the question.

Upon a master slave failover...

... do we lose unreplicated data?

... How long does the failover take?

... Are there other side effects?

Based on these answers we might want to set a low value, or a defensive value to prevent too-frequent switches...

vttranlina commented 7 months ago

Upon a master slave failover...

... do we lose unreplicated data?

Yes.

Example cases:

  1. Data loss due to asynchronous replication:

    • Client C writes data1 to master A.
    • A acknowledges the write to C.
    • A goes down before it has replicated data1 to its replica A1 (replication is asynchronous).
    • As a result, A1 is promoted to master, and data1, which was never replicated to A1, is lost.
  2. Data loss due to a partition:

    • The cluster is partitioned. Site 1: master A + client C. Site 2: masters B and C, replicas A1, B2, C1.
    • During the partition (before node_timeout elapses), client C successfully writes data2 to master A.
    • When the partition is resolved, data2 written by client C is lost if A1 was promoted to master on Site 2 in the meantime, because A then rejoins as a replica of A1.

// The Redis documentation states that Redis Cluster does not guarantee strong consistency
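
The asynchronous replication case can be reproduced with a toy model. The sketch below is not Redis code: the `ReplicatedShard` class and its methods are hypothetical, and it only models "acknowledge first, replicate later".

```python
class ReplicatedShard:
    """Toy model of one master (A) with one asynchronous replica (A1)."""

    def __init__(self):
        self.master = {}
        self.replica = {}
        self._backlog = []  # writes acknowledged but not yet replicated

    def write(self, key, value):
        """Apply the write on the master and acknowledge before replicating."""
        self.master[key] = value
        self._backlog.append((key, value))
        return "OK"

    def replicate(self):
        """Ship the pending writes to the replica."""
        for key, value in self._backlog:
            self.replica[key] = value
        self._backlog.clear()

    def failover(self):
        """The master dies; the replica is promoted with whatever it holds."""
        lost = [key for key, _ in self._backlog]
        self.master = self.replica
        self.replica = {}
        self._backlog.clear()
        return lost


if __name__ == "__main__":
    shard = ReplicatedShard()
    shard.write("data0", "v0")
    shard.replicate()           # data0 reaches the replica
    shard.write("data1", "v1")  # acknowledged, but still in the backlog
    print("lost on failover:", shard.failover())
    print("new master holds:", sorted(shard.master))
```

The client saw "OK" for both writes, yet only the write that was replicated before the failover survives.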

... How long does the failover take?

The configured cluster-node-timeout plus a buffer of about 2 seconds.

Redis documentation:

the cluster becomes available again after NODE_TIMEOUT time plus a few more seconds required for a replica to get elected and failover its master (failovers are usually executed in a matter of 1 or 2 seconds).
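
As a rough worked example of that estimate (a sketch only; the 2-second election buffer is the approximation given above, not a guaranteed bound, and the function name is hypothetical):

```python
def estimated_unavailability_s(cluster_node_timeout_ms, election_buffer_s=2.0):
    """Approximate time before a failed master's slots are served again."""
    return cluster_node_timeout_ms / 1000 + election_buffer_s


if __name__ == "__main__":
    for timeout_ms in (15000, 10000):
        print(timeout_ms, "ms ->", estimated_unavailability_s(timeout_ms), "s")
```

With the 15-second default this gives roughly 17 seconds of unavailability for the affected slots, and roughly 12 seconds with a 10-second timeout.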

Related to

Shall we lower the trigger to say 10 seconds?

Updated answer: A longer duration would be preferable in the event of a partition where the client sits on the same site as the isolated master node, since no data is lost if the partition heals before the timeout triggers a failover. For the "asynchronous replication" case above, it can also help preserve data that has not yet been replicated, but the trade-off is longer downtime.