quantranhong1999 opened 7 months ago
cc @chibenwa
| Feature | Status |
|---|---|
| Rate limiting | :warning: |
| Rspamd | :green_heart: |
| SSO via Apisix | :red_circle: |
| JMAP / Redis event bus | :red_circle: |
It should be noted that we have already declared `<onMailetException>ignore</onMailetException>` (in the `mailetcontainer.xml` file), so exceptions within a mailet do not disrupt the rest of the mailet pipeline.
The `PerRecipientRateLimit` and `PerSenderRateLimit` mailets were stuck waiting for a response from Redis. I have not yet looked up the default timeout in the ratelimitj library. I observed that the mailet simply kept waiting; after a few minutes I "un-paused" Redis, and the mailet then continued and executed successfully (without any exceptions logged).
I attempted to modify the source code to override the timeout via Reactor (e.g. set to 10 seconds). In this case the mailet threw an exception, but it was ignored and the next mailet in the pipeline continued execution; the recipient still received the email from the sender successfully.
// A warning flag has been set for this feature because the timeout (and the resulting exception) for this mailet should be configurable
The `RspamdScanner` mailet behaves slightly differently. After a few seconds (faster than the rate-limit mailets), `RspamdScanner` finishes processing on its own. I checked the Rspamd logs and found the message `cannot get ANNs list from redis: timeout while connecting the server`, which Rspamd marks as a normal log entry. There were no errors or disruptions.
It is not possible to log in or log out via SSO. The `tmail-apisix-plugin-runner` plugin is responsible for this issue: checking whether the token has been revoked (by querying Redis) causes the process to hang and eventually time out.
Error occurred at: `com.linagora.apisix.plugin.RedisRevokedTokenRepository.exist(RedisRevokedTokenRepository.java:24)`
It is not possible to receive a response from the JMAP methods `Email/send` and `Email/set` + `EmailSubmission/set` (the methods using the queue). The client waits for a response for several minutes (the exact duration before the server returns an error is unknown; I waited for more than 5 minutes and it was still waiting, so I stopped).
Another related exception:
```
2024-04-24T10:03:58.401397480Z io.netty.handler.timeout.ReadTimeoutException: null
2024-04-24T10:03:58.401663224Z 10:03:58.401 [ERROR] o.a.j.e.GroupConsumerRetry - Exception happens when handling event after 1 retries
```
Lastly: we cannot start (or restart) a TMail instance while Redis is stopped.
Just to be sure: did you rely on a Redis cluster for those tests, or did you work on a single container?
Since we use Redis as a pub/sub component for Apache James, some level of reliance on Redis is IMO acceptable: it would be OK to tolerate failures / a dependency on Redis for that pub/sub component.
Rate limiting and Rspamd mailet
// A warning flag has been set for this feature because the timeout (and the resulting exception) for this mailet should be configurable
+1 for the timeout in the RateLimiting mailets configuration, none by default.
I do not understand why the Lettuce driver does not handle the timeout itself. They document a default timeout of 60 seconds. We need to understand why that is not applied, IMO.
We could also configure by default this mailet to ignore exceptions...
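A hedged sketch of what such a configuration could look like in `mailetcontainer.xml`. The `redisTimeout` parameter name is hypothetical (it does not exist today); only `onMailetException` is real, as quoted earlier in the thread:

```xml
<!-- Illustrative only: a hypothetical, explicitly configured Redis timeout
     for the rate-limiting mailets (no timeout by default). -->
<mailet match="All" class="PerSenderRateLimit">
    <redisTimeout>10s</redisTimeout>
    <!-- existing behavior: a mailet exception does not break the pipeline -->
    <onMailetException>ignore</onMailetException>
</mailet>
```

With such a parameter, a Redis outage would turn into a fast, ignored exception instead of a multi-minute hang.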
SSO via Apisix
Tested with a redis cluster?
We might want to add a parameter `ignoreRedisErrors` defaulting to false. If turned to true, then when Redis fails we skip the revoked-token check in the logout flow, effectively preserving the service at the cost of a bit of security.
This seems like a valuable tool to have in the tool box for bad day situations...
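A minimal sketch of what that fallback could look like. The class, the `ignoreRedisErrors` flag and the 5-second bound are all illustrative, not the actual plugin code (the real lookup lives in `RedisRevokedTokenRepository.exist`):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: bound the revoked-token lookup with a timeout and,
// when ignoreRedisErrors is true, treat a Redis failure as "not revoked"
// so the SSO flow stays up at the cost of a bit of security.
public class RevokedTokenCheck {
    private final boolean ignoreRedisErrors;

    public RevokedTokenCheck(boolean ignoreRedisErrors) {
        this.ignoreRedisErrors = ignoreRedisErrors;
    }

    public boolean exist(String token, CompletableFuture<Boolean> redisLookup) {
        try {
            // Fail fast instead of hanging until a driver-level timeout.
            return redisLookup.orTimeout(5, TimeUnit.SECONDS).join();
        } catch (CompletionException e) {
            if (ignoreRedisErrors) {
                return false; // assume the token is not revoked, keep the service alive
            }
            throw e; // strict mode: surface the Redis failure
        }
    }
}
```

In strict mode (the proposed default) a Redis outage still fails the login/logout flow, just quickly instead of hanging.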
JMAP / Redis event bus
If operating on top of a lone redis, then failure at the JMAP level seems ok to me at first glance.
However, failing to time out cleanly IS an issue. If Redis is KO, those JMAP requests should fail fast, in the 5-second range IMO.
Lastly: we cannot start (or restart) a TMail instance while Redis is stopped.
That's indeed a problem: we shall be able to reboot TMail (when not using Redis for pub/sub).
If using Redis for pub/sub, then failing to start James would be acceptable when Redis is down...
Thoughts?
I used a single Redis container for the test.
Is the Redis cluster (master-slave) on staging K8s enough for what we want?
I checked staging: the topology is 1 master + 2 replicas, and the TMail configuration uses the Redis master's endpoint. => Testing with it is no different from a single-node container.
I checked staging: the topology is 1 master + 2 replicas, and the TMail configuration uses the Redis master's endpoint.
A bit more explanation on that. Before the Redis event bus work, we configured the Redis endpoint to be the Redis service endpoint (K8s can load-balance to either master or slave). After the Redis event bus work, some related PUBSUB commands need to execute against a writable node, i.e. the master. So recently I changed the Redis endpoint directly to the master node.
That does not sound good actually. An alternative I can think of:
Redis Cluster (multi-master)
Now that I think about it again, wasn't there an issue using the Redis cluster with one of our components? Maybe Apisix?
Is the Redis cluster (master-slave) on staging K8s enough for what we want?
No, ideally Redis Cluster should be used for testing.
IMO redis topology shall be...
One redis cluster for all our use cases
How many masters in the redis-cluster?
3 node cluster
Docker-compose lab:
Commit: https://github.com/linagora/tmail-backend/commit/432a0e5c2a6028e69ac52d8718456610b405d3e6
Branch: https://github.com/vttranlina/tmail-backend/tree/testRedis
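For reference, a minimal sketch of what such a 3-master / 3-replica docker-compose lab can look like. The linked commit is authoritative; service names, image version and ports here are illustrative:

```yaml
version: "3"
services:
  redis1:
    image: redis:7.2
    command: redis-server --port 6379 --cluster-enabled yes --cluster-node-timeout 15000
  # redis2 .. redis6: same definition, one service per node

  # One-shot container that wires the 6 nodes into a 3-master / 3-replica cluster
  cluster-init:
    image: redis:7.2
    depends_on:
      - redis1
    command: >
      redis-cli --cluster create
      redis1:6379 redis2:6379 redis3:6379
      redis4:6379 redis5:6379 redis6:6379
      --cluster-replicas 1 --cluster-yes
```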
Testing with redis-cluster can lead to various scenarios, so before describing them I'll make a few remarks:
We need to pay attention to the `cluster-node-timeout` parameter in the configuration file when starting redis-cluster.
For example, `cluster-node-timeout = 60000` means that when a node in the cluster goes down, it takes up to 60 seconds for the remaining nodes to confirm the failure and mark the cluster as down. Within 1-60 seconds after node 1 goes down, the remaining nodes still report a normal status.
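The relevant `redis.conf` fragment, for reference (values illustrative; see the redis.conf shipped with Redis 7.x for the defaults):

```
# redis.conf
cluster-enabled yes
# Milliseconds a node can be unreachable before it is considered failed.
# The shipped default is 15000; 60000 was used in the experiment above.
cluster-node-timeout 60000
```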
(A requirement for building a cluster is a minimum of 3 master nodes.) Example before any node goes down:
- `key1` is stored on `node1`
- `key2` -> `node2`
- `key3` on `node3`

When `node1` goes down:
- requests for `key1` hang (waiting)
- `key2` and `key3` are served normally
- writing a new `key4`: it depends on which node the key is routed to
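For reference, the routing above follows Redis Cluster's hash-slot rule: `slot = CRC16(key) mod 16384`, with each master owning a range of the 16384 slots. A self-contained sketch (Redis also honors `{hash tags}`, omitted here for brevity):

```java
import java.nio.charset.StandardCharsets;

// Sketch of Redis Cluster key routing: slot = CRC16(key) mod 16384.
public class HashSlot {
    // CRC-16/XMODEM, the variant Redis Cluster uses (poly 0x1021, init 0).
    static int crc16(byte[] bytes) {
        int crc = 0;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    public static int slot(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }
}
```

So whether a given key "waits" during an outage is fully determined by which slot range (and thus which master) it hashes into.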
Scenario sample:
- Node1 (master) - Node4 (replica)
- Node2 (master) - Node5 (replica)
- Node3 (master) - Node6 (replica)
During this time, monitoring the Redis logs will show entries like:
```
Cluster state changed: fail
....
Cluster state changed: ok
```
When one node goes down (within the first 1-60 seconds):
- If client-side sharding routes data to the down node => waiting.
- If it routes to an up node => the response is created normally.

For the JMAP methods (Email/send, EmailSubmission/set), once the cluster has confirmed the "fail" status, the response returns immediately:
```json
{
  "sessionState": "2c9f1b12-b35a-43e6-9af2-0106fb53a943",
  "methodResponses": [
    [
      "Email/send",
      {
        "accountId": "b0d9e55c1a2682586469bc2a23dbb2c671e138ee61e0362972fd7c3d265ea9b2",
        "newState": "2c9f1b12-b35a-43e6-9af2-0106fb53a943",
        "notCreated": {
          "K87": {
            "type": "serverFail",
            "description": "CLUSTERDOWN The cluster is down"
          }
        }
      },
      "c1"
    ]
  ]
}
```
Related error log from the Lettuce library:
```
io.lettuce.core.RedisCommandExecutionException: CLUSTERDOWN The cluster is down
```
Warning logs when starting TMail:
```
06:17:33.031 [WARN ] i.l.c.c.t.DefaultClusterTopologyRefresh - Unable to connect to [redis1/<unresolved>:6379]: Connection initialization timed out after 1 minute(s)
2024-04-26T06:17:33.031475228Z io.lettuce.core.RedisCommandTimeoutException: Connection initialization timed out after 1 minute(s)
PeriodicalHealthChecks - DEGRADED: Redis: Can not connect to Redis.
```
Other notes:
- The `timeout` (on the TMail side) should be aligned with the `cluster-node-timeout` configured on the Redis side.
- The current "," character for separating Redis nodes in `redis.properties` does not work. I don't know why, but parsing accepts only the first URL; I replaced it with ";" and updated `RedisUris.from`, and then it worked normally.

Interesting experiment.
Within the first 1-60 seconds: if client-side sharding routes data to the down node => waiting. If it routes to an up node => response created normally.
So, can TMail recover and reconnect to the Redis Cluster after the Redis Cluster is back to normal?
A new error related to using Redis Cluster, even when the Redis Cluster is up and running normally, is the RedisHealthCheck error:
I cannot understand this. The `RedisHealthCheck` is supposed to create a new connection for every check, which should then see that the Redis Cluster is healthy again.
The current "," character for separating Redis nodes in redis.properties does not work. Don't know why, but parsing accepts only the first URL; I replaced it with ";" and updated RedisUris.from, then it worked normally.
Don't forget to file a fix for it ^^
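A sketch of the intended parsing once the separator bug is fixed. `RedisUris.from` is the real entry point in tmail-backend; this helper and its behavior are illustrative only:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: split the configured redis.properties value on ","
// before handing each URI to the actual Redis URI parser, instead of
// feeding the whole comma-separated string to it (which keeps only the first URL).
public class RedisUriParsing {
    public static List<String> parseUris(String raw) {
        return Arrays.stream(raw.split(","))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toList());
    }
}
```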
So what I understand is that we can't use Redis Cluster with Rspamd, correct? Same for Sentinel, I would guess, if you can only point to one endpoint?
Or maybe a K8s headless endpoint that resolves to all the Redis pod addresses would do the trick?
Same for sentinel I would guess then if you can only point one endpoint?
Rspamd does support Redis Sentinel: https://rspamd.com/doc/configuration/redis.html
I am unsure about Redis Cluster as I do not see Rspamd mention it.
I am unsure about Redis Cluster as I do not see Rspamd mention it.
I remember it being unsupported, as it lacked some Redis commands.
Some summary:
Some questions:
What configuration parameter can be used to trigger the fallback? Shall we lower the trigger to, say, 10 seconds?
I really wonder if we should not ignore failures upon key dispatch. It would not be that bad. We could make this configurable in the Redis conf? Because losing the ability to send email when Redis is down does not seem like a nice property to me!
What configuration parameter can be used to trigger the fallback?
Open `redis-cli`, then run the command `CLUSTER FAILOVER FORCE` on a replica node.
Ref: https://redis.io/docs/latest/commands/cluster-failover/
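The manual failover recipe above, as a redis-cli session (host and port are illustrative):

```
$ redis-cli -h redis4 -p 6379        # connect to a replica of the failed master
redis4:6379> CLUSTER FAILOVER FORCE  # promote this replica without coordinating with its master
OK
redis4:6379> CLUSTER NODES           # verify the new topology
```

`FORCE` skips the handshake with the (unreachable) master, which is exactly the degraded situation described here.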
Shall we lower the trigger to say 10 seconds?
My opinion: lower is better. The default configuration is 15 seconds. Ref: https://raw.githubusercontent.com/redis/redis/7.2/redis.conf
I really wonder if we should not ignore failures upon key dispatch. It would not be that bad. We could make this configurable in the Redis conf? Because losing the ability to send email when Redis is down does not seem like a nice property to me!
+1
The key dispatch by Redis for the notification feature is not critical.
Shall we lower the trigger to say 10 seconds?
What is the impact of a false positive, i.e. failing over when nothing is actually wrong?
What is the impact of a false positive, i.e. failing over when nothing is actually wrong?
Whether the master node is down or not, if we run `CLUSTER FAILOVER FORCE` on a replica node, the replica is "forced" to become the master immediately, and the old master becomes a replica.
What is the impact of a false positive, i.e. failing over when nothing is actually wrong?
That was not the question.
Upon a master/replica failover...
... do we lose unreplicated data?
... how long does the failover take?
... are there other side effects?
Based on these answers we might want to pick a low value, or a defensive value to prevent too-frequent switches...
Upon a master/replica failover... do we lose unreplicated data?

Yes. Example cases:
- Data loss due to asynchronous replication: client B writes `data1` to master A; A acknowledges the write but goes down before replicating it, so `data1` (not yet replicated to replica A1) is lost.
- Data loss due to partition: during a network partition (before `node_timeout` is detected), client C successfully writes `data2` to master node A; once the partition is resolved and A has been failed over, `data2` is lost.

// the Redis documentation notes that Redis Cluster does not guarantee strong consistency
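As a side note, Redis offers a partial mitigation for the asynchronous-replication case: the `WAIT` command blocks the client until the write has been acknowledged by N replicas (at a latency cost, and per the Redis docs this still does not make the cluster strongly consistent):

```
redis> SET data1 "value"
OK
redis> WAIT 1 100
(integer) 1
```

`WAIT 1 100` returns the number of replicas that acknowledged the preceding writes within 100 ms; a write confirmed this way survives the loss of the master.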
... How long does the failover take?

The configured `cluster-node-timeout`, plus a buffer of about 2 seconds.
From the Redis documentation:
the cluster becomes available again after NODE_TIMEOUT time plus a few more seconds required for a replica to get elected and failover its master (failovers are usually executed in a matter of 1 or 2 seconds).
Related to:

Shall we lower the trigger to say 10 seconds?

Updated answer: a longer duration would be preferable in the event of a partition issue where the client is on the same side as the failed master node; it ensures that data loss does not occur once the issue is resolved. For the "asynchronous replication" case above it can also help save data that was not yet replicated, but the trade-off is a longer downtime.
Why
Expectation: the TMail core service should not be disrupted, or only minimally disrupted, by a Redis outage.
How
Experiment on preprod with what happens to the TMail deployment when Redis is down.
Some related Redis features:
Identify issues and propose enhancements to help the TMail deployment be more fault-tolerant and resilient.