slact / nchan

Fast, horizontally scalable, multiprocess pub/sub queuing server and proxy for HTTP, long-polling, Websockets and EventSource (SSE), powered by Nginx.
https://nchan.io/

NChan no longer works with clustered Redis in AWS Elasticache #662

Open mkdewidar opened 1 year ago

mkdewidar commented 1 year ago

Hi,

It seems that starting from v1.2.15 (technically, v1.2.13 but that was withdrawn), NChan can no longer be used with AWS Elasticache Redis clusters. It fails to establish connections with the cluster citing (with debug logs enabled):

nchan: Redis node 172.19.220.15:6379 node_connector_callback state 1
nchan: Redis node 172.19.220.15:6379 node_connector_callback state 2
nchan: Redis node 172.19.220.15:6379 node_connector_callback state 4
nchan: Redis node 172.19.220.15:6379 node_connector_callback state 6
nchan: Redis node 172.19.220.15:6379 node_connector_callback state 10
nchan: Redis node 172.19.220.15:6379 node_connector_callback state 13
nchan: Redis node 172.19.220.15:6379 node_connector_callback state 14
nchan: Redis node 172.19.220.15:6379 all scripts loaded
nchan: Redis node 172.19.220.15:6379 node_connector_callback state 17
nchan: Redis slave node 172.19.220.15:6379 node_connector_callback state 19
nchan: Redis master node :6379 connection failed: IP address connects to more than one server. Is Redis behind a proxy?

Elasticache Redis clusters are not behind a proxy, though there is some sort of DNS load balancing that happens. Clients use DNS to resolve a fixed hostname (called the "configuration endpoint") to any one of the cluster's nodes, and then discover the IP addresses of the other nodes in the cluster using standard Redis cluster commands.
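
As an illustration of that discovery flow (just a sketch using hiredis, not nchan's actual code, and the endpoint hostname is a placeholder), a client connects to whichever node the configuration endpoint resolves to and then asks the cluster itself for its topology with CLUSTER SLOTS:

#include <stdio.h>
#include <hiredis/hiredis.h>

int main(void) {
  /* Placeholder for the ElastiCache configuration endpoint; DNS picks one node. */
  redisContext *c = redisConnect("clustername.clustercfg.region.cache.amazonaws.com", 6379);
  if (c == NULL || c->err) {
    fprintf(stderr, "connect failed: %s\n", c ? c->errstr : "out of memory");
    return 1;
  }

  /* Standard Redis Cluster command: each entry is a slot range followed by
     the master node and its replicas ([ip, port, id, ...]) serving that range. */
  redisReply *reply = redisCommand(c, "CLUSTER SLOTS");
  if (reply != NULL && reply->type == REDIS_REPLY_ARRAY) {
    for (size_t i = 0; i < reply->elements; i++) {
      redisReply *range = reply->element[i];
      for (size_t j = 2; j < range->elements; j++) {
        redisReply *node = range->element[j];
        printf("slots %lld-%lld -> %s:%lld%s\n",
               range->element[0]->integer, range->element[1]->integer,
               node->element[0]->str, node->element[1]->integer,
               j == 2 ? " (master)" : " (replica)");
      }
    }
  }
  if (reply != NULL) freeReplyObject(reply);
  redisFree(c);
  return 0;
}

In other words, the node IPs a cluster-aware client ends up using come from the cluster's own answer, not from whatever address the configuration endpoint happens to resolve to at connect time.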

From what I can tell, the root of the issue here seems to be that as part of the Redis TLS support changes, the pubsub connection now connects to Redis by cp->hostname, rather than cp->peername in node_connector_callback for the REDIS_NODE_CMD_CONNECTING state.

I applied the following patch and tested it, and it seems to fix the issue. However, I don't know enough about TLS or NChan to say whether this is a reliable solution.

--- build/nchan-1.3.6/src/store/redis/redis_nodeset.c
+++ build/nchan-1.3.6/src/store/redis/redis_nodeset.c
@@ -1972,7 +1972,14 @@
     node_log_error(node, "redis hostname is too long");
     return NULL;
   }
-  ngx_memcpy(hostchr, rcp->hostname.data, rcp->hostname.len);
+
+  if (rcp->peername.len != 0) {
+    ngx_memcpy(hostchr, rcp->peername.data, rcp->peername.len);
+  }
+  else {
+    ngx_memcpy(hostchr, rcp->hostname.data, rcp->hostname.len);
+  }
+
   ac = redisAsyncConnect((const char *)hostchr, rcp->port);
   if (ac == NULL) {
     node_log_error(node, "count not allocate Redis context");
slact commented 1 year ago

What version of Nchan are you running? What does your Nginx conf look like? (edit out the private details)

mkdewidar commented 1 year ago

We're currently on 1.2.10, and the resulting config looks something like this:

...
http {
    ...
    upstream redis_cluster {
        nchan_redis_server redis://clustername.clustercfg.region.cache.amazonaws.com:6379;
        nchan_redis_storage_mode nostore;
        nchan_redis_nostore_fastpublish on;
        nchan_redis_subscribe_weights master=1 slave=1000;
    }

    nchan_shared_memory_size    256M;
    nchan_message_timeout       1m;
    nchan_message_buffer_length 3;

    ...
    server {
        ...
        location ~ /someurl {
            internal;

            nchan_subscriber;
            nchan_subscriber_first_message newest;
            nchan_channel_id $1;
            nchan_redis_pass redis_cluster;
            nchan_eventsource_ping_interval 60;
            nchan_eventsource_ping_comment "ping";
            nchan_eventsource_ping_event "";
            nchan_subscribe_request /notifyurl;
            nchan_unsubscribe_request /othernotifyurl;
        }

        location ~ /someotherurl {
            internal;

            nchan_publisher;
            nchan_channel_id $1;
            nchan_redis_pass redis_cluster;
            nchan_channel_id_split_delimiter ",";

            nchan_max_channel_id_length 32768;
        }
    }
}
slact commented 1 year ago

1.2.10 is over 2 years and 10 releases behind. Please try the latest version (1.3.6). Elasticache should work just fine.

mkdewidar commented 1 year ago

Sorry, I think I might've made things a bit confusing. 1.2.10 is working fine. We are facing these issues when we try to upgrade to 1.3.6. We are seeing this issue only when using Elasticache with clustered Redis. Our other service that uses non-clustered Redis is working fine with 1.3.6.

I think I misunderstood what you meant by "what version are you running", sorry!

mkdewidar commented 1 year ago

Hi @slact, have you had a chance to look into this further?

slact commented 1 year ago

What is your ElastiCache configuration? Is TLS enabled? AUTH?

mkdewidar commented 1 year ago

No TLS or AUTH in this case. Just a cluster with a couple of shards and replicas, running Redis 7.

slact commented 1 year ago

Strange. I have no problem whatsoever using ElastiCache, any version, on any modern Nchan version.

Please try the following (separately):

Let me know which of these work, if any.

piotr-lwks commented 1 year ago

How is nchan discovering nodes from ElastiCache? I would like to simulate it locally with a DNS-based HAProxy setup.

mkdewidar commented 1 year ago

@slact Sorry for the delay in getting back to you. I tried these separately as you suggested:

You mentioned you weren't able to reproduce the issue: were you using the configuration endpoint or the endpoints of the individual nodes? The issue is specific to the configuration endpoint, because it sits behind some form of round-robin DNS. The configuration endpoint worked with NChan until v1.2.15.
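
For reference, the round-robin behaviour is easy to observe by resolving the configuration endpoint a few times in a row. A minimal sketch (placeholder hostname again), assuming a POSIX resolver:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netdb.h>
#include <arpa/inet.h>

int main(void) {
  /* Placeholder for the ElastiCache configuration endpoint. */
  const char *endpoint = "clustername.clustercfg.region.cache.amazonaws.com";
  struct addrinfo hints, *res, *p;
  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_INET;
  hints.ai_socktype = SOCK_STREAM;

  /* Resolve several times; with round-robin DNS the answers can differ per attempt. */
  for (int attempt = 1; attempt <= 5; attempt++) {
    if (getaddrinfo(endpoint, "6379", &hints, &res) != 0)
      continue;
    for (p = res; p != NULL; p = p->ai_next) {
      char ip[INET_ADDRSTRLEN];
      struct sockaddr_in *sin = (struct sockaddr_in *)p->ai_addr;
      inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof(ip));
      printf("attempt %d: %s\n", attempt, ip);
    }
    freeaddrinfo(res);
  }
  return 0;
}

If successive attempts return different node IPs, that matches the behaviour I'm describing.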

slact commented 1 year ago

Yeah, I had no problem using the shared config endpoint with round-robin DNS.

Please try the following: set the logging level to 'debug', and grep through the log for anything containing 'redis'. Please post the results, or email them to me.