splitio / split-synchronizer

Golang agent for Split SDKs

synchronizer redis failed reconnection #50

Closed azhurbilo closed 4 years ago

azhurbilo commented 5 years ago

We ran into a split-synchronizer issue in production, using AWS ElastiCache Redis 3.2.10 (single-node mode), when Amazon tried to recreate the node:

Recovering cache nodes 0001

Note: the IP address of the Redis endpoint is preserved: https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/ClientConfig.DNS.html https://forums.aws.amazon.com/thread.jspa?threadID=140869

But in the split-synchronizer logs we kept seeing error messages for several minutes, until we redeployed the split-synchronizer service; reconnection was not handled properly:

December 26th 2018, 10:10:08.284 | SPLITIO-AGENT | ERROR: 2018/12/26 10:10:08 impression.go:35: dial tcp 10.10.100.100:6379: i/o timeout
December 26th 2018, 10:10:03.284 | SPLITIO-AGENT | ERROR: 2018/12/26 10:10:03 postimpressions.go:63: Error Retrieving
...
December 26th 2018, 10:08:34.282 | SPLITIO-AGENT | ERROR: 2018/12/26 10:08:34 impression.go:35: dial tcp 10.10.100.100:6379: getsockopt: no route to host
...
December 26th 2018, 10:08:31.281 | SPLITIO-AGENT | ERROR: 2018/12/26 10:08:31 postimpressions.go:63: Error Retrieving
...
December 26th 2018, 10:08:13.281 | SPLITIO-AGENT | ERROR: 2018/12/26 10:08:13 postimpressions.go:63: Error Retrieving
...
December 26th 2018, 10:08:01.282 | SPLITIO-AGENT | ERROR: 2018/12/26 10:08:01 impression.go:35: dial tcp 10.10.100.100:6379: getsockopt: no route to host
...
December 26th 2018, 10:07:41.320 | SPLITIO-AGENT | ERROR: 2018/12/26 10:07:41 metrics.go:31: read tcp 172.17.0.7:40692->10.10.100.100:6379: i/o timeout
...
December 26th 2018, 10:07:36.321 | SPLITIO-AGENT | ERROR: 2018/12/26 10:07:36 metrics.go:70: dial tcp 10.10.100.100:6379: i/o timeout
...
December 26th 2018, 10:07:35.365 | SPLITIO-AGENT | ERROR: 2018/12/26 10:07:35 postimpressions.go:63: Error Retrieving
...
December 26th 2018, 10:07:29.048 | SPLITIO-AGENT | ERROR: 2018/12/26 10:07:29 fetchsplits.go:58: Error saving till value into storage adapter. read tcp 172.17.0.7:40682->10.10.100.100:6379: i/o timeout
...
December 26th 2018, 10:07:15.364 | SPLITIO-AGENT | ERROR: 2018/12/26 10:07:15 impression.go:35: read tcp 172.17.0.7:40686->10.10.100.100:6379: i/o timeout
...
December 26th 2018, 10:07:05.364 | SPLITIO-AGENT | ERROR: 2018/12/26 10:07:05 impression.go:35: read tcp 172.17.0.7:40688->10.10.100.100:6379: i/o timeout
December 26th 2018, 10:07:05.364 | SPLITIO-AGENT | ERROR: 2018/12/26 10:07:05 postimpressions.go:63: Error Retrieving

I tried to reproduce this with a local Redis using the steps below, but couldn't :(

1) run redis:
docker run -d --name test-redis -p 6379:6379 redis:3
2) run the synchronizer:
docker run -ti --rm --name test-splitio -e SPLIT_SYNC_LOG_STDOUT=on -e SPLIT_SYNC_API_KEY=xxx -e SPLIT_SYNC_REDIS_HOST=<HOST_IP> splitsoftware/split-synchronizer:1.7.1
3) stop redis:
docker rm -f test-redis
4) wait until ERRORS appear
5) bring the redis container back again:
docker run -d --name test-redis -p 6379:6379 redis:3
6) the ERRORS disappeared; the sync reconnected properly

Maybe you have some idea?

mredolatti commented 5 years ago

Hello @azhurbilo. I'm sorry to hear that you're having issues with our synchronizer. We'll dig into this and let you know as soon as we can reproduce it and come up with a solution.

Just to be sure: were you able to verify that the cluster was indeed recreated with the same IP address?

Thanks.

azhurbilo commented 5 years ago

Were you able to verify that the cluster was indeed recreated with the same IP address?

We just saw a message in the AWS log about the instance recreation. Our monitoring system, which connects directly to the Redis port from another VM host, tracked this downtime and reconnected automatically within 1 minute.

mredolatti commented 5 years ago

Hello @azhurbilo, we've been taking a look at this issue and we suspect that it's related to your Redis cluster being re-created with the same hostname but a different IP. According to your logs, the IP of your cluster used to be 10.10.100.100 before Amazon re-created your instance. Would you mind checking whether the host you're currently connected to (after the re-creation) points to the same address?

On Linux you can do this by running dig <REDIS_HOST>; you should see something like:

;; ANSWER SECTION:
<REDIS_HOST>        234 IN  A   XXX.XXX.XXX.XXX

This is just to confirm that the host IP has indeed changed.
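The same check can also be done programmatically with Go's standard library (a sketch; substitute your actual Redis hostname for the placeholder):

```go
// Sketch: resolve a hostname with a fresh lookup, the same way a TCP
// dial would, to confirm whether the IP behind it has changed.
package main

import (
	"fmt"
	"net"
)

// resolve performs a fresh DNS lookup with no application-level cache.
func resolve(host string) ([]string, error) {
	return net.LookupHost(host)
}

func main() {
	// "localhost" is a stand-in; use your <REDIS_HOST> here.
	ips, err := resolve("localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved to:", ips)
}
```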

Also, do you remember how much time you waited for the synchronizer to re-connect to redis until you restarted it?

There's a chance that our timeouts and tcp keep-alive configurations might be on the high side as well.

In the meantime we will try to replicate the issue with an elasticache instance within our AWS environment.

We will keep you posted with our findings.

Thank you very much for reaching out, and once again, our apologies for the inconveniences caused.

azhurbilo commented 5 years ago

would you mind checking if the host that you're currently connected to (after being re-created) points to the same address?

yes, it's the same 10.10.100.100 IP address

Also, do you remember how much time you waited for the synchronizer to re-connect to redis until you restarted it?

9 min

In the meantime we will try to replicate the issue with an elasticache instance within our AWS environment.

great, thnx!

azhurbilo commented 4 years ago

Today we faced the same issue in prod :( We use version 2.5.2. @mredolatti any news?

mredolatti commented 4 years ago

Hi @azhurbilo ,

We apologize for the delay on this. We were never able to reproduce this issue and we eventually had other issues to tackle.

It seems like every time an ElastiCache cluster is created, a new hostname & IP address are generated, different from the previous ones.

How exactly are you generating this? Are you updating a CNAME entry in your DNS records every time you create an ElastiCache cluster, and using that CNAME as the Redis host?

We want to replicate your environment as much as possible for debugging this.

Thanks! Martin

azhurbilo commented 4 years ago

I can tell that AWS in this case triggers the event ElastiCache:CacheNodeReplaceStarted:

https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/ElastiCacheSNS.html https://aws.amazon.com/elasticache/faqs/

ElastiCache has detected that the host running a cache node is degraded
or unreachable and has started replacing the cache node.

Note

The DNS entry for the replaced cache node is not changed.

In most instances, you do not need to refresh the server-list for your clients 
when this event occurs. However, some cache client libraries may stop using 
the cache node even after ElastiCache has replaced the cache node; 
in this case, the application should refresh the server-list when this event occurs.

And we use an additional CNAME (which never changes) pointing to the DNS name of the AWS Redis endpoint.

Maybe that helps somehow.

Israphel commented 4 years ago

@azhurbilo out of curiosity, does your ElastiCache cluster have 0 replicas?

azhurbilo commented 4 years ago

Yes, it's a single-node Redis ElastiCache instance.

mredolatti commented 4 years ago

Hi @azhurbilo,

What's the TTL on your DNS records? We've been doing some tests: there's no DNS cache in our synchronizer, and no cache in the Redis library or the underlying TCP module. It seems like only an OS-level DNS cache might be involved.

NicoZelaya commented 4 years ago

Hi @azhurbilo,

it's been a while, so I wanted to check the status of this. Are you still experiencing any issues? Could you look into the TTL of your DNS records? Were there any other developments on your end?

We'll wait for a few days to hear back from you and then close the issue but feel free to reopen at any time if needed.

Best, Nico.

azhurbilo commented 4 years ago

Sorry for not responding. Yes, we can close this ticket, as I no longer work at the company where this issue occurred.