Connection timed out to AWS elasticache

xaratt commented 11 years ago

I found a strange bug while trying to run twemproxy with a cluster of ElastiCache (Amazon's cloud memcached) servers. Amazon uses CNAMEs as the entry points for ElastiCache servers. twemproxy could connect to the backend memcached servers on start, but couldn't send any requests to them. If I use the "direct" hostnames for the backend servers, all requests are OK.

user@localhost:~$ telnet my.proxy.server 11311
Trying xx.xx.xx.xx...
Connected to xx.xx.xx.xx.
Escape character is '^]'.
get foo
SERVER_ERROR Connection timed out
^]

twemproxy config:

staging-cache:
  listen: 0.0.0.0:11311
  hash: fnv1a_64
  distribution: ketama
  timeout: 10000
  backlog: 1024
  preconnect: true
  auto_eject_hosts: true
  server_retry_timeout: 30000
  server_failure_limit: 3
  servers:
   - myserver.0001.use1.cache.amazonaws.com:11211:1
   - myserver.0002.use1.cache.amazonaws.com:11211:1
   - myserver.0003.use1.cache.amazonaws.com:11211:1
   - myserver.0004.use1.cache.amazonaws.com:11211:1

twemproxy was running as

user@my.proxy.server:~$ nutcracker -c /etc/nutcracker.yml -v 11

Here is part of twemproxy log: http://pastebin.com/DTE8gAva

When I modified servers section:

  servers:
   - ec2-xx-xx-xx-xx.compute-1.amazonaws.com:11211:1
   - ec2-xx-xx-xx-xx.compute-1.amazonaws.com:11211:1
   - ec2-xx-xx-xx-xx.compute-1.amazonaws.com:11211:1
   - ec2-xx-xx-xx-xx.compute-1.amazonaws.com:11211:1

I received response:

user@localhost:~$ telnet my.proxy.server 11311
Trying xx.xx.xx.xx...
Connected to xx.xx.xx.xx.
Escape character is '^]'.
get foo
END
^]

And, of course, *.cache.amazonaws.com can be resolved from the instance where twemproxy is running:

user@my.proxy.server:~$ host myserver.0002.use1.cache.amazonaws.com
myserver.0002.use1.cache.amazonaws.com is an alias for ec2-xx-xx-xx-xx.compute-1.amazonaws.com.
ec2-xx-xx-xx-xx.compute-1.amazonaws.com has address xx-xx-xx-xx
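
The same check can be scripted. The following is a minimal Python sketch, assuming the system resolver is representative of what the proxy host sees. The function names `resolved_addresses` and `same_backend` and the injectable `resolver` parameter are hypothetical, introduced only for illustration. `socket.gethostbyname_ex` returns the canonical name, the alias list, and the IPv4 addresses, so the ElastiCache CNAME and its EC2 target can be compared directly:

```python
import socket


def resolved_addresses(hostname, resolver=socket.gethostbyname_ex):
    """Return (canonical_name, sorted IPv4 addresses) for hostname.

    `resolver` is injectable (an assumption of this sketch) so the
    function can be exercised without touching real DNS.
    """
    canonical, _aliases, addresses = resolver(hostname)
    return canonical, sorted(addresses)


def same_backend(name_a, name_b, resolver=socket.gethostbyname_ex):
    """True if both names resolve to the same set of IPv4 addresses."""
    addrs_a = resolved_addresses(name_a, resolver)[1]
    addrs_b = resolved_addresses(name_b, resolver)[1]
    return addrs_a == addrs_b
```

Run on the proxy host, `same_backend("myserver.0002.use1.cache.amazonaws.com", "ec2-xx-xx-xx-xx.compute-1.amazonaws.com")` would be expected to return `True` when both names point at the same node. It only reports what the system resolver returns, not how twemproxy itself resolves the name.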

P.S. Oct 26 code snapshot was used; Ubuntu 12.04.1 x86_64

manjuraj commented 11 years ago

@xaratt this is really odd.

What happens when you try to connect to the AWS cache servers from "my.proxy.server" directly? So, in one try you use

printf "get foo\r\n" | nc myserver.0001.use1.cache.amazonaws.com 11211

And in the next try you do

printf "get foo\r\n" | nc ec2-xx-xx-xx-xx.compute-1.amazonaws.com 11211

Do both of them work for you?

Also, do pings of "ec2-xx-xx-xx-xx.compute-1.amazonaws.com" and "myserver.0001.use1.cache.amazonaws.com" resolve to different addresses?
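
The two `nc` probes can also be run from one script. This is a rough Python sketch of the same memcached text-protocol `get`, not a full client: the helper names `memcached_get` and `probe` are hypothetical, and the end-of-reply detection is deliberately simple (it looks for a trailing `END` line or a single error line and does not parse `VALUE` lengths).

```python
import socket

# Error lines defined by the memcached text protocol.
ERROR_PREFIXES = (b"ERROR", b"CLIENT_ERROR", b"SERVER_ERROR")


def _complete(reply):
    """True once the reply ends with END or is a single full error line."""
    if reply.endswith(b"END\r\n"):
        return True
    return reply.startswith(ERROR_PREFIXES) and reply.endswith(b"\r\n")


def memcached_get(sock, key):
    """Send `get <key>` on a connected socket and return the raw reply."""
    sock.sendall(f"get {key}\r\n".encode("ascii"))
    reply = b""
    while not _complete(reply):
        chunk = sock.recv(4096)
        if not chunk:  # peer closed the connection
            break
        reply += chunk
    return reply


def probe(host, port=11211, key="foo", timeout=5.0):
    """Connect to host:port, issue one get, and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return memcached_get(sock, key)
```

Calling `probe(...)` once with the `*.cache.amazonaws.com` name and once with the `ec2-...` name mirrors the two commands above. A healthy miss returns `b"END\r\n"`, while a proxy-side failure like the one reported would show up as a `SERVER_ERROR` line or a socket timeout.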

xaratt commented 11 years ago

Thank you for the response.

Yes, when I try to connect directly to the cache servers, both of them work fine:

user@my.proxy.server:~$ printf "get foo\r\n" | nc myserver.0001.use1.cache.amazonaws.com 11211
END
user@my.proxy.server:~$ printf "get foo\r\n" | nc ec2-xx-xx-xx-xx.compute-1.amazonaws.com 11211
END

And pinging both server names resolves to the same address. I can give you the real IP of "ec2-xx-xx-xx-xx", but Amazon's FAQ says that "Currently, all clients to an ElastiCache Cluster must be within the Amazon EC2 network".

I want to try running twemproxy with memcached on my non-AWS server, using a CNAME and an IP for the connection. I'll post the results here.

xaratt commented 11 years ago

I made a few new attempts and found a configuration that lets me use twemproxy with AWS ElastiCache. I created my own CNAMEs that point to Amazon's myserver.000x.use1.cache.amazonaws.com servers, and twemproxy works fine with this strange scheme:

cache1.example.com -> myserver.0001.use1.cache.amazonaws.com -> ec2-xx-xx-xx-xx.compute-1.amazonaws.com

Is this a bug (or a feature?) in Amazon's DNS system? I don't know.

tom-dalton-fanduel commented 9 years ago

Xaratt, did you ever get to the bottom of this?

We recently saw a similar-sounding issue with an ElastiCache Redis write endpoint: twemproxy appears to connect, but then commands result in "ERR Connection timed out". It was intermittent, but I will attempt to gather more info if it happens again.

xaratt commented 9 years ago

@tom-dalton-fanduel, sorry for the delay in responding. We never found the cause of this problem or a solution to it. We only added CNAMEs for each of our memcached nodes.

tom-dalton-fanduel commented 9 years ago

No problem - it looks like this was unrelated to an issue we were looking at!

digitalprecision commented 7 years ago

Confused. Isn't the EC Cluster endpoint a proxy, like twemproxy? Why deal with two proxies rather than setting your app to connect directly to the EC Cluster endpoint?

tom-dalton-fanduel commented 7 years ago

Twemproxy is more than just a proxy; it provides transparent sharding too. In my case (and I'm guessing @xaratt's too?) twemproxy is used to shard across multiple EC [write] endpoints.
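
To illustrate what sharding across endpoints means, here is a small Python sketch under stated assumptions. It uses the standard 64-bit FNV-1a hash (the config earlier in this thread names `fnv1a_64`, though twemproxy's own implementation may differ in detail) and a plain modulo to pick a server. The modulo is a simplification swapped in for illustration; it is not the ketama distribution twemproxy uses, and `pick_server` is a hypothetical helper.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # keep 64 bits
    return h


def pick_server(key: str, servers: list) -> str:
    """Map a key to one server by hash modulo (a simplification, NOT ketama)."""
    return servers[fnv1a_64(key.encode("utf-8")) % len(servers)]
```

The application talks to one address (the proxy), and each key is routed to exactly one backend. With a plain modulo, adding or removing a server remaps most keys, which is the problem a consistent-hashing scheme such as ketama is meant to reduce.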

smehtaCAS commented 5 years ago

@tom-dalton-fanduel I am having a similar issue to yours. Did you find out what the problem was? Thanks.

tom-dalton-fanduel commented 5 years ago

I'm afraid I don't even remember the context for my comment, let alone if we ever solved it. We've since moved away from Twemproxy.

digitalprecision commented 5 years ago

Same... we ended up using aws elasticache instead.

smehtaCAS commented 5 years ago

We use AWS ElastiCache as well, but intermittently we get "ERR Connection timed out". I could not see anything in the logs.

digitalprecision commented 5 years ago

Did you try installing the AWS-specific module? It's supposed to support ejections.

smehtaCAS commented 5 years ago

I have not. Can you point me to it? Thanks!

digitalprecision commented 5 years ago

I can't remember specifically, it's been a few years, but a google search should find something. Or start here: https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/Appendix.PHPAutoDiscoverySetup.html

TysonAndre commented 3 years ago

I wonder if https://github.com/twitter/twemproxy/pull/567 was related - until that's merged, ketama_max_hostlen is 86. (if it works for short names but not long names)

But I don't see how that'd possibly be the issue, it'd just hash requests to the wrong host. (snprintf still appends null characters)
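
A rough Python sketch of the truncation effect described above. It assumes each ketama ring point is keyed by a string of the form `<server name>-<index>` written with `snprintf` into a buffer of `ketama_max_hostlen` bytes. That format is an assumption of this sketch, not something stated in this thread, so check `nc_ketama.c` before relying on it. The helper names and the default of 40 points are arbitrary.

```python
KETAMA_MAX_HOSTLEN = 86  # value quoted in the comment above


def ring_key(server_name: str, point_index: int,
             buf_size: int = KETAMA_MAX_HOSTLEN) -> str:
    """Mimic snprintf(buf, buf_size, "%s-%u", ...).

    snprintf keeps at most buf_size - 1 characters because one byte
    is reserved for the terminating NUL.
    """
    return f"{server_name}-{point_index}"[:buf_size - 1]


def distinct_ring_keys(server_name: str, points: int = 40) -> int:
    """Count how many distinct keys survive truncation for one server."""
    return len({ring_key(server_name, i) for i in range(points)})
```

Under these assumptions a name the length of `myserver.0001.use1.cache.amazonaws.com:11211:1` is far below the limit, so every point keeps a distinct key. A name of 85 characters or more loses its `-<index>` suffix entirely and all of its points collapse to one key, which would skew the distribution but, as noted above, would not by itself explain a timeout.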

(looking at this issue while looking into whether timeouts are more likely with elasticache in general - doesn't seem like it)

TysonAndre commented 3 years ago

Leaving a note on this to refer back to later in case anyone else has issues with elasticache memcached - the issue I'm looking into is unrelated

https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/ParameterGroups.Memcached.html suggests there's no timeout, so I'm confused

It could be a bug parsing requests or responses in an older version
It could be intermittent issues with connectivity from a datacenter or application to elasticache

Elasticache itself doesn't have an idle timeout according to recent documentation for 1.4, not sure if old versions were different

idle_timeout
Default: 0 (disabled)
Type: integer
Modifiable: Yes
Changes Take Effect: At Launch
Description: The minimum number of seconds a client will be allowed to idle before being asked to close. Range of values: 0 to 86400.

I think https://github.com/twitter/twemproxy/pull/324/files#diff-01600ca8f8e542768f785de1842f38b3aeeb315531c63b9d2ce8730a21f72a80 may help (related to the redis sentinel support proposal), but I still get occasional timeouts to elasticache when there's low traffic anyway

twitter / twemproxy

Connection timed out to AWS elasticache #18