Closed: jennyfountain closed this issue 3 years ago
@manjuraj We are seeing similar issues referenced here https://github.com/twitter/twemproxy/issues/145
Here is our config.
listen: /var/run/nutcracker/our.socket 0777
auto_eject_hosts: true
distribution: ketama
hash: one_at_a_time
backlog: 65536
server_connections: 16
server_failure_limit: 3
server_retry_timeout: 30000
servers:
We are seeing a major backup in out_queue and it basically makes our site unusable.
In addition, auto_eject_hosts: true is not working as we thought.
Thanks for any insight or information! -J
@jennyfountain - I believe some of the issues can be solved by following the recommendations listed out here: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md
Could you try setting the timeout: parameter and setting server_connections to 1 for your cluster?
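For example (the pool name and server addresses below are placeholders, not your actual config), the relevant part of the pool definition would look something like this:

our_pool:                    # placeholder pool name
  listen: /var/run/nutcracker/our.socket 0777
  hash: one_at_a_time
  distribution: ketama
  auto_eject_hosts: true
  server_failure_limit: 3
  server_retry_timeout: 30000
  timeout: 400               # fail a request after 400ms instead of waiting forever; tune to your latency budget
  server_connections: 1      # a single pipelined connection per backend server
  servers:
    - x.x.x.x:11211:1        # placeholder server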
Also, do you have client-side retries? If so, the number of retries must be at least 4 for a request to succeed in the event of a server failure and ejection.
The reason you are noticing the backlog is that memcache is getting more load than expected and taking longer to respond.
How many memcache servers do you have? Across how many physical machines are they distributed, and how much load does each instance get at peak? Do you have p50, p90 and p95 latency numbers?
We had set it to 1 and got the same results. We increased it hoping that would help, but it didn't.
We also set timeout to 2000.
We currently have 29 memcache servers in the pool.
We do not have client side retries. When we went directly to the memcache servers, we did not see this issue.
Listed below is the current config we are using. Could this be a hash issue?
listen: /var/run/nutcracker/our.socket 0777
auto_eject_hosts: true
distribution: ketama
hash: one_at_a_time
backlog: 65536
server_connections: 16
server_failure_limit: 3
server_retry_timeout: 30000
timeout: 2000
servers:
  - x.x.x.x:11211:1
  - x.x.x.x:11211:1
  - x.x.x.x:11211:1
Thank you for your help on this!
one_at_a_time is not an ideal hash function, but I doubt that is the issue. Also, changing the hash at this point would change the routing of keys to the backend memcache servers (redistribute the sharding). Unless you are bringing up a new cluster of memcache, that's not such a good idea.
Just curious - What would you suggest as the ideal hash function?
Looking at our configs, does anything stand out? Could this be a socket issue? A timeout issue?
murmur would be my go-to hash function; fnv1a_64 is good too
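For reference, it would only be a one-line change in the pool config, but as mentioned above it re-shards all keys, so existing cached entries would effectively be lost:

hash: fnv1a_64    # or murmur; changes the key-to-server mapping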
your config looks fine;
At high load, is your CPU for twemproxy machines maxed out?
At high load, can you paste me the values for the following stats:
No, CPU/memory look perfect. Our out_queue backs up and time spent in memcache increases from 100ms to 600ms.
Yes! I will push this and paste in a second.
also paste values for https://github.com/twitter/twemproxy/blob/master/src/nc_stats.h#L48 and not out_queue_bytes
I included all of the stats for each server (sanitized :D) during our test.
Thank you so much!
@jennyfountain I looked at stats.txt and nothing really jumps out :( Have you used tools like mctop or twctop for monitoring or twemperf for load testing?
Also, in the load testing, are you requesting the same key over and over again?
I have mctop installed. I will start that during a load test and see if I can spot anything.
Does it matter if I am using https://memcached.org/? Version 1.4.17?
twctop seems to be throwing some errors and won't run for me so I will investigate that more.
thanks!
twctop only works with twemcache
We narrowed it down to PHP. No matter what version of PHP, libmemcached or memcached module, it's about 50% slower than nodejs.
@manjuraj
here is sample code that I am using as a test. Nodejs and memslap are seeing no issues :(.
private function createCacheObject()
{
    $this->cache = new Memcached('foo');
    $this->cache->setOption( Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT );
    $this->cache->setOption( Memcached::OPT_LIBKETAMA_COMPATIBLE, true );
    $this->cache->setOption( Memcached::OPT_SERIALIZER, Memcached::SERIALIZER_PHP );
    $this->cache->setOption( Memcached::OPT_TCP_NODELAY, false );
    $this->cache->setOption( Memcached::OPT_HASH, Memcached::HASH_MURMUR );
    $this->cache->setOption( Memcached::OPT_SERVER_FAILURE_LIMIT, 256 );
    $this->cache->setOption( Memcached::OPT_COMPRESSION, false );
    $this->cache->setOption( Memcached::OPT_RETRY_TIMEOUT, 1 );
    $this->cache->setOption( Memcached::OPT_CONNECT_TIMEOUT, 1 * 1000 );
    $this->cache->addServers($this->servers[$this->serverConfig]);
}
I believe this can be closed. I'm working on the same application as jennyfountain and the issue no longer occurs. I'm guessing the original issue was cpu starvation (or maybe some other misconfiguration causing slow syscalls)
Since this issue was filed, the application has remained stable after we avoided cpu starvation and moved to 4/8 nutcracker instances per host to keep nutcracker cpu usage consistently below 100%.
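For anyone hitting this later: since nutcracker is single-threaded, running several instances per host just means running several processes, each with its own listen socket, and spreading client connections across them. A minimal sketch of what one instance's pool could look like (pool name, socket path, and servers are placeholders, not our exact setup):

web_0:                       # instance 0; web_1, web_2, ... each get their own socket and process
  listen: /var/run/nutcracker/web_0.sock 0777
  hash: one_at_a_time
  distribution: ketama
  timeout: 2000
  server_connections: 1
  auto_eject_hosts: true
  server_failure_limit: 3
  server_retry_timeout: 30000
  servers:
    - x.x.x.x:11211:1        # placeholder server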
We are seeing a few issues that I was hoping someone could help me resolve or point me in the right direction.
Example (sometimes goes into the 2k/3k range as well):
"out_queue_bytes": 33
"out_queue_bytes": 91
"out_queue_bytes": 29
"out_queue_bytes": 29
"out_queue_bytes": 174
In addition, it shows that our time spent in memcache goes up from 400 ms to 1000-2000 ms. This seriously affects our application.
here is an example of a config:
web:
  listen: /var/run/nutcracker/web.sock 0777
  auto_eject_hosts: true
  distribution: ketama
  hash: one_at_a_time
  backlog: 65536
  server_connections: 16
  server_failure_limit: 3
  server_retry_timeout: 30000
  servers:
somaxconn = 128 (so the kernel caps the effective listen backlog at 128, well below the configured backlog of 65536)
What we tried that didn't help:
Thank you for any guidance on this problem.