twitter / twemproxy

A fast, light-weight proxy for memcached and redis

Questions about twemproxy and performance #513

Closed (jennyfountain closed this issue 3 years ago)

jennyfountain commented 7 years ago

We are seeing a few issues that I was hoping someone could help me resolve or point me in the right direction.

  1. During high loads, we are seeing a lot of backup in the out_queue_bytes. On normal traffic loads, this is 0.

Example (sometimes goes into 2k/3k range as well):

    "out_queue_bytes": 33
    "out_queue_bytes": 91
    "out_queue_bytes": 29
    "out_queue_bytes": 29
    "out_queue_bytes": 174

In addition, our time spent in memcache goes up from 400 ms to 1000-2000 ms, which seriously affects our application.

  2. auto_eject_hosts also does not seem to work as expected. When a server goes down, our app freaks out, saying it cannot access a memcache server.

here is an example of a config:

web:
  listen: /var/run/nutcracker/web.sock 0777
  auto_eject_hosts: true
  distribution: ketama
  hash: one_at_a_time
  backlog: 65536
  server_connections: 16
  server_failure_limit: 3
  server_retry_timeout: 30000
  servers:

somaxconn = 128

What we tried that didn't help:

  1. Setting mbuf to 512
  2. Changing server_connections (we tried values from 1 to 200)

Thank you for any guidance on this problem.

jennyfountain commented 7 years ago

@manjuraj We are seeing similar issues referenced here https://github.com/twitter/twemproxy/issues/145

(screenshot attached: 2017-02-21, 8:50 PM)

Here is our config.

listen: /var/run/nutcracker/our.socket 0777
auto_eject_hosts: true
distribution: ketama
hash: one_at_a_time
backlog: 65536
server_connections: 16
server_failure_limit: 3
server_retry_timeout: 30000
servers:

We are seeing a major backup in out_queue and it basically makes our site unusable.

In addition, auto_eject_hosts: true is not working as we thought.

Thanks for any insight or information! -J

manjuraj commented 7 years ago

@jennyfountain - I believe some of the issues can be solved by following the recommendations listed out here: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md

Could you try setting the timeout: parameter and setting server_connections to 1 for your cluster?
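
For illustration, here is a minimal sketch of how the pool from the config posted earlier might look with those two changes applied. Everything except timeout and server_connections is copied from that config; the timeout value of 400 is only an example placeholder, not a recommended number.

web:
  listen: /var/run/nutcracker/web.sock 0777
  auto_eject_hosts: true
  distribution: ketama
  hash: one_at_a_time
  backlog: 65536
  timeout: 400                # placeholder; tune to your own latency budget
  server_connections: 1       # one connection per backend, as suggested above
  server_failure_limit: 3
  server_retry_timeout: 30000
  servers:                    # same (omitted) server list as in the original config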

manjuraj commented 7 years ago

Also, do you have client-side retries? If so, the number of retries must be at least 4 for a request to succeed in the event of a server failure and ejection (with server_failure_limit: 3, the first three attempts can still land on the failing server before it is ejected).

The reason you are noticing backlog is that memcache is getting more load than expected and taking longer to respond.

How many memcache servers do you have? Across how many physical machines are they distributed, and how much load does each instance get at peak? Do you have p50, p90, and p95 latency numbers?

jennyfountain commented 7 years ago

We had set it to 1 and saw the same results. We increased it hoping that would help, but it didn't.

We also set timeout to 2000.

We currently have 29 memcache servers in the pool.

We do not have client side retries. When we went directly to the memcache servers, we did not see this issue.

Listed below is the current config we are using. Could this be a hash issue?

listen: /var/run/nutcracker/our.socket 0777
auto_eject_hosts: true
distribution: ketama
hash: one_at_a_time
backlog: 65536
server_connections: 16
server_failure_limit: 3
server_retry_timeout: 30000
servers:
  - x.x.x.x:11211:1
  - x.x.x.x:11211:1
  - x.x.x.x:11211:1
timeout: 2000

Thank you for your help on this!

manjuraj commented 7 years ago

one_at_a_time is not an ideal hash function, but I doubt that is the issue. Also, changing the hash at this point would change the routing of keys to the backend memcache servers (i.e., redistribute the sharding). Unless you are bringing up a new cluster of memcache, that's not such a good idea.

jennyfountain commented 7 years ago

Just curious - What would you suggest as the ideal hash function?

Looking at our configs, does anything stand out? Could this be a socket issue? A timeout issue?

manjuraj commented 7 years ago

murmur would be my go-to hash function; fnv1a_64 is good too
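
For illustration only, and keeping in mind the resharding caveat above, the change would be a single line in the pool config:

hash: fnv1a_64    # or murmur; switching this on a live pool remaps existing keys to different backends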

manjuraj commented 7 years ago

your config looks fine;

At high load, is your CPU for twemproxy machines maxed out?

manjuraj commented 7 years ago

At high load, can you paste the values for the following stats (an illustrative excerpt of the stats output is sketched after this list):

  1. https://github.com/twitter/twemproxy/blob/master/src/nc_stats.h#L36-L40
  2. https://github.com/twitter/twemproxy/blob/master/src/nc_stats.h#L28-L32
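
For context, twemproxy serves these counters as JSON on its stats port (default 22222). The sketch below is a rough illustration of the per-server portion of that output, shown in YAML-style key/value form rather than raw JSON and with made-up values; the field names follow the project README, and which of them the line ranges above refer to may differ between versions.

x.x.x.x:11211:
  server_eof: 0
  server_err: 0
  server_timedout: 3
  server_connections: 16
  server_ejected_at: 0
  requests: 104203
  request_bytes: 8231927
  responses: 104198
  response_bytes: 74182011
  in_queue: 0
  in_queue_bytes: 0
  out_queue: 5
  out_queue_bytes: 174
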
jennyfountain commented 7 years ago

No, CPU/memory look perfect. Our out_queue backs up and time spent in memcache increases from 100 ms to 600 ms.

jennyfountain commented 7 years ago

Yes! I will push this and paste in a second.

manjuraj commented 7 years ago

also paste values for https://github.com/twitter/twemproxy/blob/master/src/nc_stats.h#L48 and not out_queue_bytes

jennyfountain commented 7 years ago

stats.txt

(screenshot attached: 2017-02-23, 2:16 PM)

I included all of the stats for each server (sanitized :D) during our test.

Thank you so much!

manjuraj commented 7 years ago

@jennyfountain I looked at stats.txt and nothing really jumps out :( Have you used tools like mctop or twctop for monitoring or twemperf for load testing?

Also, in the load testing, are you requesting the same key over and over again?

jennyfountain commented 7 years ago

I have mctop installed. I will start that during a load test and see if I can spot anything.

Does it matter that I am using memcached (https://memcached.org/), version 1.4.17?

twctop seems to be throwing some errors and won't run for me so I will investigate that more.

thanks!

manjuraj commented 7 years ago

twctop only works with twemcache

jennyfountain commented 7 years ago

We narrowed it down to PHP. No matter which version of PHP, libmemcached, or the memcached module we use, it's about 50% slower than Node.js.

@manjuraj

Here is sample code that I am using as a test. Node.js and memslap see no issues :(.

private function createCacheObject() {
    $this->cache = new Memcached('foo');
    $this->cache->setOption(Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT);
    $this->cache->setOption(Memcached::OPT_LIBKETAMA_COMPATIBLE, true);
    $this->cache->setOption(Memcached::OPT_SERIALIZER, Memcached::SERIALIZER_PHP);
    $this->cache->setOption(Memcached::OPT_TCP_NODELAY, false);
    $this->cache->setOption(Memcached::OPT_HASH, Memcached::HASH_MURMUR);
    $this->cache->setOption(Memcached::OPT_SERVER_FAILURE_LIMIT, 256);
    $this->cache->setOption(Memcached::OPT_COMPRESSION, false);
    $this->cache->setOption(Memcached::OPT_RETRY_TIMEOUT, 1);
    $this->cache->setOption(Memcached::OPT_CONNECT_TIMEOUT, 1 * 1000);
    $this->cache->addServers($this->servers[$this->serverConfig]);
}

TysonAndre commented 3 years ago

I believe this can be closed. I'm working on the same application as jennyfountain and the issue no longer occurs. I'm guessing the original issue was cpu starvation (or maybe some other misconfiguration causing slow syscalls)

After this issue was filed,

  1. The rpm was updated and the strategy used is now significantly
  2. The servers were rebuilt on new generation hardware and peak cpu usage is much lower so cpu starvation is no longer an issue
  3. The application was upgraded to a newer php version
  4. Other bottlenecks/bugs were fixed

TysonAndre commented 3 years ago

This application continues to be stable after avoiding cpu starvation and moving to 4/8 nutcracker instances per host to keep nutcracker cpu usage consistently below 100% (nutcracker is single-threaded, so each instance can use at most one core).