Latency problem with 1k req/sec

zavada commented 10 years ago

I'm using 2.8.17 redis server. Send requests throw unix.sock by nginx+lua. Each request is one SMEMBERS command. Now I have about 1k req per sec. In the nginx access log I see mostly next

[09/Oct/2014:18:04:49 +0300] [ 1412867089.670 : 0.002 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.670 : 0.002 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.670 : 0.002 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.670 : 0.002 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.671 : 0.001 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.672 : 0.000 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.673 : 0.001 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.673 : 0.001 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.674 : 0.000 : 198 ] 212.1.66.241 - - "POST     /API/0.1/url/getCategory/ HTTP/1.1" 200 37 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.681 : 0.000 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29

You can see that request time is not more than 2ms. But almost every second I have a few requests with request time about 100ms

[09/Oct/2014:18:05:22 +0300] [ 1412867122.843 : 0.131 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:05:23 +0300] [ 1412867123.212 : 0.100 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:05:23 +0300] [ 1412867123.613 : 0.101 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29

I can't find the problem place. I checked everything from the latency guide Slowlog has no entries System is not swapping I don't use AOF

I read a lot of posts on stackoverflow like this http://stackoverflow.com/questions/16841469/redis-performance-tuning/23719773#23719773 and I can't find the solution.

redis-benchmark gives me next

redis-benchmark -s /tmp/redis.sock -q -n 10000 -c 1 -d 50
PING_INLINE: 51282.05 requests per second
PING_BULK: 64935.07 requests per second
SET: 39840.64 requests per second
GET: 28901.73 requests per second
INCR: 26525.20 requests per second
LPUSH: 29411.76 requests per second
LPOP: 23255.81 requests per second
SADD: 21551.72 requests per second
SPOP: 23041.47 requests per second
LPUSH (needed to benchmark LRANGE): 45045.04 requests per second
LRANGE_100 (first 100 elements): 10101.01 requests per second
LRANGE_300 (first 300 elements): 4196.39 requests per second
LRANGE_500 (first 450 elements): 3337.78 requests per second
LRANGE_600 (first 600 elements): 7320.64 requests per second
MSET (10 keys): 62111.80 requests per second

agentzh commented 10 years ago

@zavada What nginx variable are you using for the "request time" field in your access log? Is it $request_time? If yes, then the time specified by the $request_time variable includes the network I/O time between the client and your web server, which can be relatively large for slow downstream networks.

zavada commented 10 years ago

@agentzh Yes, I use $request_time. But if I comment the line

local categories, err = red:smembers("categories:"..clear_url)

I don't have in the result this >100ms time requests. Thats why I think that problem is not in network I/O time between the client and my web server.

Maybe do you have other idea?

agentzh commented 10 years ago

@zavada Then you'll have to use tools like systemtap and dtrace to grab more details from those slow request samples in various places: nginx processes, kernel network stack, and the redis-server process. See my sample tools based on systemtap in the nginx-systemtap-toolkit and stapxx projects on GitHub for some ideas.

agentzh commented 10 years ago

@zavada Also, $request_time may have some error due to the time caching mechanism in the nginx core. Use the ngx-lua-tcp-recv-time tool in my stapxx project to measure the upstream cosocket recv latency time accurately at real time:

https://github.com/openresty/stapxx#ngx-lua-tcp-recv-time

agentzh commented 10 years ago

@zavada Furthermore, watch the CPU usage of your nginx worker processes, if their CPU usage can hit 100% from time to time, then you might just run out of CPU time and such intermittent long latency is totally reasonable. Similarly, you should watch the CPU usage of your redis-server process as well.

If any process is exhausting the CPU time, then you should use the various flame graph sampling tools to analyze the bottleneck:

https://github.com/openresty/nginx-systemtap-toolkit#sample-bt

https://github.com/openresty/stapxx#lj-lua-stacks

Regards, -agentzh

agentzh commented 10 years ago

@zavada Sometimes, nginx may just block on some blocking I/O system calls (like file I/O) or semaphores used as some internal locks in nginx. Such blocking things can contribute to your cosocket latency as well. You can just the epoll-loop-blocking-distr tool to verify this:

https://github.com/openresty/stapxx#epoll-loop-blocking-distr

and further use the off-CPU time flame graph tool to analyze the causes:

https://github.com/openresty/nginx-systemtap-toolkit#sample-bt-off-cpu

zavada commented 10 years ago

@agentzh When I try to run your examples I have next error Checking "/lib/modules/2.6.32-431.el6.x86_64/build/.config" failed with error: No such file or directory On my centos server I have "/lib/modules/2.6.32-431.el6.x86_64/build/" but don't ".config" inside. Do you know how I can fix it?

Thank you for your time and help.

agentzh commented 10 years ago

Hello!

On Sun, Oct 12, 2014 at 2:34 AM, zavada wrote:

@agentzh When I try to run your examples I have next error Checking "/lib/modules/2.6.32-431.el6.x86_64/build/.config" failed with error: No such file or directory On my centos server I have "/lib/modules/2.6.32-431.el6.x86_64/build/" but don't ".config" inside. Do you know how I can fix it?

Have you installed the "kernel-devel" and "kernel-debuginfo" packages for your kernel actually being used?

See the following documentation for more details:

https://www.sourceware.org/systemtap/SystemTap_Beginners_Guide/using-systemtap.html#install-kinfo

Regards, -agentzh

openresty / lua-resty-redis

Latency problem with 1k req/sec #50