giltene / wrk2

A constant throughput, correct latency recording variant of wrk
Apache License 2.0

startup causing what appear to be false timeouts #56

Open ckstettler opened 6 years ago

ckstettler commented 6 years ago

To troubleshoot, I am using a server that returns after a 1 second sleep. The payload of the response is something like the one below, where health.count is a Java AtomicInteger that is incremented every time the resource is invoked:

{"status":"UP","customHealthCheck":{"status":"UP","health.count":8}

My wrk command line is:

wrk -s chip.lua -R 1 -t 1 -c 10 -d 30 http://localhost:8080/cc-auth-gateway/health

The lua script is simple:

function init(args)
   requests = 0
   responses = 0
   wrk.method = 'GET'
   wrk.headers["Api-Key"] = "RTM"
   wrk.headers["User-Id"] = "TEST"
   wrk.headers["Accept"] = "application/json;v=1"
   wrk.headers["Content-Type"] = "application/json;v=1"
end

function response(status, headers, body)
   --print(status, body)
end

I put the line labeled "stopped" in response_complete() and the "timeout" line in check_timeouts(). The "stopped" line shows the diff between c->start and now, which is well under a second. It also shows the body being the same for nearly all of the connections.

In addition, health.count is identical across the responses. If each request had really reached my server, health.count would have shown an incrementing value.

The server logs showed that the endpoint was called only once and not the 10 times that would have been expected.

.. ..

Running 30s test @ http://localhost:8080/cc-auth-gateway/health
  1 threads and 10 connections
stopped: conn: 0x7fe4c50020f8 now:1522155115404266 start:1522155115402504 diff:1762 status: 200 parser: 0x7fe4c5002100 body: 0x7fe4c5800400 body:{"status":"UP","customHealthCheck":{"status":"UP","health.count":13},"diskSpace"
stopped: conn: 0x7fe4c50041f0 now:1522155115409766 start:1522155115408162 diff:1604 status: 200 parser: 0x7fe4c50041f8 body: 0x7fe4c6000400 body:{"status":"UP","customHealthCheck":{"status":"UP","health.count":13},"diskSpace"
stopped: conn: 0x7fe4c50062e8 now:1522155115415657 start:1522155115413599 diff:2058 status: 200 parser: 0x7fe4c50062f0 body: 0x7fe4c6000c00 body:{"status":"UP","customHealthCheck":{"status":"UP","health.count":13},"diskSpace"
stopped: conn: 0x7fe4c50083e0 now:1522155115419798 start:1522155115418591 diff:1207 status: 200 parser: 0x7fe4c50083e8 body: 0x7fe4c5800c00 body:{"status":"UP","customHealthCheck":{"status":"UP","health.count":13},"diskSpace"
stopped: conn: 0x7fe4c500a4d8 now:1522155115424255 start:1522155115423011 diff:1244 status: 200 parser: 0x7fe4c500a4e0 body: 0x7fe4c5014e00 body:{"status":"UP","customHealthCheck":{"status":"UP","health.count":13},"diskSpace"
stopped: conn: 0x7fe4c500c5d0 now:1522155115429655 start:1522155115428230 diff:1425 status: 200 parser: 0x7fe4c500c5d8 body: 0x7fe4c5015200 body:{"status":"UP","customHealthCheck":{"status":"UP","health.count":13},"diskSpace"
stopped: conn: 0x7fe4c500e6c8 now:1522155115435156 start:1522155115433592 diff:1564 status: 200 parser: 0x7fe4c500e6d0 body: 0x7fe4c3013200 body:{"status":"UP","customHealthCheck":{"status":"UP","health.count":13},"diskSpace"
stopped: conn: 0x7fe4c50107c0 now:1522155115438764 start:1522155115437429 diff:1335 status: 200 parser: 0x7fe4c50107c8 body: 0x7fe4c3013a00 body:{"status":"UP","customHealthCheck":{"status":"UP","health.count":13},"diskSpace"
stopped: conn: 0x7fe4c50128b8 now:1522155115445146 start:1522155115443150 diff:1996 status: 200 parser: 0x7fe4c50128c0 body: 0x7fe4c6001400 body:{"status":"UP","customHealthCheck":{"status":"UP","health.count":13},"diskSpace"
stopped: conn: 0x7fe4c5000000 now:1522155116411484 start:1522155115397215 diff:1014269 status: 200 parser: 0x7fe4c5000008 body: 0x7fe4c6800800 body:{"status":"UP","customHealthCheck":{"status":"UP","health.count":14},"diskSpace"
timeout conn:0 connp: 0x7fe4c5000000 now:1522155117447507 start:1522155115397215 diff:2050292
timeout conn:1 connp: 0x7fe4c50020f8 now:1522155117447507 start:1522155115402504 diff:2045003
timeout conn:2 connp: 0x7fe4c50041f0 now:1522155117447507 start:1522155115408162 diff:2039345
timeout conn:3 connp: 0x7fe4c50062e8 now:1522155117447507 start:1522155115413599 diff:2033908

ckstettler commented 6 years ago

The duplicate output I was focused on appears to be behavior of the server, almost like it is caching. The access logs show quick responses, and the requests did not go into my Java handler that does the incrementing. So this part of my tale looks like a red herring. I still don't understand the timeout behavior and am still looking.

ckstettler commented 6 years ago

I figured it out. It is primarily related to the number of connections relative to the low rate. With -R 1 and -c 10, each connection sends only about once every 10 seconds, so the c->start timestamp left over from a connection's previous send soon becomes older than the timeout window (which looks to be about 2 seconds here, judging by the ~2,050,000 µs diffs in the log above), and check_timeouts() reports a timeout even though the response arrived promptly.

I added c->start = 0 in response_complete(), plus a check in check_timeouts() to skip connections whose start is 0, and then things worked fine.
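For readers landing here later, here is a minimal, self-contained sketch of that idea. It is not the wrk2 source: the connection struct, the 2 s timeout constant, and the function signatures below are simplified stand-ins for the response_complete()/check_timeouts()/c->start mentioned in this thread.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define TIMEOUT_US 2000000ULL   /* assumed ~2 s timeout window, in line with the diffs in the log above */

typedef struct {
    uint64_t start;             /* usec timestamp of the last send; 0 means nothing in flight */
} connection;

/* On a completed response, clear the timestamp so an idle connection
   cannot later be mistaken for a timed-out one. */
static void response_complete(connection *c) {
    c->start = 0;
}

/* Periodic timeout sweep: skip connections with nothing outstanding instead of
   comparing "now" against a timestamp left over from the previous request. */
static void check_timeouts(connection *conns, size_t n, uint64_t now_us) {
    for (size_t i = 0; i < n; i++) {
        connection *c = &conns[i];
        if (c->start == 0)
            continue;                        /* idle: no request in flight */
        if (now_us - c->start > TIMEOUT_US)
            printf("timeout conn:%zu diff:%llu\n",
                   i, (unsigned long long)(now_us - c->start));
    }
}

int main(void) {
    connection conns[2] = { { .start = 0 }, { .start = 1000000 } };
    response_complete(&conns[1]);            /* response arrived: clears start */
    check_timeouts(conns, 2, 5000000);       /* prints nothing: both connections are idle */
    return 0;
}

In the real code the send path would of course still set c->start when a request is actually written, so genuine timeouts are still detected.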

ckstettler commented 6 years ago

I have updated the code and can submit a PR if you agree with the approach and are interested in the fix.

Geal commented 6 years ago

hello, here's a friendly ping about this. I think I encountered the same problem (as outlined in https://github.com/sozu-proxy/sozu/issues/424 ). From my tests (performed over the network and on loopback), it appears that at low rates wrk2 works in batches. For example, with ./wrk -L -d 20s -c 5 -t 1 -R 1 http://unhandledexpression.com it sends 5 requests in the first second, then waits 5 seconds, then sends 5 new requests (on the same keep-alive connections), then waits another 5 seconds.
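A back-of-the-envelope check of that pattern, assuming (as an assumption, not a reading of the wrk2 source) that the aggregate -R rate is spread evenly across the -c connections:

#include <stdio.h>

int main(void) {
    double rate = 1.0;      /* -R 1: one request per second overall */
    int connections = 5;    /* -c 5 */
    double per_conn_interval = connections / rate;   /* seconds between sends on one connection */
    printf("each connection sends roughly every %.0f s\n", per_conn_interval);
    /* If all 5 connections start at about the same moment, their sends coincide:
       a burst of 5 requests, then a ~5 s gap, which matches the observation above. */
    return 0;
}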

jberryman commented 4 years ago

@ckstettler a PR would be appreciated even if the author doesn't merge it.

I wonder if there is a workaround? And is latency reported correctly when erroneous timeouts get reported?