google / seesaw

Seesaw v2 is a Linux Virtual Server (LVS) based load balancing platform.
Apache License 2.0

Retry doesn't really happen when RPC health state update fails #37

Closed. unicell closed this issue 6 years ago

unicell commented 6 years ago

This is not obvious from the healthcheck logs, but whenever the healthcheck component fails to send a state update back to the engine over RPC, it fails 11 times in a row and then bails out.

I1107 23:16:05.285266    2552 core.go:503] Getting healthchecks from engine...
I1107 23:16:05.286748    2552 core.go:509] Engine returned 2 healthchecks
E1107 23:16:06.294586    2552 core.go:590] Send failed: read unix @->/var/run/seesaw/engine/engine.sock: i/o timeout
E1107 23:16:08.294828    2552 core.go:590] Send failed: read unix @->/var/run/seesaw/engine/engine.sock: i/o timeout
E1107 23:16:10.295082    2552 core.go:590] Send failed: read unix @->/var/run/seesaw/engine/engine.sock: i/o timeout
I1107 23:16:12.243872    2552 core.go:317] ID 0x7000000000001: (TCP 10.5.52.160:80 DSR (via 10.220.22.33 mark 65536)) FAILURE: Timed out
I1107 23:16:12.293796    2552 core.go:317] ID 0x7000000000000: (TCP 10.5.52.31:443 DSR (via 10.220.22.33 mark 65536)) FAILURE: Timed out
E1107 23:16:12.295279    2552 core.go:590] Send failed: read unix @->/var/run/seesaw/engine/engine.sock: i/o timeout
E1107 23:16:14.295485    2552 core.go:590] Send failed: read unix @->/var/run/seesaw/engine/engine.sock: i/o timeout
E1107 23:16:16.295747    2552 core.go:590] Send failed: read unix @->/var/run/seesaw/engine/engine.sock: i/o timeout
E1107 23:16:18.295980    2552 core.go:590] Send failed: read unix @->/var/run/seesaw/engine/engine.sock: i/o timeout
I1107 23:16:20.287014    2552 core.go:503] Getting healthchecks from engine...
I1107 23:16:20.288315    2552 core.go:509] Engine returned 2 healthchecks
E1107 23:16:20.296185    2552 core.go:590] Send failed: read unix @->/var/run/seesaw/engine/engine.sock: i/o timeout
F1107 23:16:20.296236    2552 core.go:580] send: 11 errors, giving up

unicell commented 6 years ago

Closing the issue, as the fix was merged in https://github.com/google/seesaw/commit/34716af0775ecb1fad239a726390d63d6b0742dd