google / seesaw

Seesaw v2 is a Linux Virtual Server (LVS) based load balancing platform.
Apache License 2.0
5.63k stars 511 forks source link

use single connection in ncc_client #76

Closed liuyuan10 closed 4 years ago

liuyuan10 commented 4 years ago

ncc_client was dialing and closing in a circle that caused failure when large amounts of vservers were all calling ncc_client.

This commit will only connect ncc server once and reuse the same single connection for all ipc calls from engine. Tests show that it's able to handle 1000 vservers.

Fixed #48

anfernee commented 4 years ago

If we use single connection, we need to deal with connection reset and redial. One clarifying question: the reason we have separate ncc process is to grant different permission to different processes? Trying to understand the design here.

liuyuan10 commented 4 years ago

@anfernee Because it's a local unix socket connection, it's pretty stable. I've been running 1000 services for days and haven't notice a failure. Even if it happens the seesaw_engine will restart so we are not in a dead state.

the ultimate fix is migrating to gRPC that provides connection management which requires bigger refactor to migrate everything to protobuffer.

anfernee commented 4 years ago

can you show where we enforce the exit/panic, or it's on the caller? If one usage of ncc_client didn't exit/panic on error, what would happen? @liuyuan10

liuyuan10 commented 4 years ago

@anfernee seesaw is written in a way that it fatals on those errors. For example:

https://github.com/google/seesaw/blob/master/engine/vserver.go#L880

liuyuan10 commented 4 years ago

updated the comments to make it clear the about the connection disruptions in ncc_client Also added DrJosh9000 to review

DrJosh9000 commented 4 years ago

The change seems good, but I want to know more about the failure that led to this - surely it's not hitting a limit on concurrent connections to a Unix socket?

To me, looks like the simplest fix to #48 would be to move e.ncc.Close() from a defer to a direct call after sending the gratuitous ARP.

liuyuan10 commented 4 years ago

@DrJosh9000 The real issue it's fixing is not #48

It's when 1000 vserver threads keeps dailing and closing connections which chokes the kernel and finally returns temporary unavailable errors.