Closed liuyuan10 closed 4 years ago
If we use a single connection, we need to deal with connection resets and redialing. One clarifying question: is the reason we have a separate ncc process to grant different permissions to different processes? Trying to understand the design here.
@anfernee Because it's a local Unix socket connection, it's pretty stable. I've been running 1000 services for days and haven't noticed a failure. Even if one happens, seesaw_engine will restart, so we are not left in a dead state.
The ultimate fix is migrating to gRPC, which provides connection management, but that requires a bigger refactor to migrate everything to protocol buffers.
Can you show where we enforce the exit/panic, or is that on the caller? If one user of ncc_client didn't exit/panic on error, what would happen? @liuyuan10
@anfernee seesaw is written in a way that it fatals on those errors. For example:
https://github.com/google/seesaw/blob/master/engine/vserver.go#L880
Updated the comments to make the connection disruptions in ncc_client clear. Also added DrJosh9000 as a reviewer.
The change seems good, but I want to know more about the failure that led to this - surely it's not hitting a limit on concurrent connections to a Unix socket?
To me, it looks like the simplest fix for #48 would be to move e.ncc.Close() from a defer to a direct call after sending the gratuitous ARP.
@DrJosh9000 The real issue it's fixing is not #48.
It's that 1000 vserver threads keep dialing and closing connections, which chokes the kernel and eventually returns "resource temporarily unavailable" errors.
ncc_client was dialing and closing in a loop, which caused failures when a large number of vservers were all calling ncc_client.
This commit connects to the ncc server only once and reuses that single connection for all IPC calls from the engine. Tests show it can handle 1000 vservers.
Fixed #48