networkservicemesh / sdk

Apache License 2.0
33 stars 36 forks source link

Proper resource cleanup for expired connections #1445

Open BenAgai opened 1 year ago

BenAgai commented 1 year ago

Hi, I wanted to ask a question regarding cleanup of expired connections made from an NSC to NSE.

Assuming NSC successfully established connection to an NSE. If the NSC runs on a pod that got terminated without the NSC initiating close flow, we will have leak across the data path for that connection.

Is there any built in mechanism the recover from such scenario? (for example expiration on a connection that is renewed every time refresh occurs, and once the connection is expired a close flow is initiated for it).

I encountered the following links that looks relevant, but wasn’t sure: https://github.com/networkservicemesh/deployments-k8s/issues/8882 https://github.com/networkservicemesh/sdk/pull/1439 https://github.com/networkservicemesh/sdk/pull/1440

I will be glad to know if the issue is resolved and how, or whether should I implement such mechanism. Thanks!

glazychev-art commented 1 year ago

Hi @BenAgai, That's right, we have such functionality. There is timeout chain element that will call Close for inactive connections. This happens after the token expires (default 10 minutes). If the connection is alive, then refreshes occur periodically, and this doesn't allow the connection to expire. If the NSC didn't initiate Close when the pod ended, this Close will occur when the timeout occurs.

BenAgai commented 1 year ago

Hi @glazychev-art , Thank you very much for the update!

or-adar commented 1 year ago

@glazychev-art I'm reusing this thread as I have some questions about the timeout chain element which are unclear to me: if I don't want to use the default 10 minutes and want to use a lower value, like 1 minue - what should be considered? I mean, if the timeout element only considers the expiration time of it's previous path segment - should I align the env of NSM_MAX_TOKEN_LIFETIME for all the components in the path (NSC, NSMgr, forwarders, passthrough, NSEs), so the new value is same for all, or are there particular components that should be enough to only set their values (while keeping the default value for others)?

glazychev-art commented 1 year ago

@or-adar You are right, currently the timeout element only considers it's previous path segment. If the timeout calls Close - this call goes though all components.

So, for example, if you set NSM_MAX_TOKEN_LIFETIME=1m only for NSC, then NSMgr calls Close after 1m (if refreshes do not take place regularly). Other components (forwarders, passthrough, NSEs) will also receive this Close from the NSMgr, but not from its timeout element (because they use NSM_MAX_TOKEN_LIFETIME=10m).

For the normal case, this is sufficient. Some questions may arise in the case of healing. For example, NSC and its NSMgr die. In this case other components will clean up their resources only on their own timeouts (i.e after 10m). Because the only one (NSMgr) who was responsible for clearing resources after 1 minute died.

or-adar commented 1 year ago

@glazychev-art appreciate the response! so if I got you right - if I want NSEs & Forwarders to clean up their resources after 1m, it is more recommended to set the NSM_MAX_TOKEN_LIFETIME to 1m for all components that might be included in the PathSegment (NSC, NSMgr, forwarders, passthrough, NSEs), to accomodate for cases where components die?

glazychev-art commented 1 year ago

Yes, that's right