networkservicemesh / sdk-k8s

This repo is for Kubernetes SDK code
Other
2 stars 14 forks source link

etcd Find seems unreliable when watch is enabled #512

Open zolug opened 1 month ago

zolug commented 1 month ago

I have some questions and possible concerns with etcdNSERegistryServer's Find feature when the query has watch enabled.

My goal would be to monitor creation and deletion of all NSEs in a particular network service and handle the NSM connections independently for each NSE.

For this, we rely on NetworkServiceEndpointRegistryClient.Find to create a stream to read events from.

1., If I got it right, this stream will get closed in 60 seconds latest due to how etcdNSERegistryServer's watch() function creates the Watch object. Meaning the user must call NetworkServiceEndpointRegistryClient.Find again periodically.

My concerns with this approach is, that it does not look feasible when the goal is to track NSE removals reliably. Because an NSE might unregister in that gap when the the previous stream is already closed but the new one is not established yet.

I guess there can be a similar gap when the user stream gets terminated because the local nsmgr disappears e.g. due to upgrade. (The NSE might have run on a different worker than the user watching it.) Similarly, the registry POD serving a watch might disappear as well.

2., Find allows removal of expired NSEs by checking their expiration time when listing them from etcd. If this Find call that takes care of such NSE removal has a request with watch enabled, the user won't be notified about the removal because the watch part is not running yet, and no Deleted response is sent on the stream when doing removal.

3., etcdNSERegistryServer's handleWatcher() function enforces a weird version check between the etcd NSE event and the locally stored NSE item. And skips events if the above two match.

This causes a really weird behaviour: There will be a race between active Find watch calls and Registers (the latter updating the locally stored version of NSE). Thus in case of multiple active Find watch calls, some of the watchers might skip events that others might accept depending on when the local version will get updated in the sync Map.

I fail to see the purpose if this version check. Maybe the intention was to implement some kind of damper not to send too many events, but IMHO it might be a faulty design. There could be dozens of Find streams parallel with watch enabled, and they all must be informed anyways. Luckily, due to the 1 minute timeout of the Watcher object the Find stream only lives for a minute tops, thus when the user calls Find the next time and the code fetches the NSEs using List action then they finally can be informed (although with some delay) about available NSE.

Btw., the number k8s-registry replicas can add an extra funkiness depending where NSE Registers and Finds are served.

Fortunately, removal does not seem to be affected by this version check, because Delete changes the resourceVersion.

Summary:

As a summary - if I haven't misinterpreted sg. - in case of multiple watcher some users watching NSEs might miss removal notifications due to 1. and 2. And an issue of minor importance can occur due to 3. where create notification might be received with some delay. Thus, for me it seems if I wanted to reliably keep track of NSE removals, my code should execute an additional Find call without watch to get hold of the current set of available NSEs whenever my watch stream returns so to say. (And compare that set against the NSEs my code is aware of.)

zolug commented 1 month ago

Btw, I don't think my idea to simply fetch the list of available NSEs with a second Find call (without watch) could work without problems.

That's because IMHO the user can't really tell if it received all NSEs e.g. when the server had closed the user stream because the server lost its connection to the registry. The helper function registry.ReadNetworkServiceChannel simply transfers responses from the registry side stream but is not bothered by informing the user about non EOF error. (Also, no data count information is forwarded at the start.) So, it's hard to tell if an NSE have been really unregistered.

NSMgr crash or NSMgr POD delete seems to be covered, as the user side stream at least will get a non EOF error in such cases. (But maybe in case of graceful termination this is mere luck due to the current implementation details.)