etcd Find seems unreliable when watch is enabled

I have some questions and possible concerns with etcdNSERegistryServer's Find feature when the query has watch enabled.

My goal would be to monitor creation and deletion of all NSEs in a particular network service and handle the NSM connections independently for each NSE.

For this, we rely on NetworkServiceEndpointRegistryClient.Find to create a stream to read events from.

1., If I got it right, this stream will get closed in 60 seconds latest due to how etcdNSERegistryServer's watch() function creates the Watch object. Meaning the user must call NetworkServiceEndpointRegistryClient.Find again periodically.

My concerns with this approach is, that it does not look feasible when the goal is to track NSE removals reliably. Because an NSE might unregister in that gap when the the previous stream is already closed but the new one is not established yet.

I guess there can be a similar gap when the user stream gets terminated because the local nsmgr disappears e.g. due to upgrade. (The NSE might have run on a different worker than the user watching it.) Similarly, the registry POD serving a watch might disappear as well.

2., Find allows removal of expired NSEs by checking their expiration time when listing them from etcd. If this Find call that takes care of such NSE removal has a request with watch enabled, the user won't be notified about the removal because the watch part is not running yet, and no Deleted response is sent on the stream when doing removal.

3., etcdNSERegistryServer's handleWatcher() function enforces a weird version check between the etcd NSE event and the locally stored NSE item. And skips events if the above two match.

This causes a really weird behaviour: There will be a race between active Find watch calls and Registers (the latter updating the locally stored version of NSE). Thus in case of multiple active Find watch calls, some of the watchers might skip events that others might accept depending on when the local version will get updated in the sync Map.

I fail to see the purpose if this version check. Maybe the intention was to implement some kind of damper not to send too many events, but IMHO it might be a faulty design. There could be dozens of Find streams parallel with watch enabled, and they all must be informed anyways. Luckily, due to the 1 minute timeout of the Watcher object the Find stream only lives for a minute tops, thus when the user calls Find the next time and the code fetches the NSEs using List action then they finally can be informed (although with some delay) about available NSE.

Btw., the number k8s-registry replicas can add an extra funkiness depending where NSE Registers and Finds are served.

Fortunately, removal does not seem to be affected by this version check, because Delete changes the resourceVersion.

Summary:

As a summary - if I haven't misinterpreted sg. - in case of multiple watcher some users watching NSEs might miss removal notifications due to 1. and 2. And an issue of minor importance can occur due to 3. where create notification might be received with some delay. Thus, for me it seems if I wanted to reliably keep track of NSE removals, my code should execute an additional Find call without watch to get hold of the current set of available NSEs whenever my watch stream returns so to say. (And compare that set against the NSEs my code is aware of.)

networkservicemesh / sdk-k8s

etcd Find seems unreliable when watch is enabled #512

Summary: