networkservicemesh / sdk

Race condition in client exclude prefixes #1674

Closed Ex4amp1e closed 1 month ago

Ex4amp1e commented 1 month ago

Expected Behavior

All clients should connect to the NSE.

Current Behavior

After endpoint healing, there is a case where one of the clients cannot get an IP address.

Steps to Reproduce

  1. Restart the endpoint several times.
  2. Check ifconfig and the client logs. At some point one of the clients gets its IP back normally after healing, while the other has its expected IP included in the excluded prefixes list, so it gets the IP and immediately has it removed, in a loop.

Context

Set up 2 clients and 1 endpoint.

This is probably not required, but the issue was reproduced with custom endpoint environment variables:

        - name: NSM_IPAM_POLICY
          value: strict
        - name: NSM_CIDR_PREFIX
          value: 172.16.1.100/27,2001:db8::/116

Failure Logs

Broken NSC:

Oct  1 10:08:26.868 [ERRO] [ExcludedPrefixesClient:Request] [cmd:[/bin/app]] Source or destination IPs are overlapping with excluded prefixes, srcIPs: [172.16.1.99/32 2001:db8::3/128], dstIPs: [172.16.1.98/32 2001:db8::2/128], excluded prefixes: [172.16.1.99/32 2001:db8::3/128 172.16.1.98/32 2001:db8::2/128], error: IP 172.16.1.99 is excluded, but it was found in response IPs
Oct  1 10:08:26.878 [ERRO] [cmd:[/bin/app]] policy failed: policies/common/tokens_expired.rego
Oct  1 10:08:26.882 [ERRO] [cmd:[/bin/app]] policy failed: policies/common/tokens_expired.rego
Oct  1 10:08:26.883 [WARN] [cmd:[/bin/app]] Environment variable NODE_NAME is not set. Skipping.
Oct  1 10:08:26.883 [WARN] [cmd:[/bin/app]] The label podName was already assigned to alpine-2. Skipping.
Oct  1 10:08:26.883 [WARN] [cmd:[/bin/app]] Environment variable CLUSTER_NAME is not set. Skipping.
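
To make the first error line easier to read, here is a minimal Go sketch of the kind of overlap check it describes. This is an assumption about the logic, not the actual `ExcludedPrefixesClient` code: `firstExcluded` and `mustParse` are hypothetical names, and only the standard `net/netip` package is used. The inputs are copied from the log line above.

```go
package main

import (
	"fmt"
	"net/netip"
)

// mustParse is a hypothetical helper that converts CIDR strings to prefixes.
func mustParse(cidrs ...string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		out = append(out, netip.MustParsePrefix(c))
	}
	return out
}

// firstExcluded is a hypothetical check (not the SDK implementation). It returns
// the first response IP that overlaps any excluded prefix, if there is one.
func firstExcluded(responseIPs, excluded []netip.Prefix) (netip.Addr, bool) {
	for _, ip := range responseIPs {
		for _, ex := range excluded {
			if ex.Overlaps(ip) {
				return ip.Addr(), true
			}
		}
	}
	return netip.Addr{}, false
}

func main() {
	srcIPs := mustParse("172.16.1.99/32", "2001:db8::3/128")
	excluded := mustParse("172.16.1.99/32", "2001:db8::3/128", "172.16.1.98/32", "2001:db8::2/128")

	if ip, found := firstExcluded(srcIPs, excluded); found {
		fmt.Printf("IP %s is excluded, but it was found in response IPs\n", ip)
	}
}
```

With the values from the log, the client's own source address `172.16.1.99/32` is present in its excluded prefixes list, so a check like this rejects every response that assigns that address.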

Cluster dump: dump-policy.zip

Ex4amp1e commented 1 month ago

Plan:

  1. Write a unit test to reproduce the issue
  2. Provide a fix
  3. Check other tests

denis-tingaikin commented 1 month ago

We should use 2 clients here: https://github.com/networkservicemesh/deployments-k8s/tree/main/examples/features/ipam-policies

Ex4amp1e commented 1 month ago

The previous plan has been completed:

TODO:

@denis-tingaikin PTAL