networkservicemesh / cmd-forwarder-vpp

Investigate why forwarder could restart too long #664

Open denis-tingaikin opened 2 years ago

denis-tingaikin commented 2 years ago

Steps to reproduce

Repeat steps from https://drive.google.com/file/d/1Pwd7T6iYPItXEqjx_Yd5NfPE-T0zgUy5

glazychev-art commented 2 years ago

Most likely, this behavior can be explained and there are actually no outliers here. As I understand it, the concern is about page 11 of the document - "Upgrade from NSM v1.2.0 to v1.4.0" - forwarder-vpp-6t7x7.

The forwarder will be in a ready state only when it can serve gRPC clients. We have the forwarder readinessProbe: command: ["/bin/grpc-health-probe", "-spiffe", "-addr=unix:///listen.on.sock"]. But before that, it needs to get certificates from the spire-agent, which usually takes about 3-5 seconds.
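
For reference, the same probe can be run by hand to see when the forwarder actually becomes ready. This is only a sketch, assuming the default nsm-system namespace and using the pod name mentioned above as an example:

# Runs the same command the kubelet uses for the readinessProbe (namespace/pod name are illustrative)
kubectl exec -n nsm-system forwarder-vpp-6t7x7 -- /bin/grpc-health-probe -spiffe -addr=unix:///listen.on.sock
# Exit code 0 means the forwarder is serving gRPC; it stays non-zero until the SPIRE certificates have arrived
echo $?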

On page 11 we can see that the NSM upgrade happens gradually, including the spire-agents. But what if the forwarder upgrade happened before the spire-agent? Let's take a look at the attached screenshot (Screenshot from 2022-07-04 14-49-24_2):

spire-agent-gqgr2 and forwarder-vpp-6t7x7 may be on the same node. I think the difference between the readiness of the pods (about 3 seconds) is understandable (besides, we use periodSeconds: 1).

szvincze commented 1 year ago

Coming back to this topic after a long time again...

Since the presentation there have been several changes in our configuration that made the upgrade faster. But the question about the forwarder startup time - or mainly the traffic outage period - is still valid.

In the worker node reboot and forwarder-vpp pod restart scenarios we observe quite big differences in how long the traffic does not work. I checked the startup time of the forwarder-vpp pods after a worker node reboot and after a forwarder-vpp pod restart: the forwarder's startup phases completed within 4-5 seconds, which seems normal. In a small and simple configuration it takes a few additional seconds to handle traffic again. But sometimes traffic recovers just 10 seconds after the pod restart, sometimes it takes ~20 seconds, and sometimes even more than 30 seconds.

We also measured it in a bigger environment where we got various figures between 60 and 120 seconds or even more.

Is there a theory that could explain these huge differences?

szvincze commented 11 months ago

I managed to reproduce the issue where, after a forwarder restart, the traffic is not recovered until a few minutes later. After some investigation it turned out that when the forwarder restarted, only one of the NSM interfaces was re-created in the NSC pod, therefore the traffic failed for the affected network service.

The setup is based on the NSM v1.11.1 basic kernel-to-ethernet-to-kernel example. I used a kind cluster with 4 worker nodes. I deployed 4 NSCs (one per node) and 2 NSEs (one for IPv4 and one for IPv6); their location does not matter. I added a container with the ctraffic tool to the NSC and NSE pods; for this purpose I used the registry.nordix.org/cloud-native/nsm/rvm-tester:ctraffic image.

Then I started two ctraffic servers in the NSC pods. For example, in nsc-kernel-5f485bf7c-8gz6c:

ctraffic -server -address 172.16.1.1:5003&
ctraffic -server -address [100:100::1]:5003&

On the NSEs I ran ctraffic as clients for the matching servers on the NSCs, like below:

kubectl exec pods/nse-kernel4-5dd67f7f46-pjrn8 -n ns-kernel2ethernet2kernel -c ctraffic -- ctraffic -address 172.16.1.1:5003 -monitor -nconn 100 -rate 50 -timeout 600s

These clients send traffic to the addressed server via 100 connections with a rate of 50 KB/s for 10 minutes. (It can happen that sometimes the ctraffic client reaches the reconnect limit [1000] and needs to be restarted.)
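
For the IPv6 service the client can be started the same way against the second server address; the NSE pod name below is only a placeholder:

kubectl exec pods/nse-kernel6-xxxxxxxxxx-xxxxx -n ns-kernel2ethernet2kernel -c ctraffic -- ctraffic -address [100:100::1]:5003 -monitor -nconn 100 -rate 50 -timeout 600s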

When all the instances are running and connected properly, and the traffic is steady (no failing or connecting state, and no drops), the tool shows output similar to this:

Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25519/24819/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25567/24867/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25619/24919/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25667/24967/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25719/25019/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25767/25067/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25819/25119/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25867/25167/0

Then I started deleting the forwarder pods one by one in a cycle: I always wait until the traffic is steady again after the new forwarder has started up, then delete the next one and check whether the traffic recovered, like below (a sketch of the cycle follows the output):

Conn act/fail/connecting: 59/541/41, Packets send/rec/dropped: 8405/7806/0
Conn act/fail/connecting: 1/599/99, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 100/600/0, Packets send/rec/dropped: 8426/7826/0
Conn act/fail/connecting: 100/600/0, Packets send/rec/dropped: 8472/7872/0
Conn act/fail/connecting: 100/600/0, Packets send/rec/dropped: 8526/7926/0
Conn act/fail/connecting: 100/600/0, Packets send/rec/dropped: 8572/7972/0
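
A minimal sketch of this delete-and-wait cycle (the pod label and namespace are assumptions based on the default deployments-k8s manifests; the traffic check itself is done by watching the ctraffic output):

for fwd in $(kubectl get pods -n nsm-system -l app=forwarder-vpp -o name); do
  kubectl delete -n nsm-system "$fwd"
  sleep 5   # give the DaemonSet a moment to create the replacement pod
  kubectl wait --for=condition=Ready pod -l app=forwarder-vpp -n nsm-system --timeout=120s
  # ...then watch the ctraffic output until "Conn act/fail/connecting: 100/.../0" is steady again before deleting the next one
done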

After some iterations the traffic to one of the NSCs did not reconnect until a few minutes later. When I checked the NSM interfaces in the NSC I realized that the one for IPv6 traffic was missing. In another successful reproduction attempt the same happened, but with the one for the IPv4 service. And after a few minutes the interface was recreated and the traffic connected again.

I tried to reproduce it with fewer nodes, fewer NSEs, and also with ping instead of ctraffic, without success; however, most probably there are more ways to catch the same issue.

I collected some logs but unfortunately the others were already rotated.

@denis-tingaikin: Can you please check if there could be a fault in the healing mechanism that causes this behavior?

glazychev-art commented 11 months ago

@szvincze I looked at the logs. Question: is NSM_LIVENESSCHECK enabled for NSC?

szvincze commented 11 months ago

@szvincze I looked at the logs. Question: is NSM_LIVENESSCHECK enabled for NSC?

No, it is disabled in this case.

glazychev-art commented 11 months ago

@szvincze We found and fixed one potential problem. To test it you should use the latest cmd-nsc image: ghcr.io/networkservicemesh/ci/cmd-nsc:a99d3e4

Most likely this will be enough for the problem you described. If the problem still exists, then you can also try updating NSMgr and Forwarder-vpp (ghcr.io/networkservicemesh/ci/cmd-nsmgr:976d9e6, ghcr.io/networkservicemesh/ci/cmd-forwarder-vpp:fe97eed).
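
If it helps, one way to switch a running deployment to these test images is kubectl set image. The resource kinds, container names and namespaces below are assumptions based on the default deployments-k8s manifests, so adjust them to your setup:

kubectl set image -n ns-kernel2ethernet2kernel deployment/nsc-kernel nsc=ghcr.io/networkservicemesh/ci/cmd-nsc:a99d3e4
# only if the problem still exists after the NSC update:
kubectl set image -n nsm-system daemonset/nsmgr nsmgr=ghcr.io/networkservicemesh/ci/cmd-nsmgr:976d9e6
kubectl set image -n nsm-system daemonset/forwarder-vpp forwarder-vpp=ghcr.io/networkservicemesh/ci/cmd-forwarder-vpp:fe97eed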

szvincze commented 11 months ago

@glazychev-art: Our test results show the fix is working fine. Do you think the additional NSMgr and forwarder-vpp fixes are also needed for some reason? What do they improve?

glazychev-art commented 11 months ago

Cool, thanks! I think yes. They improve the transmission of the initial monitor event from component to component.

szvincze commented 10 months ago

Could there be a backward compatibility issue if the forwarder-vpp and nsmgr contain this update but the NSC does not?

denis-tingaikin commented 10 months ago

I quickly checked and found that we missed the commit with the fix in the release candidate v1.11.2-rc.1 https://github.com/networkservicemesh/sdk/compare/release/v1.11.1...release/v1.11.2-rc.1

The good news is that it was just a release candidate. 

I'll create v1.11.2-rc.2 in a few hours with the missed fix.

denis-tingaikin commented 10 months ago

@szvincze

Now we have the missed 530e439255e773c2b92bc7fb8ccedba1a38e188a commit https://github.com/networkservicemesh/sdk/compare/release/v1.11.1...release/v1.11.2-rc.2

Release v1.11.2-rc.2 is ready https://github.com/networkservicemesh/deployments-k8s/tree/release/v1.11.2-rc.2

Testing is in progress https://github.com/networkservicemesh/integration-k8s-kind/actions/runs/7078369907

szvincze commented 10 months ago

It was really quick, thanks!

My original question is still valid. Does it cause any problem if the NSC is upgraded after the NSM system components?

glazychev-art commented 10 months ago

There will be no backward compatibility issues, but the original problem will not be fixed - the connection will still take time to recover.

szvincze commented 10 months ago

Thanks. It was clear that to fix the original problem we need the new NSC but it is important to keep backward compatibility.

szvincze commented 10 months ago

Hi @glazychev-art & @denis-tingaikin,

It seems that the problem still occurs with rc.2. However the previous test image (ghcr.io/networkservicemesh/ci/cmd-nsc:a99d3e4) solved the issue.

Can you please double-check what could cause this behavior with rc.2?

glazychev-art commented 10 months ago

Hi @szvincze, it would be cool if you could attach the logs, because rc.2 has the required fix, as far as I can see.

denis-tingaikin commented 10 months ago

Hello @szvincze 

If ghcr.io/networkservicemesh/ci/cmd-nsc:a99d3e4 solves the problem, then rc.1 contains a needed fix. The problem on our side is that we can't test it.

Do you have results for v1.11.2-rc.1?

szvincze commented 10 months ago

Hi @denis-tingaikin,

Yes, it seems that the issue is reproducible with rc.1 too. However, today I got some confusing news. Previously the forwarder restart tests went well with the specific NSC image that @glazychev-art created, but now that also shows the symptoms. I should note here that I did not manage to reproduce the issue myself, either with that image or with rc.1 and rc.2; in my environment I only use pure NSM components. So, I will provide logs from my colleagues' environment soon. I can imagine that this could be another corner case of a corner case...

denis-tingaikin commented 10 months ago

I can imagine that this could be another corner case of a corner case...

Yeah.. It could be another problem, we can prove it with logs.

So, I will provide logs from my colleagues' environment soon.

Perfect!

So I think our plan for now is:

  1. Double check that ghcr.io/networkservicemesh/ci/cmd-nsc:a99d3e4 solves the problem.
  2. Attach logs from v1.11.2-rc.2 images.

szvincze commented 10 months ago

Here I attach the logs. Note that the ghcr.io/networkservicemesh/ci/cmd-nsc:a99d3e4 NSC image was used for this test, while the rest of the NSM images came from v1.11.2-rc.2.

The setup is similar to the one in my previous comment.

The forwarder instance was restarted on node 6 at 2023-12-04 22:34:08 which took 32 seconds, meaning the new one came up around 22:34:40.

Most of the connections were restored except for one traffic instance (see nsc-76598446f6-rsnzl.txt in the attached logs), and it recovered after more than 7 minutes.

denis-tingaikin commented 10 months ago

Am I getting it correct that previously we got false positive results?

szvincze commented 10 months ago

Am I getting it correct that previously we got false positive results?

Actually it seems the situation is better than before, but there are still attempts of this test case that fail, and the recovery of the traffic takes a few minutes (most probably until the token's lifetime expires).

denis-tingaikin commented 10 months ago

I've started looking into logs and have a few questions.

Q1: Are you using your own images based on the NSM? 

If so, are you using the ENTRYPOINT from the forwarder Dockerfile https://github.com/networkservicemesh/cmd-forwarder-vpp/blob/main/Dockerfile#L40?

Correct k8s signal handling could reduce the time of restarts for forwarders.
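
A quick, hedged way to check this on a running forwarder: with an exec-form ENTRYPOINT (as in the linked Dockerfile) the forwarder binary is PID 1 and receives SIGTERM from the kubelet directly. The pod and namespace names below are examples, and the check assumes a shell and procfs are available in the image:

# Should print the forwarder binary, not "sh" or another wrapper process
kubectl exec -n nsm-system forwarder-vpp-xxxxx -- cat /proc/1/comm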

Q2: Also, is it possible to share the describes of the forwarders/endpoints/clients, with all secret details removed?

szvincze commented 10 months ago

Hi @denis-tingaikin,

Thanks for checking the logs.

Note that even though the issue was originally about slow forwarder restarts, it has since turned out that the forwarder restarts are quite fast, but sometimes one (or more) traffic instance cannot recover its traffic after the forwarder is deleted and a new one has started. In this case there are two clients on the same node and only one of them suffers a traffic outage for several minutes; the traffic is restored for the other one a few seconds after the new forwarder started. So, it is something with the healing and not really with the forwarder itself.

Q1: Are you using your own images based on the NSM? 

Yes, the images are built from the original NSM source.

Q2: Also, is it possible to share the describes of the forwarders/endpoints/clients, with all secret details removed?

Here are the describes, one for each kind: forwarder-vpp.yaml nsc.yaml nse-ipv4.yaml

denis-tingaikin commented 10 months ago

Hello @szvincze 

I've found a few problems with the current healing implementation. I'll create a proposal for simplifying and reworking the healing soon.

For the moment, we've merged a workaround that stabilises the case for me: https://github.com/networkservicemesh/sdk/pull/1562

What I can suggest at this moment is:

Most of the connections were restored except for one traffic instance (see nsc-76598446f6-rsnzl.txt in the attached logs), and it recovered after more than 7 minutes.

This problem can be fixed by these options:

  1. Enable datapath healing for NSC https://github.com/networkservicemesh/cmd-nsc/blob/main/internal/config/config.go#L52 (this could work with v1.11.1, v1.11.2-rc.1 and v1.11.2-rc.2) - see the sketch after this list.
  2. Use the latest images:
     - latest NSMgr https://github.com/networkservicemesh/cmd-nsmgr/pkgs/container/ci%2Fcmd-nsmgr/156648018?tag=3e4be55
     - latest NSE https://github.com/networkservicemesh/cmd-nse-icmp-responder/pkgs/container/ci%2Fcmd-nse-icmp-responder/156651169?tag=85732d5
     - Note: if you're using a custom NSE, make sure that you have the patch for point2pointipam https://github.com/networkservicemesh/sdk/pull/1562
     - Also, use a subnet with a capacity of more than 2 addresses (for example, /30 or /29) on the NSE to prevent https://github.com/networkservicemesh/sdk/issues/1020
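
For option 1, a minimal sketch of enabling datapath healing on the NSC (the deployment name and namespace are examples, and the exact environment variable name should be checked against the config.go linked above - earlier in this thread it was referred to as NSM_LIVENESSCHECK):

kubectl set env -n ns-kernel2ethernet2kernel deployment/nsc-kernel NSM_LIVENESSCHECK=true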

The forwarder instance was restarted on node 6 at 2023-12-04 22:34:08 which took 32 seconds, meaning the new one came up around 22:34:40.

I feel the problem might be related to incorrect k8s signal handling (https://medium.com/@tasdikrahman/handling-signals-for-applications-running-in-kubernetes-dc6537f9b542). Forwarder-vpp (https://github.com/networkservicemesh/cmd-forwarder-vpp/pkgs/container/cmd-forwarder-vpp/153981348?tag=v1.11.2-rc.2) restarts in 3-4 seconds on k8s clusters.

Could you please double-check that your images handle k8s OS signals correctly?

szvincze commented 10 months ago

Hi @denis-tingaikin,

Thanks for your comment.

As we discussed, only the second option is feasible for us right now.

The forwarder instance was restarted on node 6 at 2023-12-04 22:34:08 which took 32 seconds, meaning the new one came up around 22:34:40.

I feel the problem might be related to incorrect k8s signal handling (https://medium.com/@tasdikrahman/handling-signals-for-applications-running-in-kubernetes-dc6537f9b542). Forwarder-vpp (https://github.com/networkservicemesh/cmd-forwarder-vpp/pkgs/container/cmd-forwarder-vpp/153981348?tag=v1.11.2-rc.2) restarts in 3-4 seconds on k8s clusters.

Could you please double-check that your images handle k8s OS signals correctly?

Don't get me wrong. It is true that the forwarder pod itself restarts in a few seconds, but the traffic instance does not reconnect until after the period I mentioned. When I wrote that the forwarder instance performed its restart in 32 seconds, I meant the time between the test case ordering the restart and the new forwarder coming up and running properly. We don't see a problem with this behavior; only the traffic recovery is in question.

denis-tingaikin commented 10 months ago

Don't get me wrong. It is true that the forwarder pod itself restarts in a few seconds, but the traffic instance does not reconnect until after the period I mentioned. When I wrote that the forwarder instance performed its restart in 32 seconds, I meant the time between the test case ordering the restart and the new forwarder coming up and running properly. We don't see a problem with this behavior; only the traffic recovery is in question.

I see. We expect that these problems have been fixed in the release: https://github.com/networkservicemesh/deployments-k8s/releases/tag/v1.11.2-rc.3

Let us know about the results, and if something goes wrong, please share logs from nsc/nsmgr/forwarder/nse. Thanks!

szvincze commented 10 months ago

Hi @denis-tingaikin,

Here I attach the logs from the test with NSM v1.11.2-rc.3. In this case the issue affected nsc-774cd45840fvcm7. Both interfaces nsm-ipv4 and nsm-ipv6 were missing.

As we discussed via slack, we reverted to RC2 since it showed better results.

denis-tingaikin commented 10 months ago

As I can see from the logs, we have a problem with the registry related to dial:

 failed to dial nsm-registry-svc:5002: context deadline exceeded

I'd like to suggest re-testing rc.3 with two options (a sketch of both changes follows the list):

  1. Increase NSM_DIAL_TIMEOUT from 100ms to 300ms for nsmgr.
  2. Increase the upper bound of the nsmgr CPU limits by about 1.25x.
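
A sketch of both adjustments (the namespace, DaemonSet and container names are assumptions based on the default deployments-k8s manifests, and 500m only illustrates "current limit x 1.25" assuming a 400m limit):

kubectl set env -n nsm-system daemonset/nsmgr NSM_DIAL_TIMEOUT=300ms
kubectl patch -n nsm-system daemonset/nsmgr --type=json -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/cpu","value":"500m"}]'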

In parallel, I'll try to reproduce the problem and make sure that we don't have other issues that may cause the problem with dial.

denis-tingaikin commented 10 months ago

FYI: Also, if you are restarting the node that contains the spire-server, please use this patch to avoid problems with dialling the registry: https://github.com/networkservicemesh/deployments-k8s/pull/10287

denis-tingaikin commented 8 months ago

Current status:

  1. Based on the latest testing results, we have considered adding a retry for failed refresh requests: https://github.com/networkservicemesh/sdk/issues/1565.
  2. We're waiting for logs from running v1.12.0-rc.1 and v1.12.0-rc.2.
  3. As soon as we complete p.1, we will schedule a new release, v1.12.1-rc.1.

/cc @edwarnicke , @szvincze

szvincze commented 8 months ago

Hi @denis-tingaikin,

  1. We're waiting for logs from running v1.12.0-rc.1 and v1.12.0-rc.2.

Please find the logs for v1.12.0-rc.1.

Some background about what happened in this test.

Traffic started on the problematic connection at 2024-01-11T09:26:37.541, then forwarder-vpp-cpqz7 was deleted at 2024-01-11T09:26:40.600. The new forwarder was ready at 2024-01-11T09:26:49.

After the forwarder was deleted, this connection did not get packets for almost 10 minutes.

Additional information on the problematic NSE:

[2024-01-11T09:26:34.518Z] DEBUG: Interfaces in nse-ipv4-c47b5df69-hv56b:
[2024-01-11T09:26:34.518Z] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
[2024-01-11T09:26:34.518Z]     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2024-01-11T09:26:34.518Z]     inet 127.0.0.1/8 scope host lo
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 ::1/128 scope host 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z] 3: eth0@if518: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default 
[2024-01-11T09:26:34.518Z]     link/ether 12:0d:9b:ef:79:cb brd ff:ff:ff:ff:ff:ff link-netnsid 0
[2024-01-11T09:26:34.518Z]     inet 192.168.127.208/32 scope global eth0
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fd00:eccd:18:ffff:a7ff:3fda:6f8a:7fd0/128 scope global 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fe80::100d:9bff:feef:79cb/64 scope link 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z] 4: icmp-respo-58d2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1446 qdisc mq state UNKNOWN group default qlen 1000
[2024-01-11T09:26:34.518Z]     link/ether 02:fe:6a:2d:70:b4 brd ff:ff:ff:ff:ff:ff
[2024-01-11T09:26:34.518Z]     inet 172.16.1.96/32 scope global icmp-respo-58d2
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fe80::fe:6aff:fe2d:70b4/64 scope link 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z] 5: icmp-respo-8db4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN group default qlen 1000
[2024-01-11T09:26:34.518Z]     link/ether 02:fe:e2:42:cf:42 brd ff:ff:ff:ff:ff:ff
[2024-01-11T09:26:34.518Z]     inet 172.16.1.98/32 scope global icmp-respo-8db4
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fe80::fe:e2ff:fe42:cf42/64 scope link 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z] 6: icmp-respo-71f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1446 qdisc mq state UNKNOWN group default qlen 1000
[2024-01-11T09:26:34.518Z]     link/ether 02:fe:d0:e0:7a:9d brd ff:ff:ff:ff:ff:ff
[2024-01-11T09:26:34.518Z]     inet 172.16.1.100/32 scope global icmp-respo-71f0
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fe80::fe:d0ff:fee0:7a9d/64 scope link 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z] 7: icmp-respo-47c5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1446 qdisc mq state UNKNOWN group default qlen 1000
[2024-01-11T09:26:34.518Z]     link/ether 02:fe:12:32:a5:af brd ff:ff:ff:ff:ff:ff
[2024-01-11T09:26:34.518Z]     inet 172.16.1.102/32 scope global icmp-respo-47c5
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fe80::fe:12ff:fe32:a5af/64 scope link 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever

The problematic connection was to the traffic server address [100:100::1]:5003 in the NSC.

glazychev-art commented 8 months ago

Hi @szvincze Could you please clarify whether the problem was with IPv4 or IPv6? I'm asking because you attached the output from nse-ipv4-c47b5df69-hv56b but then mentioned 100:100::1.

szvincze commented 8 months ago

Could you please clarify whether the problem was with IPv4 or IPv6? I'm asking because you attached the output from nse-ipv4-c47b5df69-hv56b but then mentioned 100:100::1.

The problem in this case was with IPv6 and most probably I got the IPv4 output by mistake. However, it is not specific to IPv4 or IPv6; it happens with both in different reproduction attempts.

zolug commented 8 months ago

Hi, here are some details about the network setup in use, where the default NSM datapath healing cannot be used. Our problem is twofold: on the one hand it is related to a Linux bridge, and on the other hand to capability restrictions in certain NSCs.

1., Consider the following setup: NSC --- NSE/NSC --- NSE

In the NSE/NSC POD there's a linux bridge (with an IPv4/IPv6 address) to which the nsm interface(s) will be attached. (IPs are assigned from the same IP network for both the NSM interfaces and the bridge.)

Ping between the NSM endpoint IPs belonging to the same NSM connection might not work - or rather, won't work in our case:

IPv4: Due to hardening, arp_ignore is set to 1 by default in PODs, preventing ARP resolution from working with the bridge setup. This can be easily resolved by setting arp_ignore=0 in the NSC/NSE POD.
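
A sketch of that adjustment (the kernel uses the maximum of the "all" and per-interface values; bridge0 matches the bridge name used in the IPv6 example below):

sysctl -w net.ipv4.conf.all.arp_ignore=0
sysctl -w net.ipv4.conf.bridge0.arp_ignore=0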

IPv6: https://www.linuxquestions.org/questions/linux-networking-3/how-do-i-enable-proxy-ndp-proxy-arp-works-933174/

This requires proxy_ndp to be enabled, but it is more complicated than IPv4, since proxy NDP must be handled on a per-address basis (for example by using an additional chain element to add/remove NDP entries to/from the ND table upon NSM Request/Close). For example:

sysctl -w net.ipv6.conf.bridge0.proxy_ndp=1
ip -6 neigh add proxy fd00::5 dev bridge0

These "adjustments" for IPv4 and IPv6 could be skipped for example by using the bridge IP instead for the datapath monitoring. (Either as src or dst depending which link we consider.)

2., Sending ping as part of datapath monitoring is not feasible from the leftmost NSC in our case as it belongs to the application.

In order to send an ICMP echo request, you need either a RAW socket or a datagram socket (UDP ping, often referred to as unprivileged ping). To create a RAW socket the NET_RAW privilege is required, while for the UDP ping the GID of the process must be authorized to create such special datagram sockets. This is achieved through the ping_group_range sysctl.

ping_group_range - 2 INTEGERS Restrict ICMP_PROTO datagram sockets to users in the group range. The default is "1 0", meaning, that nobody (not even root) may create ping sockets. Setting it to "100 100" would grant permissions to the single group. "0 4294967294" would enable it for the world, "100 4294967294" would enable it for the users, but not daemons.
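
For illustration only, opening unprivileged ping up to every group would look like this (a sketch using the "world" range from the quote above - exactly the kind of requirement we cannot impose on the application POD):

sysctl -w net.ipv4.ping_group_range="0 4294967294"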

Unfortunately we are not allowed to impose requirements involving capabilities or the ping_group_range sysctl on the application POD. I had an idea to basically mimic port scanners, that is, to replace the datapath monitor with a UDP sender and rely on ICMP port unreachable to check peer availability. Of course the port in use would have to be coordinated. (But it would scale well as the number of connected NSCs increases, without imposing additional user-space load on the NSE.)

glazychev-art commented 8 months ago

Thanks @zolug ,

  1. Do I understand the topology correctly? Please correct me if I'm wrong (see the attached "bridge" picture). Question: is the connection between the left NSC and the right NSE one NSM connection? Or is it split into 2: NSC --- NSE/ + /NSC --- NSE?

  2. Actually we already do the ping in unprivileged mode: https://github.com/networkservicemesh/sdk-kernel/blob/main/pkg/kernel/tools/heal/liveness_check.go#L154-L162 And you are right, we need to set ping_group_range for that. The forwarder uses this chain element to set it on NSCs: https://github.com/networkservicemesh/sdk-kernel/tree/main/pkg/kernel/networkservice/connectioncontextkernel/pinggrouprange

zolug commented 8 months ago

@glazychev-art

1., They are 2 separate connections. (In fact there could be multiple NSCs on the left and multiple NSEs on the right interconnected by the same central NSE/NSC POD.)

I mostly highlighted the setup to emphasize that in our case there could be 2 "types" of datapath monitoring use cases: when the bridge is in the NSE and when the bridge is in the NSC. (So, for example, if I currently wanted to make datapath pings work, one way would be to replace either the src or dst of the pings with the bridge IPs (IPv4 and IPv6) depending on the type of the link.)

2., Hmm, I wasn't aware of this chain element enabling unprivileged ping for all group ids.

glazychev-art commented 8 months ago

@zolug Got it, thanks The picture I sent is about right, right?

If you have multiple NSC and NSE, do you use the same bridge, or is a unique one created for each pair?

zolug commented 8 months ago

@zolug Got it, thanks The picture I sent is about right, right?

If you have multiple NSC and NSE, do you use the same bridge, or is a unique one created for each pair?

@glazychev-art Yes, your picture is about right (although the connections and the bridge have IPv6 addresses as well, again from the same subnet). There's one common bridge for all.

glazychev-art commented 8 months ago

@zolug How is an IP address assigned to the bridge? Are you doing this manually? Could you please tell me why bridge has an address?

glazychev-art commented 8 months ago

@zolug Another question: Do the logs attached here represent this setup or something else?

zolug commented 8 months ago

@zolug How is an IP address assigned to the bridge? Are you doing this manually? Could you please tell me why bridge has an address?

@glazychev-art Yes, the bridge IP is configured manually. (But the address is assigned by a custom IPAM functionality, just like for the NSM connections.)

The reason why the bridge has an IP address is mostly historical and is related to the "traffic" issues affecting L2 address resolution explained in the 1st point of this comment: https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/664#issuecomment-1941401091. Some more background: in the left NSC the NSM interface(s) will have additional virtual IPs configured (/32 IPv4 and /128 IPv6) that are used by application traffic. Thus, the left NSC applies Policy Based Routes to point traffic from such virtual IPs to the proper next-hop IP (the bridge IP) of the NSC/NSE POD.
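
Roughly, such a Policy Based Route in the left NSC amounts to something like the following; the addresses are purely illustrative (borrowed from the PolicyRoute example later in this thread: VIP 20.0.0.1/32, bridge IP 172.16.3.1), and in practice these routes come from the policies in the connection's IP context rather than being added by hand:

ip rule add from 20.0.0.1/32 lookup 100
ip route add default via 172.16.3.1 table 100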

zolug commented 8 months ago

@zolug Another question: Do the logs attached here represent this setup or something else?

I don't think they are related.

glazychev-art commented 8 months ago

Thanks @zolug It seems to me that your case is similar to the vl3 network. Have you considered using vl3 endpoints? https://github.com/networkservicemesh/deployments-k8s/tree/v1.12.1-rc.1/examples/features/vl3-basic

glazychev-art commented 8 months ago

@zolug Actually, I still don't fully understand the setup. You mentioned that the NSC has VIP addresses. Does the diagram end up looking like this (IPv6 omitted)? See the attached "topology2" picture. It would be great if you could correct me, or provide logs illustrating the case. Thanks

zolug commented 8 months ago

@glazychev-art Hi, the NSM interfaces and the bridge are supposed to have an IP from the same /24 subnet, while the NSM interface in the left NSC POD additionally has (at least one) /32 VIP address (not part of the above subnet).

The NSC --- NSE/NSC --- NSE setup is a simplified version of our architecture. Here's a (slightly outdated) picture better depicting our architecture. For simplicity we could assume that the right NSE also has a VIP address through which it communicates with the left NSC's VIP. (So, for traffic from the left NSC, the NSE/NSC would act as a router selecting the right NSE. And for traffic from the right NSE, the right NSE would be aware of the left NSC's address and would employ a route with that IP as next hop, while the NSE/NSC acted as a bridge.)

Also, IMHO all the delicate details are not that important; my goal was mainly to illustrate that a bridge could cause pings involving its slave port IPs to fail without proper arp_ignore and proxy NDP tweaks. Probably a simple NSC --- NSE setup, where either of the two PODs could host a bridge to which the NSM interface is attached, would have been sufficient to present our concerns with the default datapath ping.

Btw, I'm not expecting any changes in NSM to address this use case, I just figured it could be worth sharing with you. (A custom chain element could cover the proxy NDP entries, although in the past NSM connection close wasn't really reliable, so I was instead thinking about a custom datapath ping using the bridge IP. But using the bridge IP in the left NSC (if we look at the picture) would be cumbersome, since the IP would not be known from the start; instead it would eventually be learnt via the NSM connection's IP context.)

glazychev-art commented 8 months ago

@zolug Got it, thanks Is it possible for you to send logs from the left NSC, excluding all private information?

zolug commented 7 months ago

@zolug Got it, thanks Is it possible for you to send logs from the left NSC, excluding all private information?

I'm a bit puzzled about what logs you might be after. The details around the bridge are not (or only loosely) connected to the original issue; I merely added this information here because I was asked to do so. We do not enable datapath monitoring because, by default, the bridge network setup breaks the communication in the cases I described. If it were enabled, heal would constantly run into a failed datapath.

glazychev-art commented 7 months ago

@zolug Let me explain why I asked this. Perhaps I don't fully understand your configuration. We would just like to know what information the NSC has after the Request. For example, what is in its src/dst ips/routes.

Based on this picture, there are still a few questions. For example, consider the connection between Target-1 -- Proxy-A (that is, nsc486 and nse475). Both of these interfaces have IP addresses (172.16.0.2/24 and 172.16.0.3/24). Why can't we use these IPs to check the datapath (without reaching the bridge)? Or are these addresses also assigned manually, just like the address for the bridge, and we cannot start checking the datapath immediately after the Request?

Also the question is at what point the client receives IP 20.0.0.1. Does this mean that immediately after the Request the connection is unavailable because 172.16.0.3/24 is not manually configured?

Perhaps I misunderstood you. But it seems to me that it would be wrong to check the entire connection at once (for example Target-1 -- load-balancer-A). Because in fact these are 2 NSM connections that need to be checked separately (Target-1 (nsc486) -- Proxy-A (nse475) and Proxy-A (nsc486) -- load-balancer-A (nse475)). And as far as I understand, in this case the bridge is not involved. Am I missing something?

Enabling datapath healing would allow us to more reliably solve the healing problem, I think

zolug commented 7 months ago

@glazychev-art

Why can't we use these IPs to check the datapath (without reaching the bridge)? Or are these addresses also assigned manually, just like the address for the bridge, and we cannot start checking the datapath immediately after the Request?

The reason is described in https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/664#issuecomment-1941401091: the nse475 interface is moved into the Linux bridge right after its creation (by us). Assuming net.ipv4.conf.default.arp_ignore is 1, the bridge will inherit the same arp_ignore value of 1 when created, thus ARP resolution will break for slave port IPs: the bridge considers all incoming packets as if it were the receiving interface, and thus, due to arp_ignore=1, only answers if the IP is configured on the bridge interface (it will also forward the request through its other slave ports, but that won't help our case). In the case of IPv6, L2 address resolution fails by default (irrespective of any sysctl values AFAIK), and proxy NDP entries have to be set up to work around it.

The IPs in the example (172.16.0.2/24 and 172.16.0.3/24) are assigned via the NSE logic of the Proxy similar to what an example NSE in NSM repo would do. So, they are available once the initial Request returns on the NSC side.

But it seems to me that it would be wrong to check the entire connection at once (for example Target-1 -- load-balancer-A).

That's correct. (There can be and normally should be multiple paths available in each Proxy POD to send a packet towards a load-balancer POD.)

And as far as I understand, in this case the bridge is not involved. Am I missing something?

Bridge is still involved, but multi-path routing will decide which slave interface the packet should be sent out towards a load-balancer.

Also the question is at what point the client receives IP 20.0.0.1.

It's received separately by our left NSC (Target) through an independent "communication channel", upon which the NSC will update the already established connection by sending out an updated Request with a modified IPContext (20.0.0.1 will be appended to src_ip_addrs). (The Request will also contain a matching networkservice.PolicyRoute to point traffic sent from 20.0.0.1 to the bridge IP as the next hop IP.) Such IPs might come and go on the fly. Also, in our case these IPs are of no interest when it comes to datapath monitoring (they are used in end-to-end communication).

Example logs: here's an example of what is returned as the connection on an initial Request (in a Target-X). (The bridge IPs are already passed as extra_prefixes by the Proxy, so we could create a custom datapath checker relying on these IPs, or alternatively play with the arp_ignore sysctl and proxy NDP entries as described before.)

"connection":"id:\"target-a-646b8df765-2p5dh-proxy.load-balancer-a1.trench-a.default-0\"  network_service:\"proxy.load-balancer-a1.trench-a.default\"  mechanism:{cls:\"LOCAL\"  type:\"KERNEL\"  parameters:{key:\"inodeURL\"  value:\"file:///proc/thread-self/ns/net\"}  parameters:{key:\"name\"  value:\"nsm-0\"}}  context:{ip_context:{src_ip_addrs:\"172.16.3.4/24\"  src_ip_addrs:\"fd00:0:0:3::4/64\"  dst_ip_addrs:\"172.16.3.5/24\"  dst_ip_addrs:\"fd00:0:0:3::5/64\"  excluded_prefixes:\"10.96.0.0/16\"  excluded_prefixes:\"10.244.0.0/16\"  extra_prefixes:\"172.16.3.1/24\"  extra_prefixes:\"fd00:0:0:3::1/64\"}  MTU:1500}  labels:{key:\"nodeName\"  value:\"kind-worker4\"}  path:{path_segments:{name:\"target-a-646b8df765-2p5dh\"  id:\"target-a-646b8df765-2p5dh-proxy.load-balancer-a1.trench-a.default-0\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL25zbWdyLXNhIl0sImV4cCI6MTcwOTA0MjM4OH0.HYz16vOYG95a942lhw4cTRiiGyFalG0cxJdjJ7Sw3tR6fxE9kxGi6gBbKFxkglDeB97JjFc1P1l46ZcYSrjlZw\"  expires:{seconds:1709040767}}  path_segments:{name:\"nsmgr-c6gdx\"  id:\"721c438c-974f-486a-ad96-c96ccafad63a\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkwfQ.8ALaT7H7TJyDFRenwiuhwdqKT092lqgCWgWEoaKetfjbp6Tv0MXLBDSnZQpX0bN0KRDnB7yyDytxqNuZhLNaTg\"  expires:{seconds:1709039390  nanos:287079298}}  path_segments:{name:\"forwarder-vpp-7sg7s\"  id:\"c7d58de5-0980-49aa-9c89-0e4da18e4510\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkwfQ.zQwfh1wEpOFT3zK3LCKNz9R7tSxQRWnrb7vlHoVhlOkLB2ozTDL8tA3TMrnJ2Kme5K17331FSUMi6hcdyQscvw\"  expires:{seconds:1709039390  nanos:475315614}  metrics:{key:\"client_drops\"  value:\"1\"}  metrics:{key:\"client_interface\"  value:\"VIRTIO/tap0\"}  metrics:{key:\"client_rx_bytes\"  value:\"110\"}  metrics:{key:\"client_rx_packets\"  value:\"1\"}  metrics:{key:\"client_tx_bytes\"  value:\"0\"}  metrics:{key:\"client_tx_packets\"  value:\"0\"}  metrics:{key:\"server_drops\"  value:\"0\"}  metrics:{key:\"server_interface\"  value:\"VIRTIO/tap7\"}  metrics:{key:\"server_rx_bytes\"  value:\"0\"}  metrics:{key:\"server_rx_packets\"  value:\"0\"}  metrics:{key:\"server_tx_bytes\"  value:\"0\"}  metrics:{key:\"server_tx_packets\"  value:\"0\"}}  path_segments:{name:\"proxy-load-balancer-a1-4rlss\"  id:\"eb70bcf6-70d9-4fbd-8eb1-fca7ae12ae15\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkwfQ.Id73fQJp_ZhQH0iFvaj8ULWSSfz_Ad4jN6v7h4p5XKqOa68xe0I2aBzzVFQuFJ-39imUFd2AQaD7OL6cuOOPLw\"  expires:{seconds:1709039390  nanos:476074482}}}  network_service_endpoint_name:\"proxy-load-balancer-a1-4rlss\"  payload:\"ETHERNET\""

And here's how it looks once an established NSM connection is updated with VIP addresses (20.0.0.1/32, 10.0.0.1/32, 2000::1/128) and the associated Policy Routes:

{"severity":"debug","timestamp":"2024-02-27T12:59:54.275+00:00","service_id":"Meridio-tapa","message":"received eventIn: connections:{key:\"721c438c-974f-486a-ad96-c96ccafad63a\"  value:{id:\"721c438c-974f-486a-ad96-c96ccafad63a\"  network_service:\"proxy.load-balancer-a1.trench-a.default\"  mechanism:{cls:\"LOCAL\"  type:\"KERNEL\"  parameters:{key:\"inodeURL\"  value:\"inode://4/4026537114\"}  parameters:{key:\"name\"  value:\"nsm-0\"}}  context:{ip_context:{src_ip_addrs:\"172.16.3.4/24\"  src_ip_addrs:\"fd00:0:0:3::4/64\"  src_ip_addrs:\"2000::1/128\"  src_ip_addrs:\"20.0.0.1/32\"  src_ip_addrs:\"10.0.0.1/32\"  dst_ip_addrs:\"172.16.3.5/24\"  dst_ip_addrs:\"fd00:0:0:3::5/64\"  excluded_prefixes:\"10.96.0.0/16\"  excluded_prefixes:\"10.244.0.0/16\"  extra_prefixes:\"172.16.3.1/24\"  extra_prefixes:\"fd00:0:0:3::1/64\"  policies:{from:\"2000::1/128\"  routes:{prefix:\"::/0\"  nextHop:\"fd00:0:0:3::1\"}}  policies:{from:\"20.0.0.1/32\"  routes:{prefix:\"0.0.0.0/0\"  nextHop:\"172.16.3.1\"}}  policies:{from:\"10.0.0.1/32\"  routes:{prefix:\"0.0.0.0/0\"  nextHop:\"172.16.3.1\"}}}  MTU:1500}  labels:{key:\"nodeName\"  value:\"kind-worker4\"}  path:{index:1  path_segments:{name:\"target-a-646b8df765-2p5dh\"  id:\"target-a-646b8df765-2p5dh-proxy.load-balancer-a1.trench-a.default-0\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL25zbWdyLXNhIl0sImV4cCI6MTcwOTA0MjM4OH0.VlhnpXkXW7AXlSCLfzTnvvL06N6_kUYFlkRWuzHVYEPnh713NmYUF8MaOV6JRDzqXmuZlZfcwjSjev7Ek6SyEQ\"  expires:{seconds:1709040767}}  path_segments:{name:\"nsmgr-c6gdx\"  id:\"721c438c-974f-486a-ad96-c96ccafad63a\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkyfQ.rj1WKymPqaIdv37FOXsCspeOlDfpDTb1fIGvRuWkDbiKgGRUe5W1tiWXWFDqN8c_Mu897QsXDrO63kmAEWvNgg\"  expires:{seconds:1709039392  nanos:718167533}}  path_segments:{name:\"forwarder-vpp-7sg7s\"  id:\"c7d58de5-0980-49aa-9c89-0e4da18e4510\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkzfQ.UUL0BFtKxkro3kvrTwaY8QSugCIz6uSH321Blr0XJjRuiLazcKv6OgrFoN1mF5gpD-fyulIWEUt9-ENW7elZAA\"  expires:{seconds:1709039393  nanos:91726945}  metrics:{key:\"client_drops\"  value:\"6\"}  metrics:{key:\"client_interface\"  value:\"VIRTIO/tap0\"}  metrics:{key:\"client_rx_bytes\"  value:\"796\"}  metrics:{key:\"client_rx_packets\"  value:\"6\"}  metrics:{key:\"client_tx_bytes\"  value:\"486\"}  metrics:{key:\"client_tx_packets\"  value:\"5\"}  metrics:{key:\"server_drops\"  value:\"0\"}  metrics:{key:\"server_interface\"  value:\"VIRTIO/tap7\"}  metrics:{key:\"server_rx_bytes\"  value:\"596\"}  metrics:{key:\"server_rx_packets\"  value:\"6\"}  metrics:{key:\"server_tx_bytes\"  value:\"0\"}  metrics:{key:\"server_tx_packets\"  value:\"0\"}}  path_segments:{name:\"proxy-load-balancer-a1-4rlss\"  id:\"eb70bcf6-70d9-4fbd-8eb1-fca7ae12ae15\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkzfQ.1XWCLyUBwVaL6-urM86WydMuHrbHk20_SnQ1QE-EzKhT2cn71_u8ZqpkqSG0QMVst-xUjB6kOWcqedm2zFwQVQ\"  expires:{seconds:1709039393  nanos:92324963}}}  
network_service_endpoint_name:\"proxy-load-balancer-a1-4rlss\"  payload:\"ETHERNET\"}}, err: <nil>","version":"1.0.0","extra_data":{"subsystem":"NSM","id":"target-a-646b8df765-2p5dh-proxy.load-balancer-a1.trench-a.default-0","type":"networkService","heal":"eventLoop"}}