networkservicemesh / cmd-forwarder-vpp

Faulty behavior of forwarder-vpp blocks the heal process #1161

Open · szvincze opened 3 weeks ago

szvincze commented 3 weeks ago

We observed a problem where forwarder-vpp randomly starts ignoring requests for certain node-internal connections. For example, a refresh seems to fail at random, which triggers the healing mechanism; as part of it, in most cases even the Close is unsuccessful. Then, no matter how hard heal tries to repair the connection, every request times out in nsmgr, because even though forwarder-vpp receives the requests, it does not handle them, for instance it never forwards them to the NSE.

So there are request timeouts in nsmgr every 15 seconds (which comes from the request timeout on the NSC side). In forwarder-vpp, however, we only saw that the request had been received 15 seconds earlier and that forwarder-vpp did nothing with it. This also results in a leaked interface, but in our opinion that is just a consequence and not the root cause of the problem.
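For context, the 15-second cadence matches the default request timeout on the NSC side. A minimal sketch of how it could be raised to make the heal cycle easier to observe, assuming the NSCs run cmd-nsc (which reads the NSM_REQUEST_TIMEOUT environment variable) as a deployment named nsc; the deployment and namespace names are illustrative:

# Illustrative: override the NSC-side request timeout (cmd-nsc defaults to 15s).
kubectl -n default set env deployment/nsc NSM_REQUEST_TIMEOUT=30s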

It is important to mention that so far this behavior has been observed only for node-internal connections. Also note that the request never reaches discover/discoverCandidateServer.Request() in forwarder-vpp, where the first debug message would be emitted; that message did not appear.

The situation recovers after a restart of forwarder-vpp.

The NSM version in use is v1.13.2, while the NSCs use SDK version v1.11.2.

Unfortunately the DEBUG logs (as shipped in the NSM release we use) are not sufficient to analyze the case more deeply. However, the problem seems to occur quite frequently (within a couple of days); the TRACE log level is now set on the system and we are waiting for a reproduction, so hopefully it will appear soon and I can add detailed logs to this issue.
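For reference, a sketch of how TRACE could be enabled on the forwarder, assuming it honors the standard NSM_LOG_LEVEL environment variable and runs as a daemonset named forwarder-vpp in an nsm-system namespace (both names are deployment-specific assumptions):

# Illustrative: raise the forwarder log level; the pods restart with TRACE enabled.
kubectl -n nsm-system set env daemonset/forwarder-vpp NSM_LOG_LEVEL=TRACE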

denis-tingaikin commented 3 weeks ago

Could you try to use v1.14.0-rc.1?

szvincze commented 3 weeks ago

It occurs on a customer deployment. So far we have not managed to reproduce this situation in lab environments, but we are working on it. If we succeed, we will be able to try v1.14.0-rc.1.

denis-tingaikin commented 2 weeks ago

@ljkiraly I found that healing could be triggered by deleting the network service. 

So you may run:

kubectl delete networkservice $networkserviceName

If the endpoint has the option of self-registering the network service, then deleting it is enough. Otherwise, just re-apply the service manifest after some period of time:

kubectl apply -f $networkserviceName
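Putting the two commands together, a minimal heal-trigger sketch, assuming $networkserviceName is both the NetworkService name and its manifest file (an illustrative assumption):

# Illustrative loop: delete the service to force heal, then re-apply it.
kubectl delete networkservice $networkserviceName
sleep 60   # leave time for refreshes to fail and healing to kick in
kubectl apply -f $networkserviceName
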
ljkiraly commented 1 week ago

Unfortunately I cannot reproduce this issue in my environment (neither with NSM v1.13.2 nor with rc.2). The problem was detected yesterday in a test environment (NSM v1.13.2) and we got some logs which contain TRACE level printouts.

"dsc-fdr-7bdfc7b8b5-f7js4" pod can not connect to NS: "proxy.vpn1-b.sc-diam.deascmqv02". The connection request (id:65102507-b8d3-4266-8633-421e569ee7c7) is blocked in beginServer.Request() as you can see in the attached log: forwarder-vpp-sljd6_forwarder-vpp.txt

Once again: the failure popped up at random during a long-running traffic case, in which no elements were restarted.

NikitaSkrynnik commented 1 week ago

I've managed to reproduce the problem on NSM v1.13.2, but it looks like the issue is gone on NSM v1.14.0-rc.2. The changes in begin fix it. However, we've encountered a couple of new issues: continuous Closes, and recvfd sometimes freezes.

ljkiraly commented 1 week ago

@NikitaSkrynnik This is good news! Can you describe the reproduction steps?

NikitaSkrynnik commented 1 week ago

@ljkiraly, sure

  1. Create a 1-node cluster
  2. Set up NSM (basic setup, v1.13.2)
  3. Deploy 1 NSE
  4. Deploy 40 NSCs with NSM_MAX_TOKEN_LIFETIME=3m (see the sketch after this list)
  5. Wait until some NSCs start to get errors on refreshes. Sometimes I don't get any errors, so I repeat steps 2-4 again. Usually after repeating them 3-4 times I get the error.
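A minimal sketch of step 4, assuming the NSCs run cmd-nsc (which reads NSM_MAX_TOKEN_LIFETIME) as a deployment named nsc; the deployment and namespace names are illustrative:

# Illustrative: a short token lifetime forces frequent refreshes; then scale to 40 NSCs.
kubectl -n default set env deployment/nsc NSM_MAX_TOKEN_LIFETIME=3m
kubectl -n default scale deployment/nsc --replicas=40
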
ljkiraly commented 6 days ago

@NikitaSkrynnik, thank you. Based on this description I was able to get various connection errors with NSM v1.13.2. Sometimes only 6 NSCs connected successfully. After checking the forwarder logs, I have concerns about whether we are reproducing the same fault as originally reported. As I noted in the previous comment, the last TRACE level log output related to a faulty connection was "beginServer.Request()" (https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/1161#issuecomment-2329371533). I repeated steps 2-4 five times and got many types of connection errors, but seemingly the forwarder always ran through beginServer. Maybe I have to be more patient and keep trying.

Please confirm whether you have seen any connection stuck in 'beginServer' based on the forwarder logs.

NikitaSkrynnik commented 6 days ago

@ljkiraly, you can also try the scripts I used: healbug copy.zip

Run the ./run.sh command and, after it finishes, search for the string "policy failed" in the NSCs' logs. Usually I get this error on the 3rd or 4th iteration. This error usually means that beginServer is stuck.
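A quick way to scan all NSC pods for that string (the app=nsc label selector is an assumption about the deployment):

# Illustrative: report every NSC pod whose log contains "policy failed".
for pod in $(kubectl get pods -l app=nsc -o jsonpath='{.items[*].metadata.name}'); do
  kubectl logs "$pod" | grep -q "policy failed" && echo "policy failed seen in $pod"
done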

denis-tingaikin commented 20 hours ago

@NikitaSkrynnik Could you please attach the results from our last internal test?

NikitaSkrynnik commented 19 hours ago

@denis-tingaikin, @ljkiraly after 30 runs on rc.6 I can't reproduce the error

denis-tingaikin commented 19 hours ago

@NikitaSkrynnik And how often does it reproduce without the fix? It's about every 1-3 runs, right?

NikitaSkrynnik commented 18 hours ago

@denis-tingaikin, 3-5 runs usually

denis-tingaikin commented 16 hours ago

@NikitaSkrynnik Could you please also check it with rc.7?