szvincze opened this issue 3 weeks ago · Open
Could you try to use v1.14.0-rc.1?
It comes from a customer deployment. So far we have not managed to reproduce this situation in lab environments, but we are working on it. If it succeeds, then we will be able to try v1.14.0-rc.1.
@ljkiraly I found that healing could be triggered by deleting the network service. So you may run:

```
kubectl delete networkservicemesh $networkserviceName
```

If the endpoint has the option of self-service registration, then this is enough. Otherwise, just apply the service again after some period of time:

```
kubectl apply -f $networkserviceName
```
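The delete/wait/re-apply cycle above can be put into a small script. This is only a sketch: `KUBECTL` defaults to `echo kubectl` so it dry-runs without a cluster, and `NS_NAME`/`NS_MANIFEST` are placeholder names, not values from this issue.

```shell
#!/bin/sh
# Hypothetical sketch of the heal trigger described above. KUBECTL defaults to
# "echo kubectl" so this is a dry run; set KUBECTL=kubectl against a real cluster.
# NS_NAME and NS_MANIFEST are made-up placeholders.
KUBECTL="${KUBECTL:-echo kubectl}"
NS_NAME="${NS_NAME:-my-network-service}"
NS_MANIFEST="${NS_MANIFEST:-my-network-service.yaml}"

trigger_heal() {
    # Deleting the network service triggers healing on its connections.
    $KUBECTL delete networkservicemesh "$NS_NAME"
    # If the endpoint does not self-register, re-apply the service after a pause.
    sleep 1
    $KUBECTL apply -f "$NS_MANIFEST"
}

trigger_heal
```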
Unfortunately I cannot reproduce this issue in my environment (neither with NSM v1.13.2 nor with rc.2). The problem was detected yesterday in a test environment (NSM v1.13.2) and we got some logs which contain TRACE-level printouts.
The `dsc-fdr-7bdfc7b8b5-f7js4` pod cannot connect to the NS `proxy.vpn1-b.sc-diam.deascmqv02`. The connection request (id: 65102507-b8d3-4266-8633-421e569ee7c7) is blocked in `beginServer.Request()`, as you can see in the attached log: forwarder-vpp-sljd6_forwarder-vpp.txt
Once again: the failure popped up randomly during a long-running traffic case, where no elements were restarted.
I've managed to reproduce the problem on NSM v1.13.2, but it looks like the issue is gone on NSM v1.14.0-rc.2. Changes in `begin` fix it. But we've encountered a couple of new issues: continuous closes and sometimes `recvfd` freezes.
@NikitaSkrynnik This is good news! Can you describe the reproduction steps?
@ljkiraly, sure:

```
NSM_MAX_TOKEN_LIFETIME=3m
```
@NikitaSkrynnik, thank you. Based on this description I was able to get different connection errors with NSM v1.13.2. Sometimes only 6 NSCs connected successfully. After checking the forwarder logs, I have concerns about whether we are reproducing the same fault as originally reported. As I noted in a previous comment, the last TRACE-level log output related to a faulty connection was `beginServer.Request()` (https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/1161#issuecomment-2329371533). I repeated steps 2-4 five times and got many types of connection errors, but seemingly the forwarder always ran through `beginServer`. Maybe I have to be more patient and try further.
Please confirm whether you have seen any connection stuck in `beginServer` based on the forwarder logs.
@ljkiraly, you can also try the scripts I used: healbug copy.zip

Run `./run.sh` and, after it's finished, search for the string `policy failed` in the NSC's logs. I usually get this error on the 3rd or 4th iteration. This error usually means that `beginServer` is stuck.
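A minimal sketch of the log check, with a fabricated NSC log line so it runs without a cluster (the log text and file are made up; in a real run you would grep the pod's logs instead):

```shell
#!/bin/sh
# Hypothetical sketch of the log check above: fabricate an NSC log line and
# grep for the "policy failed" marker that indicates a stuck beginServer.
count_policy_failures() {
    grep -c 'policy failed' "$1"
}

logfile="$(mktemp)"
printf 'ERROR: token policy failed\n' > "$logfile"
# Against a real cluster this would be something like:
#   kubectl logs "$nscPod" | grep -c 'policy failed'
count_policy_failures "$logfile"
```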
@NikitaSkrynnik Could you please attach the results from our last internal test?
@denis-tingaikin, @ljkiraly after 30 runs on rc.6 I can't reproduce the error.
@NikitaSkrynnik And how often does it reproduce without the fix? In about 1-3 runs, right?
@denis-tingaikin, 3-5 runs usually
@NikitaSkrynnik Could you please also check it with rc.7?
We observed a problem with `forwarder-vpp`, which randomly starts ignoring requests for certain node-internal connections. For example, a refresh seems to fail randomly, which triggers the healing mechanism. As a part of it, in most cases even the `Close` is unsuccessful. Then, no matter how hard the heal tries to repair the connection, every request times out in `nsmgr`, because even though `forwarder-vpp` gets the requests, it does not handle them; for instance, it does not send them to the `NSE`.

So, there are request timeouts in `nsmgr` every 15 seconds (which come from the request timeout on the `NSC` side). However, in `forwarder-vpp` we only saw that the request was received 15 seconds earlier, but `forwarder-vpp` did nothing with it. This also results in a leaked interface, but in our opinion that is just a consequence and not the root cause of the problem.

It is important to mention that so far this behavior has been observed only for node-internal connections. Also note that the request does not reach `discover/discoverCandidateServer.Request()` in `forwarder-vpp`, where the first debug message would appear; it never did.

The situation recovers after a restart of `forwarder-vpp`.

The NSM version used is v1.13.2, while the `NSC`s use SDK version v1.11.2.

Unfortunately, the `DEBUG` logs (as they are in the used NSM release) are not sufficient to analyze the case more deeply. However, since the problem is encountered quite frequently (within a couple of days), the `TRACE` log level has been set on the system and we are waiting for a reproduction, so hopefully it will appear soon and I can add detailed logs to this issue.
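The symptom of a client-side deadline expiring against an unresponsive request path can be mimicked with a tiny shell experiment. This is only a sketch: the 15 s NSC timeout is shortened to 1 s, and `sleep` stands in for a forwarder that receives the request but never replies.

```shell
#!/bin/sh
# Sketch of the symptom: the request path hangs, so the client-side deadline
# is what eventually fails the request. 'sleep 10' stands in for the
# unresponsive forwarder; the report's 15 s NSC timeout is shortened to 1 s.
rc=0
timeout 1 sleep 10 || rc=$?
# GNU coreutils 'timeout' exits with 124 when the deadline expires,
# analogous to the timeout nsmgr observes every cycle.
echo "request result: $rc"
```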