networkservicemesh / deployments-k8s


TestRemote_nsmgr_restart/TestRunHealSuite is unstable on GKE #2800

Closed · denis-tingaikin closed this issue 2 years ago

denis-tingaikin commented 3 years ago

Description

The scenario https://github.com/networkservicemesh/deployments-k8s/tree/main/examples/heal/remote-nsmgr-restart#remote-nsmgr-restart is unstable on CI.
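
For context, the scenario (per the linked README) brings up an NSC and an NSE on different nodes, checks the ping, restarts the nsmgr on the NSE's node, and checks that the connection heals. Roughly, the restart step looks like this (a sketch only; pod and namespace names are placeholders, the README has the exact commands):

NSE_NODE=$(kubectl get pod ${NSE} -n ${NAMESPACE} --template '{{.spec.nodeName}}')
NSMGR=$(kubectl get pods -l app=nsmgr --field-selector spec.nodeName=${NSE_NODE} -n nsm-system --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
kubectl delete pod ${NSMGR} -n nsm-system                            # restart the remote (NSE-side) nsmgr
kubectl wait --for=condition=ready --timeout=1m pod -l app=nsmgr -n nsm-system
kubectl exec ${NSC} -n ${NAMESPACE} -- ping -c 4 172.16.1.100        # the connection should heal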

Steps to reproduce

  1. Run TestRunHealSuite a few times on GKE (it can probably also be reproduced on kind); see the sketch below.
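
One way to repeat it (a sketch; the test package path is an assumption, point go test at the repository that defines TestRunHealSuite, e.g. integration-k8s-gke, with KUBECONFIG pointing at the GKE cluster):

for i in $(seq 1 10); do
  go test -count=1 -timeout 30m -run 'TestRunHealSuite/TestRemote_nsmgr_restart' ./... || break
done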

Logs

Execution attempt: 0 Output file: .tests/cloud_test/gke-1/047-TestRunHealSuite1-TestRemote_nsmgr_restart-run.log
=== RUN   TestRunHealSuite/TestRemote_nsmgr_restart
time=2021-08-30T09:26:08Z level=info msg=NAMESPACE=($(kubectl create -f ../namespace.yaml)[0])
NAMESPACE=${NAMESPACE:10} TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:09Z level=info msg=kubectl exec -n spire spire-server-0 -- \
/opt/spire/bin/spire-server entry create \
-spiffeID spiffe://example.org/ns/${NAMESPACE}/sa/default \
-parentID spiffe://example.org/ns/spire/sa/spire-agent \
-selector k8s:ns:${NAMESPACE} \
-selector k8s:sa:default TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:09Z level=info msg=Entry ID         : b8204588-3988-49bb-83dd-14adf50ee9bd
SPIFFE ID        : spiffe://example.org/ns/ns-ghmmq/sa/default
Parent ID        : spiffe://example.org/ns/spire/sa/spire-agent
Revision         : 0
TTL              : default
Selector         : k8s:ns:ns-ghmmq
Selector         : k8s:sa:default TestRunHealSuite/TestRemote_nsmgr_restart=stdout
time=2021-08-30T09:26:09Z level=info msg=NODES=($(kubectl get nodes -o go-template='{{range .items}}{{ if not .spec.taints  }}{{index .metadata.labels "kubernetes.io/hostname"}} {{end}}{{end}}')) TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:09Z level=info msg=cat > kustomization.yaml <<EOF
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: ${NAMESPACE}

bases:
- ../../../apps/nsc-kernel
- ../../../apps/nse-kernel

patchesStrategicMerge:
- patch-nsc.yaml
- patch-nse.yaml
EOF TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:09Z level=info msg=cat > patch-nsc.yaml <<EOF
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nsc-kernel
spec:
  template:
    spec:
      containers:
        - name: nsc
          env:
            - name: NSM_REQUEST_TIMEOUT
              value: 45s
            - name: NSM_NETWORK_SERVICES
              value: kernel://icmp-responder/nsm-1
      nodeSelector:
        kubernetes.io/hostname: ${NODES[0]}
EOF TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:09Z level=info msg=cat > patch-nse.yaml <<EOF
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nse-kernel
spec:
  template:
    spec:
      containers:
        - name: nse
          env:
            - name: NSM_CIDR_PREFIX
              value: 172.16.1.100/30
      nodeSelector:
        kubernetes.io/hostname: ${NODES[1]}
EOF TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:09Z level=info msg=kubectl apply -k . TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:11Z level=info msg=deployment.apps/nsc-kernel created
deployment.apps/nse-kernel created TestRunHealSuite/TestRemote_nsmgr_restart=stdout
time=2021-08-30T09:26:11Z level=info msg=kubectl wait --for=condition=ready --timeout=1m pod -l app=nsc-kernel -n ${NAMESPACE} TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:15Z level=info msg=pod/nsc-kernel-749f5cd847-gcnbx condition met TestRunHealSuite/TestRemote_nsmgr_restart=stdout
time=2021-08-30T09:26:15Z level=info msg=kubectl wait --for=condition=ready --timeout=1m pod -l app=nse-kernel -n ${NAMESPACE} TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:15Z level=info msg=pod/nse-kernel-7475549f6-tgf7m condition met TestRunHealSuite/TestRemote_nsmgr_restart=stdout
time=2021-08-30T09:26:15Z level=info msg=NSC=$(kubectl get pods -l app=nsc-kernel -n ${NAMESPACE} --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}') TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:16Z level=info msg=NSE=$(kubectl get pods -l app=nse-kernel -n ${NAMESPACE} --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}') TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:16Z level=info msg=kubectl exec ${NSC} -n ${NAMESPACE} -- ping -c 4 172.16.1.100 TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:29Z level=info msg=PING 172.16.1.100 (172.16.1.100): 56 data bytes

--- 172.16.1.100 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss TestRunHealSuite/TestRemote_nsmgr_restart=stdout
time=2021-08-30T09:26:29Z level=info msg=command terminated with exit code 1 TestRunHealSuite/TestRemote_nsmgr_restart=stderr
time=2021-08-30T09:26:29Z level=info msg=1 TestRunHealSuite/TestRemote_nsmgr_restart=exitCode
time=2021-08-30T09:26:29Z level=info msg=kubectl exec ${NSC} -n ${NAMESPACE} -- ping -c 4 172.16.1.100 TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:43Z level=info msg=PING 172.16.1.100 (172.16.1.100): 56 data bytes

--- 172.16.1.100 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss TestRunHealSuite/TestRemote_nsmgr_restart=stdout
time=2021-08-30T09:26:43Z level=info msg=command terminated with exit code 1 TestRunHealSuite/TestRemote_nsmgr_restart=stderr
time=2021-08-30T09:26:43Z level=info msg=1 TestRunHealSuite/TestRemote_nsmgr_restart=exitCode
time=2021-08-30T09:26:43Z level=info msg=kubectl exec ${NSC} -n ${NAMESPACE} -- ping -c 4 172.16.1.100 TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:26:56Z level=info msg=PING 172.16.1.100 (172.16.1.100): 56 data bytes

--- 172.16.1.100 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss TestRunHealSuite/TestRemote_nsmgr_restart=stdout
time=2021-08-30T09:26:56Z level=info msg=command terminated with exit code 1 TestRunHealSuite/TestRemote_nsmgr_restart=stderr
time=2021-08-30T09:26:56Z level=info msg=1 TestRunHealSuite/TestRemote_nsmgr_restart=exitCode
time=2021-08-30T09:26:56Z level=info msg=kubectl exec ${NSC} -n ${NAMESPACE} -- ping -c 4 172.16.1.100 TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:27:10Z level=info msg=PING 172.16.1.100 (172.16.1.100): 56 data bytes

--- 172.16.1.100 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss TestRunHealSuite/TestRemote_nsmgr_restart=stdout
time=2021-08-30T09:27:10Z level=info msg=command terminated with exit code 1 TestRunHealSuite/TestRemote_nsmgr_restart=stderr
time=2021-08-30T09:27:10Z level=info msg=1 TestRunHealSuite/TestRemote_nsmgr_restart=exitCode
time=2021-08-30T09:27:10Z level=info msg=kubectl exec ${NSC} -n ${NAMESPACE} -- ping -c 4 172.16.1.100 TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:27:23Z level=info msg=PING 172.16.1.100 (172.16.1.100): 56 data bytes

--- 172.16.1.100 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss TestRunHealSuite/TestRemote_nsmgr_restart=stdout
time=2021-08-30T09:27:23Z level=info msg=command terminated with exit code 1 TestRunHealSuite/TestRemote_nsmgr_restart=stderr
time=2021-08-30T09:27:23Z level=info msg=1 TestRunHealSuite/TestRemote_nsmgr_restart=exitCode
time=2021-08-30T09:27:23Z level=error msg=command didn't succeed until timeout cmd=kubectl exec ${NSC} -n ${NAMESPACE} -- ping -c 4 172.16.1.100
suite.go:122:
Error Trace:    suite.go:122
suite.gen.go:224
Error:          Not equal:
expected: 0
actual  : 1
Test:           TestRunHealSuite/TestRemote_nsmgr_restart
time="2021-08-30T09:27:23Z" level=error msg="nse-kernel-7475549f6-tgf7m: An error while retrieving logs: the server rejected our request for an unknown reason (get pods nse-kernel-7475549f6-tgf7m)"
time="2021-08-30T09:27:23Z" level=error msg="admission-webhook-k8s-c6546cd97-h844v: An error while retrieving logs: the server rejected our request for an unknown reason (get pods admission-webhook-k8s-c6546cd97-h844v)"
time="2021-08-30T09:27:23Z" level=error msg="nsc-kernel-749f5cd847-gcnbx: An error while retrieving logs: the server rejected our request for an unknown reason (get pods nsc-kernel-749f5cd847-gcnbx)"
time="2021-08-30T09:27:23Z" level=error msg="forwarder-vpp-7pqjc: An error while retrieving logs: the server rejected our request for an unknown reason (get pods forwarder-vpp-7pqjc)"
time="2021-08-30T09:27:23Z" level=error msg="registry-k8s-5c7c66979c-2qhcq: An error while retrieving logs: the server rejected our request for an unknown reason (get pods registry-k8s-5c7c66979c-2qhcq)"
time="2021-08-30T09:27:23Z" level=error msg="nsmgr-wzswf: An error while retrieving logs: the server rejected our request for an unknown reason (get pods nsmgr-wzswf)"
time="2021-08-30T09:27:23Z" level=error msg="forwarder-vpp-pdglf: An error while retrieving logs: the server rejected our request for an unknown reason (get pods forwarder-vpp-pdglf)"
time="2021-08-30T09:27:23Z" level=error msg="nsmgr-gd6nx: An error while retrieving logs: the server rejected our request for an unknown reason (get pods nsmgr-gd6nx)"
time=2021-08-30T09:27:23Z level=info msg=kubectl delete ns ${NAMESPACE} TestRunHealSuite/TestRemote_nsmgr_restart=stdin
time=2021-08-30T09:27:35Z level=info msg=namespace "ns-ghmmq" deleted TestRunHealSuite/TestRemote_nsmgr_restart=stdout
time="2021-08-30T09:27:35Z" level=error msg="admission-webhook-k8s-c6546cd97-h844v: An error while retrieving logs: the server rejected our request for an unknown reason (get pods admission-webhook-k8s-c6546cd97-h844v)"
time="2021-08-30T09:27:35Z" level=error msg="nsmgr-wzswf: An error while retrieving logs: the server rejected our request for an unknown reason (get pods nsmgr-wzswf)"
time="2021-08-30T09:27:36Z" level=error msg="registry-k8s-5c7c66979c-2qhcq: An error while retrieving logs: the server rejected our request for an unknown reason (get pods registry-k8s-5c7c66979c-2qhcq)"
time="2021-08-30T09:27:36Z" level=error msg="forwarder-vpp-7pqjc: An error while retrieving logs: the server rejected our request for an unknown reason (get pods forwarder-vpp-7pqjc)"
time="2021-08-30T09:27:36Z" level=error msg="forwarder-vpp-pdglf: An error while retrieving logs: the server rejected our request for an unknown reason (get pods forwarder-vpp-pdglf)"
time="2021-08-30T09:27:36Z" level=error msg="nsmgr-gd6nx: An error while retrieving logs: the server rejected our request for an unknown reason (get pods nsmgr-gd6nx)"
time=2021-08-30T09:27:36Z level=info msg=kubectl delete mutatingwebhookconfiguration --all
kubectl delete ns nsm-system TestRunHealSuite=stdin
time=2021-08-30T09:28:09Z level=info msg=mutatingwebhookconfiguration.admissionregistration.k8s.io "admission-webhook-k8s-c6546cd97-h844v" deleted
mutatingwebhookconfiguration.admissionregistration.k8s.io "pod-ready.config.common-webhooks.networking.gke.io" deleted
namespace "nsm-system" deleted TestRunHealSuite=stdout
time=2021-08-30T09:28:09Z level=info msg=kubectl delete ns spire TestRunHealSuite=stdin
time=2021-08-30T09:28:26Z level=info msg=namespace "spire" deleted TestRunHealSuite=stdout
--- FAIL: TestRunHealSuite/TestRemote_nsmgr_restart (86.66s)

Build

https://github.com/networkservicemesh/integration-k8s-gke/runs/3459839009

Artifacts

artifacts.zip

denis-tingaikin commented 3 years ago

@edwarnicke Should we address this now, or should we wait for the heal rework with begin? Please share your thoughts.

edwarnicke commented 2 years ago

@denis-tingaikin Is this still an active issue?

denis-tingaikin commented 2 years ago

@edwarnicke Yes.

I found a concrete problem:

Jan 24 10:02:12.221 [TRAC] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (61)                                                             ⎆ sdk/pkg/networkservice/core/next/nextClient.Request()
Jan 24 10:02:12.221 [TRAC] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (62)                                                              ⎆ sdk-vpp/pkg/networkservice/mechanisms/vxlan/vxlanClient.Request()
Jan 24 10:02:12.221 [TRAC] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (63)                                                               ⎆ sdk-vpp/pkg/networkservice/mechanisms/vxlan/mtu/mtuClient.Request()
Jan 24 10:02:12.221 [ERRO] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (63.1)                                                                 error attempting to get interface dump client to determine MTU for tunnelIP "10.128.0.57": context deadline exceeded
Jan 24 10:02:12.221 [TRAC] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (62.1)                                                                request-response=null
Jan 24 10:02:12.221 [ERRO] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (62.2)                                                                error attempting to get interface dump client to determine MTU for tunnelIP "10.128.0.57": context deadline exceeded
Jan 24 10:02:12.221 [ERRO] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (61.1)                                                               error attempting to get interface dump client to determine MTU for tunnelIP "10.128.0.57": context deadline exceeded
Jan 24 10:02:12.221 [ERRO] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (60.2)                                                              error attempting to get interface dump client to determine MTU for tunnelIP "10.128.0.57": context deadline exceeded
Jan 24 10:02:12.221 [ERRO] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (59.1)                                                             error attempting to get interface dump client to determine MTU for tunnelIP "10.128.0.57": context deadline exceeded
Jan 24 10:02:12.222 [ERRO] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (58.1)                                                            error attempting to get interface dump client to determine MTU for tunnelIP "10.128.0.57": context deadline exceeded
Jan 24 10:02:12.222 [ERRO] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (57.1)                                                           error attempting to get interface dump client to determine MTU for tunnelIP "10.128.0.57": context deadline exceeded
Jan 24 10:02:12.222 [ERRO] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (56.1)                                                          error attempting to get interface dump client to determine MTU for tunnelIP "10.128.0.57": context deadline exceeded
Jan 24 10:02:12.222 [ERRO] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (55.2)                                                         error attempting to get interface dump client to determine MTU for tunnelIP "10.128.0.57": context deadline exceeded
Jan 24 10:02:12.222 [ERRO] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (54.1)                                                        error attempting to get interface dump client to determine MTU for tunnelIP "10.128.0.57": context deadline exceeded
Jan 24 10:02:12.222 [ERRO] [id:bc136f66-909c-4c86-a9c6-689519f245c9] [type:networkService] (53.1)                                                       error attempting to get interface dump client to determine MTU for tunnelIP "10.128.0.57": context deadline

So the concrete failure is "error attempting to get interface dump client to determine MTU for tunnelIP" (context deadline exceeded).

forwarder-vpp-kcwxn.log

Looking into go-vpp.
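
A quick cluster-side cross-check while digging into go-vpp (a sketch; it assumes vppctl is available inside the forwarder-vpp image, which may not hold for every build):

kubectl exec -n nsm-system forwarder-vpp-kcwxn -- vppctl show version     # is the VPP API responsive at all?
kubectl exec -n nsm-system forwarder-vpp-kcwxn -- vppctl show interface   # the MTUs the dump should have returned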