networkservicemesh / deployments-k8s

Apache License 2.0
42 stars 34 forks source link

Generic healing should work with ovs forwarder #9778

Open ljkiraly opened 1 year ago

ljkiraly commented 1 year ago

Expected Behavior

NSM should restore after forwarder death.

Current Behavior

Basic setup of NSM was used, containing forwarder-ovs . The connection between NSC and NSE is using local 'kernel' mechanism. After forwarder restart (local or remote) the connection does not restored (checked more then 10 minutes after forwarder restart)

Failure Information

No data path healing is expected, but the generic healing should work.

Steps to Reproduce

  1. Start a basic NSM setup with ovs-forwarder
kubectl apply -k https://github.com/networkservicemesh/deployments-k8s/examples/ovs?ref=2ea1a26c7ddebd212224560e2706d9f1b03f140b
  1. Start the NSE and NSC based on the case 'Test local forwarder death'
kubectl apply -k https://github.com/networkservicemesh/deployments-k8s/examples/heal/local-forwarder-death?ref=2ea1a26c7ddebd212224560e2706d9f1b03f140b

# Wait for applications ready:
kubectl wait --for=condition=ready --timeout=1m pod -l app=alpine -n ns-local-forwarder-death
kubectl wait --for=condition=ready --timeout=1m pod -l app=nse-kernel -n ns-local-forwarder-death

 # Ping from NSC to NSE:
kubectl exec pods/alpine -n ns-local-forwarder-death -- ping -c 4 172.16.1.100

 # Ping from NSE to NSC:
kubectl exec deployments/nse-kernel -n ns-local-forwarder-death -- ping -c 4 172.16.1.101
  1. Delete the local forwarder
# Find nsc node:
NSC_NODE=$(kubectl get pods -l app=alpine -n ns-local-forwarder-death --template '{{range .items}}{{.spec.nodeName}}{{"\n"}}{{end}}')

 # Find local Forwarder:
FORWARDER=$(kubectl get pods -l 'app in (forwarder-vpp,forwarder-ovs)' --field-selector spec.nodeName==${NSC_NODE} -n nsm-system --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')

# Remove local Forwarder and wait for a new one to start:
kubectl delete pod -n nsm-system ${FORWARDER}

kubectl wait --for=condition=ready --timeout=1m pod -l 'app in (forwarder-vpp,forwarder-ovs)' --field-selector spec.nodeName==${NSC_NODE} -n nsm-system
  1. Wait some time and check if the connection restores
 # Ping from NSC to NSE:
kubectl exec pods/alpine -n ns-local-forwarder-death -- ping -c 4 172.16.1.100

 # Ping from NSE to NSC:
kubectl exec deployments/nse-kernel -n ns-local-forwarder-death -- ping -c 4 172.16.1.101

Failure Logs

local-forwarder-death.zip

ljkiraly commented 1 year ago

Seems that NSM_MAX_TOKEN_LIFETIME default setting is 24h instead of 10 minutes lifetime used in vpp forwarder. I am to repeat the test with the NSM_MAX_TOKEN_LIFETIME=10min setting.

ljkiraly commented 1 year ago

Same result with NSM_MAX_TOKEN_LIFETIME=10min setting.

denis-tingaikin commented 1 year ago

@ljkiraly Could you also attach logs from the NSC? It could help with investigation.

ljkiraly commented 1 year ago

@denis-tingaikin Just realized that I missed to add the logs to the zip file. I updated the attachment and I adding the nsc logs here as well. BTW seems to me that the forwarder set the interfaces as expected. NSC detects data path failure and reset the connection.

{forwarder log}
Sep 12 16:53:02.609 [TRAC] [id:81d3be5c-cad4-4811-a6ef-5c4da4306c30] [type:networkService] (57.1)                                                           request-response={"id":"81d3be5c-cad4-4811-a6ef-5c4da4306c30","network_service":"local-forwarder-death","mechanism":{"cls":"LOCAL","type":"KERNEL","parameters":{"inodeURL":"inode://4/4026534093","name":"local-forw-f280","supportsVlan":"true"}},"context":{"ip_context":{"src_ip_addrs":["172.16.1.1/32"],"dst_ip_addrs":["172.16.1.0/32"],"src_routes":[{"prefix":"172.16.1.0/32"}],"dst_routes":[{"prefix":"172.16.1.1/32"}],"excluded_prefixes":["10.96.0.0/16","10.244.0.0/16"]},"dns_context":{}},"labels":{"nodeName":"kind-worker","podName":"alpine"},"path":{"index":2,"path_segments":[{"name":"alpine-b626de3e-acbb-4b4e-9c31-a81a43ca7f2b","id":"alpine-b626de3e-acbb-4b4e-9c31-a81a43ca7f2b-0","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9rOHMubnNtL25zL25zLWxvY2FsLWZvcndhcmRlci1kZWF0aC9wb2QvYWxwaW5lIiwiYXVkIjpbInNwaWZmZTovL2s4cy5uc20vbnMvbnNtLXN5c3RlbS9wb2QvbnNtZ3ItZjR2NGgiXSwiZXhwIjoxNjk0NTM4MTgyfQ.oYJibfRxCXQIcNEyvF_lWY-fMjzuBaktrGXvDcNnznI6X4z44MNRCtVd-NHJOGPtaAoxDUghUDYeE_6WcpFtxQ","expires":{"seconds":1694538182,"nanos":388924276}},{"name":"nsmgr-f4v4h","id":"b4b84141-cc9e-4e7b-be7f-cc3b89816e4a","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9rOHMubnNtL25zL25zbS1zeXN0ZW0vcG9kL25zbWdyLWY0djRoIiwiYXVkIjpbInNwaWZmZTovL2s4cy5uc20vbnMvbnNtLXN5c3RlbS9wb2QvZm9yd2FyZGVyLW92cy1iOTcybiJdLCJleHAiOjE2OTQ1MzgxODJ9.l5ssrZ5w4uhRW30WnjD3XAvti8S7TvszBtkw6ogCyltUwNZgX8TgRXQaqgEuBv9Zrv31Kl-8dhLWcnncrvMGfw","expires":{"seconds":1694538182,"nanos":407122067}},{"name":"forwarder-ovs-b972n","id":"81d3be5c-cad4-4811-a6ef-5c4da4306c30","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9rOHMubnNtL25zL25zbS1zeXN0ZW0vcG9kL2ZvcndhcmRlci1vdnMtYjk3Mm4iLCJhdWQiOlsic3BpZmZlOi8vazhzLm5zbS9ucy9ucy1sb2NhbC1mb3J3YXJkZXItZGVhdGgvcG9kL25zZS1rZXJuZWwtNmQ2NDc3ODc1Zi1ndGdkNyJdLCJleHAiOjE2OTQ1MzgxODJ9.ghEOIYMc0eYbRBxo269dicBL-_40o5vPxIwGQecEAQ5roOqR1FQHvWKQ3SKU3wrGNf_ylkbykrM7nVvR2VRS1g","expires":{"seconds":1694538182,"nanos":482070996}},{"name":"nse-kernel-6d6477875f-gtgd7","id":"f2807884-5a41-4d88-9a42-58347aada4c1","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9rOHMubnNtL25zL25zLWxvY2FsLWZvcndhcmRlci1kZWF0aC9wb2QvbnNlLWtlcm5lbC02ZDY0Nzc4NzVmLWd0Z2Q3IiwiYXVkIjpbInNwaWZmZTovL2s4cy5uc20vbnMvbnNtLXN5c3RlbS9wb2QvZm9yd2FyZGVyLW92cy1iOTcybiJdLCJleHAiOjE2OTQ1MzgxODJ9.V7oLhExCJrFkTwZYIG-LASq5vMT4HpMW-szmg2S5HVMJ8Qyww8cRK0_sTtX7bBQz9_uRnWnJw5Pb8TViuRW4jw","expires":{"seconds":1694538182,"nanos":485355075}}]},"network_service_endpoint_name":"nse-kernel-6d6477875f-gtgd7","payload":"ETHERNET"}
Sep 12 16:53:02.611 [TRAC] [id:81d3be5c-cad4-4811-a6ef-5c4da4306c30] [type:networkService] (53.1)                                                       request-response-diff={"mechanism":{"parameters":{"inodeURL":"file:///proc/799735/fd/16"}}}
Sep 12 16:53:02.649 [INFO] [id:81d3be5c-cad4-4811-a6ef-5c4da4306c30] [type:networkService] (42.1)                                            /proc/sys/net/ipv4/ping_group_range was set to 0 2147483647
Sep 12 16:53:02.653 [DEBU] [id:81d3be5c-cad4-4811-a6ef-5c4da4306c30] [Dst:172.16.1.1/32] [Flags:0] [Gw:<nil>] [Scope:link] [duration:531.547µs] [link.Name:local-forw-f280] [netlink:RouteReplace] [type:networkService] (38.1)                                        completed
Sep 12 16:53:02.654 [DEBU] [id:81d3be5c-cad4-4811-a6ef-5c4da4306c30] [Dst:172.16.1.1/32] [Flags:0] [Gw:<nil>] [Scope:universe] [duration:145.383µs] [link.Name:local-forw-f280] [netlink:RouteReplace] [type:networkService] (38.2)                                        completed
Sep 12 16:53:02.798 [TRAC] [id:81d3be5c-cad4-4811-a6ef-5c4da4306c30] [type:networkService] (33.2)                                   request-response-diff={"mechanism":{"parameters":{"-supportsVlan":"true","inodeURL":"file:///proc/799735/fd/13","name":"nsm-1"}}}
Sep 12 16:53:02.801 [TRAC] [id:81d3be5c-cad4-4811-a6ef-5c4da4306c30] [type:networkService] (24.1)                          request-diff={"connection":{"mechanism":{"cls":"LOCAL","parameters":{"+inodeURL":"file:///proc/799735/fd/13","+name":"nsm-1"},"type":"KERNEL"},"path":{"path_segments":{"2":{"expires":{"nanos":482070996},"token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9rOHMubnNtL25zL25zbS1zeXN0ZW0vcG9kL2ZvcndhcmRlci1vdnMtYjk3Mm4iLCJhdWQiOlsic3BpZmZlOi8vazhzLm5zbS9ucy9ucy1sb2NhbC1mb3J3YXJkZXItZGVhdGgvcG9kL25zZS1rZXJuZWwtNmQ2NDc3ODc1Zi1ndGdkNyJdLCJleHAiOjE2OTQ1MzgxODJ9.ghEOIYMc0eYbRBxo269dicBL-_40o5vPxIwGQecEAQ5roOqR1FQHvWKQ3SKU3wrGNf_ylkbykrM7nVvR2VRS1g"},"3":{"expires":{"nanos":485355075,"seconds":1694538182},"token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9rOHMubnNtL25zL25zLWxvY2FsLWZvcndhcmRlci1kZWF0aC9wb2QvbnNlLWtlcm5lbC02ZDY0Nzc4NzVmLWd0Z2Q3IiwiYXVkIjpbInNwaWZmZTovL2s4cy5uc20vbnMvbnNtLXN5c3RlbS9wb2QvZm9yd2FyZGVyLW92cy1iOTcybiJdLCJleHAiOjE2OTQ1MzgxODJ9.V7oLhExCJrFkTwZYIG-LASq5vMT4HpMW-szmg2S5HVMJ8Qyww8cRK0_sTtX7bBQz9_uRnWnJw5Pb8TViuRW4jw"}}},"state":0},"mechanism_preferences":{"0":{"parameters":{"+inodeURL":"file:///proc/799735/fd/13","+name":"nsm-1","-supportsVlan":"true"}}}}
Sep 12 16:53:02.802 [INFO] [id:81d3be5c-cad4-4811-a6ef-5c4da4306c30] [type:networkService] (23.1)                         /proc/sys/net/ipv4/ping_group_range was set to 0 2147483647
Sep 12 16:53:03.819 [DEBU] [id:81d3be5c-cad4-4811-a6ef-5c4da4306c30] [Dst:172.16.1.0/32] [Flags:0] [Gw:<nil>] [Scope:link] [duration:127.581µs] [link.Name:nsm-1] [netlink:RouteReplace] [type:networkService] (21.1)                       completed
Sep 12 16:53:03.819 [DEBU] [id:81d3be5c-cad4-4811-a6ef-5c4da4306c30] [Dst:172.16.1.0/32] [Flags:0] [Gw:<nil>] [Scope:universe] [duration:85.979µs] [link.Name:nsm-1] [netlink:RouteReplace] [type:networkService] (21.2)                       completed
Sep 12 16:53:03.824 [TRAC] [id:81d3be5c-cad4-4811-a6ef-5c4da4306c30] [type:networkService] (11.1)             request-response-diff={"mechanism":{"parameters":{"inodeURL":"inode://4/4026532527"}}}
Sep 12 16:51:38.061 [ERRO] [id:alpine-b626de3e-acbb-4b4e-9c31-a81a43ca7f2b-0] [dstIP:172.16.1.0] [srcIP:172.16.1.1] [type:networkService] (7.1)         No packets received
Sep 12 16:51:38.061 [WARN] [id:alpine-b626de3e-acbb-4b4e-9c31-a81a43ca7f2b-0] [heal:eventLoop] [type:networkService] (7.2)         Data plane is down
Sep 12 16:51:38.061 [DEBU] [id:alpine-b626de3e-acbb-4b4e-9c31-a81a43ca7f2b-0] [heal:eventLoop] [type:networkService] (7.3)         Reconnect with reselect
Sep 12 16:51:38.061 [DEBU] [id:alpine-b626de3e-acbb-4b4e-9c31-a81a43ca7f2b-0] [heal:eventLoop] [type:networkService] (7.4)         received eventIn: <nil>, err: rpc error: code = Canceled desc = context canceled

alpine-cmd-nsc.txt

NikitaSkrynnik commented 1 year ago

@ljkiraly Hello, it looks like there is a problem with closing a connection. I tested it the next way:

I modified NSC to do:

resp, err := nsmClient.Request(ctx, request)
if err != nil {
    logger.Fatalf("failed connect to NSMgr: %v", err.Error())
}
logger.Infof("successfully connected to %v. Response: %v", u.NetworkService(), resp)

time.Sleep(time.Minute)
_, err = nsmClient.Close(ctx, resp)
if err != nil {
    logger.Infof("Error while closing a connection to %v. %s", u.NetworkService(), err.Error())
}

time.Sleep(time.Second * 15)
resp, err = nsmClient.Request(ctx, request)
if err != nil {
    logger.Fatalf("failed connect to NSMgr: %v", err.Error())
}

NSC fails to do the second request because there is an error in cmd-forwarder-ovs:

failed to find link local-forw-de4f: Link not found

If on the second request I use a new request, it passes without any errors on cmd-forwarder-ovs:

newID := "id-1"
request1 := &networkservice.NetworkServiceRequest{
    Connection: &networkservice.Connection{
        Id:             newID,
        NetworkService: u.NetworkService(),
        Labels:         u.Labels(),
    },
    MechanismPreferences: []*networkservice.Mechanism{
        u.Mechanism(),
    },
}

resp1, err := nsmClient.Request(ctx, request1)
if err != nil {
    logger.Fatalf("failed connect to NSMgr: %v", err.Error())
}

Could you please check all Closes in cmd-forwarder-ovs chain? It looks like cmd-forwarder-ovs doesn't delete something on Close.