networkservicemesh / deployments-k8s

NSC, Error from monitorConnection, RPC error code: PermissionDenied, Desc: no sufficient privileges #7062

Open bilgehan-erman opened 2 years ago

bilgehan-erman commented 2 years ago

The setup is built on the following platform: Kubernetes 1.23, Docker 20.10, Ubuntu 20.04, NSM 1.5.0.

Things work as expected with the existing scripts, with NSC and NSE running in their own separate pods (see related issue #7051).

However, when we try to build NSC and NSE into the same container, we keep getting the NSC error:

[ERRO] [cmd:[/bin/nsc]] error from monitorConnection stream %!(EXTRA string=rpc error: code = PermissionDenied desc = no sufficient privileges)

Could not find any more information on the source of the error.

(It may not be related, but it is very difficult for us to understand the whole SPIRE/NSM integration, how things get wired into the containers, etc. We are really looking forward to some documentation on this.)

The Dockerfile that builds the node containing both the NSC and the NSE is configured as follows:

# Base stage: Go toolchain plus the SPIRE server/agent binaries
FROM golang:1.18.2-buster as go
...
RUN tar xzvf spire-1.2.2-linux-x86_64-glibc.tar.gz -C /bin --strip=2 spire-1.2.2/bin/spire-server spire-1.2.2/bin/spire-agent

# Build the NSC binary
FROM go as nsc
...
RUN go build -o /nsc .

# Build the NSE binary
FROM go as nse
...
RUN go build -o /nse .

# Final image: both binaries in one container, started via supervisord
FROM ubuntu:18.04
...
COPY --from=nsc /nsc /bin/nsc
COPY --from=nse /nse /bin/nse
...
CMD ["/usr/bin/supervisord"]

The use case is to build topologies from nodes that each have both NSC and NSE capabilities dynamically: many nodes, each a single container, with an ETHERNET payload. Therefore, workarounds such as sidecars or separate nsc/nse roles are not workable options.

Any help will be very much appreciated. Thank you in advance.

Attachments: nsc-rpc-auth-problem.txt, nsc.log, nse.log, kubectl-get-pods.txt, registry-k8s.log, forwarder-vpp.log, forwarder-sriov.log, nsmgr-nsmgr.log, nsm-test-setup.zip

bilgehan-erman commented 2 years ago

In the provided test results, both the nsc and the nse nodes (unintentionally) provide the "icmp-responder" service, which originated from the default config. We corrected this after the logs were captured; same result, we still get the error. Also, in the actual topology configuration there are no redundant service offers: although each node provides and consumes services, each service name is unique, based on the node id.

denis-tingaikin commented 2 years ago

/cc @glazychev-art, @anastasia-malysheva Could this be related to the monitor OPA stuff?

glazychev-art commented 2 years ago

Thanks for the detailed information!

Most likely yes, it is related to OPA for monitoring. But it is not actually an error if you are not using the init container (cmd-nsc-init). We probably need to reword the error message so that it does not mislead anyone.
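For reference, a client pod that uses the cmd-nsc-init init container looks roughly like the sketch below; the image names, tags, and service value are assumptions, and the SPIFFE/NSM socket volume mounts that a real deployment needs are omitted for brevity:

apiVersion: v1
kind: Pod
metadata:
  name: nsc
spec:
  # cmd-nsc-init issues the initial Request; without it, the main cmd-nsc
  # container starts with connection monitoring, which the monitor OPA
  # policy rejects because there is no prior Request to authorize against.
  initContainers:
    - name: nsc-init
      image: ghcr.io/networkservicemesh/cmd-nsc-init:v1.5.0   # assumed image/tag
      env:
        - name: NSM_NETWORK_SERVICES
          value: kernel://icmp-responder/nsm-1                # assumed value
  containers:
    - name: nsc
      image: ghcr.io/networkservicemesh/cmd-nsc:v1.5.0        # assumed image/tag
      env:
        - name: NSM_NETWORK_SERVICES
          value: kernel://icmp-responder/nsm-1                # assumed value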

edwarnicke commented 2 years ago

@glazychev-art Any idea what the root cause might be? I'm not entirely sure why we would be seeing this; do you have a more specific idea?

glazychev-art commented 2 years ago

@edwarnicke cmd-nsc monitors connections before making the Request. This is necessary to pick up the existing connection if a cmd-nsc-init container ran before it. But, as you know, you've implemented an open policy for monitoring, and it is based on the SPIFFE ID from the Request. If we did not have an init container, then there were no Requests either, and an authorization error is returned.

But here the problem is different - as I see from the logs, there are many healing errors:

Aug 15 22:00:44.223 [WARN] [id:nsc-858c5dc57-2bf6l-0] [heal:eventLoop] [type:networkService] (7.1)         Data plane is down
Aug 15 22:00:44.223 [DEBU] [id:nsc-858c5dc57-2bf6l-0] [heal:eventLoop] [type:networkService] (7.2)         Reconnect with reselect

We need to figure out why this is happening.

denis-tingaikin commented 2 years ago

That's an interesting scenario where we're trying to run nsc/nse together in the same container.

@bilgehan-erman

  1. Do you have a diagram/scheme/proposal of what you finally want to get?
  2. Did ping work for your scenario?
  3. I think that the problem with authz is related to dp healing. Could you re-test the setup with dp heal disabled (https://github.com/networkservicemesh/cmd-nsc/blob/main/internal/config/config.go#L49)? That means setting the env NSM_LIVENESS_CHECK_ENABLED=false; see the sketch below.
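A hedged sketch of where that setting goes in the client container spec; the container name and image are assumptions, only the env entry matters:

containers:
  - name: nsc
    image: ghcr.io/networkservicemesh/cmd-nsc:v1.5.0   # assumed image/tag
    env:
      # disables the data-plane liveness check that triggers "dp heal"
      - name: NSM_LIVENESS_CHECK_ENABLED
        value: "false"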

bilgehan-erman commented 2 years ago

@denis-tingaikin

Unfortunately, setting NSM_LIVENESS_CHECK_ENABLED to false did not seem to help: nsc.log, nse.log

Did ping work for your scenario?

At the NSC, we cannot get to a point where we can try anything because of the error.

Do you have a diagram/scheme/proposal of what you finally want to get?

This is the test scenario:

[image: nsm-test (test scenario diagram)]

And this would be an example of building random topologies using these universal nodes:

[image: topology (example topology built from these universal nodes)]

glazychev-art commented 2 years ago

@bilgehan-erman Sorry to keep you waiting.

I looked at your setup and logs, and I think I understand what's going on. The main cause is that the NSC is trying to connect to itself (to its own endpoint). By the way, I'm not sure that this is even possible, due to routing... In any case, you need a different scenario, and I think selectors can help you with this. I prepared an example based on your last picture:

  1. nsc1 ---> nsc2
  2. nsc1 ---> nsc3
  3. nsc2 ---> nsc3

When I say nsc, I mean nsc+nse in the same pod (like your "node"). I did not build a new image; I just added the client and the endpoint as different containers in one pod. Most likely the yamls can be simplified; I just want to show the idea:

  1. We can separately declare a network service and specify selectors there (netsvc.yaml).
  2. We have 3 pods where we label the NSEs (NSM_LABELS: "dst_endpoint:node*").
  3. And we also specify labels on the NSC, saying who we want to connect to (for example kernel://icmp-responder/nsm-1-2?dst_endpoint=node2). Thanks to the selectors and labels, we will be able to select the desired endpoint (see the sketch after this list).
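To make the idea concrete, here is a rough sketch of the three pieces described above; the field names, images, and values are illustrative assumptions, and the actual yamls in nsc_nse_setup.zip below are authoritative:

# netsvc.yaml: the network service, with a selector on the dst_endpoint label
apiVersion: networkservicemesh.io/v1
kind: NetworkService
metadata:
  name: icmp-responder
spec:
  payload: ETHERNET
  matches:
    # one match per node: a client asking for dst_endpoint=node2 is routed
    # to the endpoint registered with the dst_endpoint:node2 label
    - source_selector:
        dst_endpoint: node2
      routes:
        - destination_selector:
            dst_endpoint: node2

# on the NSE container of "node2": advertise the label the selector matches on
env:
  - name: NSM_LABELS
    value: "dst_endpoint:node2"

# on the NSC container of "node1": request the service, pinned to node2's endpoint
env:
  - name: NSM_NETWORK_SERVICES
    value: "kernel://icmp-responder/nsm-1-2?dst_endpoint=node2"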

So, to try it, you need to:

  1. Deploy spire
  2. Deploy basic NSM
  3. kubectl create ns ns-topology
  4. Apply kustomize file from nsc_nse_setup.zip

nsc_nse_setup.zip (I used the main branch, but I think it will work on 1.5.0 too)

I really hope this helps!

bilgehan-erman commented 2 years ago

@glazychev-art thank you very much for looking into this. I'll try out your suggestions and see how it goes.