nginxinc / nginx-service-mesh

A service mesh powered by NGINX Plus to manage container traffic in Kubernetes environments.
https://docs.nginx.com/nginx-service-mesh
Apache License 2.0
93 stars 30 forks source link

nginx-mesh-cert-reloader-init container failing in nats-server pod #62

Closed pbhalera closed 1 year ago

pbhalera commented 2 years ago

I tried installing nginx service mesh via the helm chart on a k8s 1.21.6-gke.1503 cluster in GKE, but the nats-server pod is stuck in Init: 0/1. I found following in the nginx-mesh-cert-reloader-init init container log

X509SVIDClient cannot connect to the Spire agent: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory". For more information check the logs of the Spire agents and server. X509SVIDClient still waiting for certificates: rpc error: code = PermissionDenied desc = no identity issued. For more information check the logs of the Spire agents and server

on spire-agent i found following error:- time="2022-04-18T23:05:15Z" level=error msg="No identity issued" method=FetchX509SVID pid=1333269 registered=false service=WorkloadAPI subsystem_name=endpoints

On spire-server, there is no error for server. But when checked logs for k8s-workload-registrar container i found following

I0419 05:58:55.799989 41 request.go:665] Waited for 1.028823161s due to client-side throttling, not priority and fairness, request: GET:https://X.X.X.X:443/apis/rbac.authorization.k8s.io/v1?timeout=32s

can you please help me resolve this. any pointers are appritiated. I used 1.4.0 service mesh version.

pbhalera commented 2 years ago

Any help would be appreciated.I am blocked on it currently

kate-osborn commented 2 years ago

Hi @pbhalera,

Can you please provide the following:

Thanks!

jbyers19 commented 2 years ago

@pbhalera, looks like there is an RBAC permissions issue. Does the account you are using to deploy the mesh have admin access to the cluster? If you have the gcloud CLI tool installed, you can create a clusterrolebinding for your account with this command:

kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value core/account)

Alternatively you can pass your user ID directly to the --user flag.

After adding the clusterrolebinding, make sure all the mesh resources are removed and try deploying again.

pbhalera commented 2 years ago

@jbyers19 i did the rolebinding but no luck. Also i just tried the deployment commands from doc with helm chart no additional configuration. Also it didnt create any spiffeids.spiffeid.spiffe.io.

Here are the logs kubectl logs -n spire spire-server-0 --all-containers=true time="2022-04-19T06:20:56Z" level=warning msg="Current umask 0022 is too permissive; setting umask 0027" time="2022-04-19T06:20:56Z" level=info msg=Configured admin_ids="[]" data_dir=/run/spire/data time="2022-04-19T06:20:56Z" level=info msg="Opening SQL database" db_type=sqlite3 subsystem_name=sql time="2022-04-19T06:20:56Z" level=debug msg="Code and DB schema versions are the same. No migration needed" schema=18 subsystem_name=sql version_info=1.2.3 time="2022-04-19T06:20:56Z" level=info msg="Connected to SQL database" read_only=false subsystem_name=sql type=sqlite3 version=3.38.0 time="2022-04-19T06:20:56Z" level=info msg="Plugin loaded" external=false plugin_name=k8s_psat plugin_type=NodeAttestor subsystem_name=catalog time="2022-04-19T06:20:56Z" level=info msg="Plugin loaded" external=false plugin_name=disk plugin_type=KeyManager subsystem_name=catalog time="2022-04-19T06:20:56Z" level=info msg="Plugin loaded" external=false plugin_name=k8sbundle plugin_type=Notifier subsystem_name=catalog time="2022-04-19T06:20:56Z" level=debug msg="Loading journal" path=/run/spire/data/journal.pem subsystem_name=ca_manager time="2022-04-19T06:20:56Z" level=info msg="Journal loaded" jwt_keys=1 subsystem_name=ca_manager x509_cas=1 time="2022-04-19T06:20:56Z" level=info msg="X509 CA activated" expiration="2022-04-20T06:20:23Z" issued_at="2022-04-19T06:20:23Z" slot=A subsystem_name=ca_manager time="2022-04-19T06:20:56Z" level=debug msg="Successfully rotated X.509 CA" subsystem_name=ca_manager trust_domain_id="spiffe://example.org" ttl=86366.244387357 time="2022-04-19T06:20:56Z" level=info msg="JWT key activated" expiration="2022-04-20T06:20:23Z" issued_at="2022-04-19T06:20:23Z" slot=A subsystem_name=ca_manager time="2022-04-19T06:20:56Z" level=debug msg="Rotating server SVID" subsystem_name=svid_rotator time="2022-04-19T06:20:56Z" level=debug msg="Signed X509 SVID" expiration="2022-04-19T07:20:56Z" spiffe_id="spiffe://example.org/spire/server" subsystem_name=svid_rotator time="2022-04-19T06:20:56Z" level=info msg="Building in-memory entry cache" subsystem_name=endpoints time="2022-04-19T06:20:56Z" level=info msg="Completed building in-memory entry cache" subsystem_name=endpoints time="2022-04-19T06:20:56Z" level=debug msg="Initializing API endpoints" subsystem_name=endpoints time="2022-04-19T06:20:56Z" level=info msg="Starting UDS server" address=/tmp/spire-server/private/api.sock subsystem_name=endpoints time="2022-04-19T06:20:56Z" level=info msg="Starting TCP server" address="[::]:8081" subsystem_name=endpoints time="2022-04-19T06:20:56Z" level=debug msg="Starting checker" name=catalog.datastore subsystem_name=health time="2022-04-19T06:20:56Z" level=debug msg="Starting checker" name=server.ca subsystem_name=health time="2022-04-19T06:20:56Z" level=debug msg="Starting checker" name=server.ca.manager subsystem_name=health time="2022-04-19T06:20:56Z" level=debug msg="Starting checker" name=server subsystem_name=health time="2022-04-19T06:20:56Z" level=debug msg="Notifier handled event" event="bundle loaded" notifier=k8sbundle subsystem_name=ca_manager time="2022-04-19T06:20:58Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=k8s-workload-registrar.spire.svc expiration="2022-04-19T07:20:58Z" method=MintX509SVID request_id=a8bb2268-5aef-43eb-a9c2-17dd17098e8d service=svid.v1.SVID spiffe_id="spiffe://example.org/k8s-workload-registrar/demo-cluster/webhook" subject= subsystem_name=api time="2022-04-19T06:20:56Z" level=info msg="Connecting to local registration server socket unix:///tmp/spire-server/private/api.sock" I0419 06:20:57.770866 38 request.go:665] Waited for 1.040246717s due to client-side throttling, not priority and fairness, request: GET:https://x.x.x.x:443/apis/rbac.authorization.k8s.io/v1?timeout=32s time="2022-04-19T06:20:58Z" level=info msg="Initializing SPIFFE ID CRD Mode" time="2022-04-19T06:20:58Z" level=debug msg="Minting new SVID for webhook"

kate-osborn commented 2 years ago

@pbhalera does the nats-server Pod have the label spiffe.io/spiffeid: "true"?

pbhalera commented 2 years ago

Yes. but It seems like k8s-workload-registrar webhook is not scanning it and doesnt issue cert to it.

kate-osborn commented 2 years ago

Right, there should be a spiffe id for the nats-server Pod.

Did you install via the helm repository or with the chart source?

kate-osborn commented 2 years ago

I'm not able to reproduce this error on a standard 1.21.6-gke.1503 cluster. I was able to helm install release 1.4 of the mesh using both methods outlined in the instructions.

Are there any special network or security settings you have configured in your GKE cluster?

pbhalera commented 2 years ago

Not really. I am able to deploy 1.1.0 version without any issue. Only facing error when i try to deploy 1.4 or 1.2.x or 1.3.x

kate-osborn commented 2 years ago

Can you please share the command you are using to install the mesh?

pbhalera commented 2 years ago

helm install nsm nginx-stable/nginx-service-mesh --namespace nginx-mesh --create-namespace --wait

kate-osborn commented 2 years ago

Thanks. Do you also have the spire-server running in the spire namespace?

The logs you posted earlier were from a spire-server in ns spire.

pbhalera commented 2 years ago

No. I removed. The logs which i got as i was trying installing spire separately and testing it

pbhalera commented 2 years ago

I uninstall the 1.1.0 and again installed the 1.4 with helm and here is the logs for spire server time="2022-04-20T21:29:46Z" level=warning msg="Current umask 0022 is too permissive; setting umask 0027" time="2022-04-20T21:29:46Z" level=info msg=Configured admin_ids="[]" data_dir=/run/spire/data time="2022-04-20T21:29:46Z" level=info msg="Opening SQL database" db_type=sqlite3 subsystem_name=sql time="2022-04-20T21:29:46Z" level=info msg="Initializing new database" subsystem_name=sql time="2022-04-20T21:29:46Z" level=info msg="Connected to SQL database" read_only=false subsystem_name=sql type=sqlite3 version=3.36.0 time="2022-04-20T21:29:46Z" level=info msg="Plugin loaded" external=false plugin_name=k8s_psat plugin_type=NodeAttestor subsystem_name=catalog time="2022-04-20T21:29:46Z" level=info msg="Plugin loaded" external=false plugin_name=k8sbundle plugin_type=Notifier subsystem_name=catalog time="2022-04-20T21:29:46Z" level=info msg="Plugin loaded" external=false plugin_name=disk plugin_type=KeyManager subsystem_name=catalog time="2022-04-20T21:29:46Z" level=debug msg="Loading journal" path=/run/spire/data/journal.pem subsystem_name=ca_manager time="2022-04-20T21:29:46Z" level=info msg="Journal loaded" jwt_keys=0 subsystem_name=ca_manager x509_cas=0 time="2022-04-20T21:29:46Z" level=debug msg="Preparing X509 CA" slot=A subsystem_name=ca_manager time="2022-04-20T21:29:46Z" level=info msg="X509 CA prepared" expiration="2022-05-20T21:29:46Z" issued_at="2022-04-20T21:29:46Z" self_signed=true slot=A subsystem_name=ca_manager time="2022-04-20T21:29:46Z" level=info msg="X509 CA activated" expiration="2022-05-20T21:29:46Z" issued_at="2022-04-20T21:29:46Z" slot=A subsystem_name=ca_manager time="2022-04-20T21:29:46Z" level=debug msg="Successfully rotated X.509 CA" subsystem_name=ca_manager trust_domain_id="spiffe://example.org" ttl=2.591999937447756e+06 time="2022-04-20T21:29:46Z" level=debug msg="Preparing JWT key" slot=A subsystem_name=ca_manager time="2022-04-20T21:29:46Z" level=info msg="JWT key prepared" expiration="2022-05-20T21:29:46Z" issued_at="2022-04-20T21:29:46Z" slot=A subsystem_name=ca_manager time="2022-04-20T21:29:46Z" level=info msg="JWT key activated" expiration="2022-05-20T21:29:46Z" issued_at="2022-04-20T21:29:46Z" slot=A subsystem_name=ca_manager time="2022-04-20T21:29:46Z" level=debug msg="Rotating server SVID" subsystem_name=svid_rotator time="2022-04-20T21:29:46Z" level=debug msg="Signed X509 SVID" expiration="2022-04-20T22:29:46Z" spiffe_id="spiffe://example.org/spire/server" subsystem_name=svid_rotator time="2022-04-20T21:29:46Z" level=info msg="Building in-memory entry cache" subsystem_name=endpoints time="2022-04-20T21:29:46Z" level=info msg="Completed building in-memory entry cache" subsystem_name=endpoints time="2022-04-20T21:29:46Z" level=debug msg="Initializing API endpoints" subsystem_name=endpoints time="2022-04-20T21:29:46Z" level=info msg="Starting UDS server" address=/run/spire/sockets/spire-registration.sock subsystem_name=endpoints time="2022-04-20T21:29:46Z" level=info msg="Starting TCP server" address="[::]:8081" subsystem_name=endpoints time="2022-04-20T21:29:46Z" level=debug msg="Notifier handled event" event="bundle loaded" notifier=k8sbundle subsystem_name=ca_manager time="2022-04-20T21:29:47Z" level=debug msg="Starting checker" name=catalog.datastore subsystem_name=health time="2022-04-20T21:29:47Z" level=debug msg="Starting checker" name=server.ca subsystem_name=health time="2022-04-20T21:29:47Z" level=debug msg="Starting checker" name=server.ca.manager subsystem_name=health time="2022-04-20T21:29:47Z" level=debug msg="Starting checker" name=server subsystem_name=health time="2022-04-20T21:29:47Z" level=info msg="Serving health checks" address="0.0.0.0:8082" subsystem_name=health time="2022-04-20T21:29:52Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=k8s-workload-registrar.nginx-mesh.svc expiration="2022-04-20T22:29:52Z" method=MintX509SVID request_id=00ac04d9-a4e0-4119-b243-8b4a14a86527 service=svid.v1.SVID spiffe_id="spiffe://example.org/k8s-workload-registrar/nginx-mesh/webhook" subject= subsystem_name=api time="2022-04-20T21:29:54Z" level=info msg="Agent attestation request completed" address="10.27.160.28:59019" agent_id="spiffe://example.org/spire/agent/k8s_psat/nginx-mesh/0d579b7d-596b-4b32-ba56-7f42c31f8f4b" authorized_as=nobody authorized_via= caller_addr="10.27.160.28:59019" method=AttestAgent node_attestor_type=k8s_psat request_id=a1b2a974-c791-4966-b908-efcae4abe096 service=agent.v1.Agent subsystem_name=api time="2022-04-20T21:29:54Z" level=info msg="Agent attestation request completed" address="172.20.0.1:35165" agent_id="spiffe://example.org/spire/agent/k8s_psat/nginx-mesh/e083743f-5c32-4e9c-8d5d-e5eb09d6c775" authorized_as=nobody authorized_via= caller_addr="172.20.0.1:35165" method=AttestAgent node_attestor_type=k8s_psat request_id=9b361efb-f7bf-4829-be1a-b2bb6f051dbd service=agent.v1.Agent subsystem_name=api time="2022-04-20T21:29:54Z" level=info msg="Agent attestation request completed" address="10.27.160.26:49693" agent_id="spiffe://example.org/spire/agent/k8s_psat/nginx-mesh/09febe13-9891-4dfb-a793-2e34bf2df9fe" authorized_as=nobody authorized_via= caller_addr="10.27.160.26:49693" method=AttestAgent node_attestor_type=k8s_psat request_id=34ff9b6a-5d81-4742-bfa4-71ca40e048d8 service=agent.v1.Agent subsystem_name=api time="2022-04-20T21:29:55Z" level=info msg="Agent attestation request completed" address="10.27.160.8:17467" agent_id="spiffe://example.org/spire/agent/k8s_psat/nginx-mesh/7385cc3a-f8f5-4ede-b993-150162a61650" authorized_as=nobody authorized_via= caller_addr="10.27.160.8:17467" method=AttestAgent node_attestor_type=k8s_psat request_id=8f3af968-6940-480e-b8d1-ff598f790120 service=agent.v1.Agent subsystem_name=api time="2022-04-20T21:29:55Z" level=info msg="Agent attestation request completed" address="10.27.160.48:51805" agent_id="spiffe://example.org/spire/agent/k8s_psat/nginx-mesh/3a4047cc-e0e4-472d-a6ee-3294e10315b9" authorized_as=nobody authorized_via= caller_addr="10.27.160.48:51805" method=AttestAgent node_attestor_type=k8s_psat request_id=017b4416-362f-45ae-8cef-01c42b8fe090 service=agent.v1.Agent subsystem_name=api time="2022-04-20T21:29:55Z" level=info msg="Agent attestation request completed" address="10.27.160.41:60638" agent_id="spiffe://example.org/spire/agent/k8s_psat/nginx-mesh/f23ead58-8fe8-476f-b290-3161fdfcaeb8" authorized_as=nobody authorized_via= caller_addr="10.27.160.41:60638" method=AttestAgent node_attestor_type=k8s_psat request_id=304ea002-3363-412a-a998-a7ec9a85aa6c service=agent.v1.Agent subsystem_name=api time="2022-04-20T21:29:55Z" level=info msg="Agent attestation request completed" address="10.27.160.7:48744" agent_id="spiffe://example.org/spire/agent/k8s_psat/nginx-mesh/ab408c2f-1c4c-4d81-81fe-7a4ab099958d" authorized_as=nobody authorized_via= caller_addr="10.27.160.7:48744" method=AttestAgent node_attestor_type=k8s_psat request_id=ab76f27a-d322-484d-b04a-7a05a681eb42 service=agent.v1.Agent subsystem_name=api time="2022-04-20T21:29:55Z" level=info msg="Agent attestation request completed" address="10.27.160.12:39284" agent_id="spiffe://example.org/spire/agent/k8s_psat/nginx-mesh/84b75f69-e613-47e7-bf62-7a3d93336360" authorized_as=nobody authorized_via= caller_addr="10.27.160.12:39284" method=AttestAgent node_attestor_type=k8s_psat request_id=9284f040-4071-44ec-9354-a8454120a80b service=agent.v1.Agent subsystem_name=api time="2022-04-20T21:29:55Z" level=info msg="Agent attestation request completed" address="10.27.160.51:22730" agent_id="spiffe://example.org/spire/agent/k8s_psat/nginx-mesh/ef9c775f-79f6-4002-9135-72c6e06accad" authorized_as=nobody authorized_via= caller_addr="10.27.160.51:22730" method=AttestAgent node_attestor_type=k8s_psat request_id=a2afb8ac-f7a0-45d4-bdab-05ab4890aaf8 service=agent.v1.Agent subsystem_name=api time="2022-04-20T21:29:46Z" level=info msg="Connecting to local registration server socket unix:///run/spire/sockets/spire-registration.sock" I0420 21:29:47.162803 39 request.go:665] Waited for 1.040539047s due to client-side throttling, not priority and fairness, request: GET:https://172.18.0.1:443/apis/kafka.strimzi.io/v1alpha1?timeout=32s time="2022-04-20T21:29:52Z" level=info msg="Initializing SPIFFE ID CRD Mode" time="2022-04-20T21:29:52Z" level=debug msg="Minting new SVID for webhook"

kate-osborn commented 2 years ago

This log is interesting: GET:https://172.18.0.1:443/apis/kafka.strimzi.io/v1alpha1?timeout=32s time="2022-04-20T21:29:52Z"

Do you have an Apache Kafka cluster running?

pbhalera commented 2 years ago

Yes.

But the throttling error shows specific to api server by getting different kinds

kate-osborn commented 2 years ago

Have you modified the spire-server configuration in any way? Are you trying to integrate spire with kafka?

I'm trying to figure out what is different about your installation than a standard mesh installation on 1.21.6-gke.1503 cluster.

sjberman commented 1 year ago

Closing this as stale.