shieldworks / aegis

Aegis: Keep Your Secrets… Secret
https://vsecm.com
MIT License
31 stars 3 forks source link

Aegis Safe v0.18.0 appears to be broken #476

Closed v0lkan closed 1 year ago

v0lkan commented 1 year ago

Readiness probe failing; could be related to an environment configuration.

                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  34s               default-scheduler  Successfully assigned aegis-system/aegis-safe-6b4bc89c78-jdbjs to minikube
  Normal   Pulled     3s (x2 over 33s)  kubelet            Container image "localhost:5000/aegis-ist-safe:0.18.0" already present on machine
  Normal   Created    3s (x2 over 33s)  kubelet            Created container main
  Normal   Started    3s (x2 over 33s)  kubelet            Started container main
  Warning  Unhealthy  1s (x7 over 32s)  kubelet            Readiness probe failed: Get "http://10.244.0.9:8082/": dial tcp 10.244.0.9:8082: connect: connection refused
v0lkan commented 1 year ago

Bug confirmed.

There seeems to be something with the bootstrapping sequence:

~/WORKSPACE/aegis (main) 🐢⚡️ k logs aegis-safe-6b4bc89c78-7gpl5 -n aegis-system
2023/07/09 20:15:12 [INFO ] AEGISAFE Acquiring identity…
2023/07/09 20:15:12 [INFO ] AEGISAFE Secret has been set in the cluster, will reuse it
2023/07/09 20:15:12 [INFO ] AEGISAFE Updated age key.
2023/07/09 20:15:12 [INFO ] AEGISAFE remaining: 2
v0lkan commented 1 year ago

more; spire server acting funky:

~/WORKSPACE/aegis (main) 🐢⚡️ k logs  spire-server-6fb4f57c8-6dcpc -n spire-system
Defaulted container "spire-server" out of: spire-server, spire-controller-manager
time="2023-07-09T20:14:32Z" level=warning msg="Current umask 0022 is too permissive; setting umask 0027"
time="2023-07-09T20:14:32Z" level=info msg=Configured admin_ids="[]" data_dir=/run/spire/server/data
time="2023-07-09T20:14:32Z" level=info msg="Opening SQL database" db_type=sqlite3 subsystem_name=sql
time="2023-07-09T20:14:32Z" level=info msg="Initializing new database" subsystem_name=sql
time="2023-07-09T20:14:32Z" level=info msg="Connected to SQL database" read_only=false subsystem_name=sql type=sqlite3 version=3.39.4
time="2023-07-09T20:14:32Z" level=info msg="Plugin loaded" external=false plugin_name=k8s_psat plugin_type=NodeAttestor subsystem_name=catalog
time="2023-07-09T20:14:32Z" level=info msg="Plugin loaded" external=false plugin_name=disk plugin_type=KeyManager subsystem_name=catalog
time="2023-07-09T20:14:32Z" level=info msg="Plugin loaded" external=false plugin_name=k8sbundle plugin_type=Notifier subsystem_name=catalog
time="2023-07-09T20:14:32Z" level=debug msg="Loading journal" path=/run/spire/server/data/journal.pem subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=info msg="Journal loaded" jwt_keys=0 subsystem_name=ca_manager x509_cas=0
time="2023-07-09T20:14:32Z" level=debug msg="Preparing X509 CA" slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=info msg="X509 CA prepared" expiration="2023-07-10 20:14:32 +0000 UTC" issued_at="2023-07-09 20:14:32.460960018 +0000 UTC" self_signed=true slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=info msg="X509 CA activated" expiration="2023-07-10 20:14:32 +0000 UTC" issued_at="2023-07-09 20:14:32.460960018 +0000 UTC" slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=debug msg="Successfully rotated X.509 CA" subsystem_name=ca_manager trust_domain_id="spiffe://aegis.ist" ttl=86399.531431212
time="2023-07-09T20:14:32Z" level=debug msg="Preparing JWT key" slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=info msg="JWT key prepared" expiration="2023-07-10 20:14:32.468688879 +0000 UTC" issued_at="2023-07-09 20:14:32.468688879 +0000 UTC" slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=info msg="JWT key activated" expiration="2023-07-10 20:14:32.468688879 +0000 UTC" issued_at="2023-07-09 20:14:32.468688879 +0000 UTC" slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=debug msg="Rotating server SVID" subsystem_name=svid_rotator
time="2023-07-09T20:14:32Z" level=debug msg="Signed X509 SVID" expiration="2023-07-09T21:14:32Z" spiffe_id="spiffe://aegis.ist/spire/server" subsystem_name=svid_rotator
time="2023-07-09T20:14:32Z" level=info msg="Building in-memory entry cache" subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=info msg="Completed building in-memory entry cache" subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=info msg="Serving bundle endpoint" addr="0.0.0.0:8443" subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=debug msg="Initializing API endpoints" subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=info msg="Starting Server APIs" address="[::]:8081" network=tcp subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=info msg="Starting Server APIs" address=/tmp/spire-server/private/api.sock network=unix subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=debug msg="Initializing health checkers" subsystem_name=health
time="2023-07-09T20:14:32Z" level=info msg="Serving health checks" address="0.0.0.0:8080" subsystem_name=health
time="2023-07-09T20:14:32Z" level=debug msg="Notifier handled event" event="bundle loaded" notifier=k8sbundle subsystem_name=ca_manager
time="2023-07-09T20:14:35Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=cc9eede1-70de-4a89-9ce4-45fd31fe762e serial_num=220647600972895169213478345007152105879 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:14:38Z" level=info msg="Agent attestation request completed" address="10.244.0.1:24714" agent_id="spiffe://aegis.ist/spire/agent/k8s_psat/aegis-cluster/58ec9779-b69b-467f-8daa-6d425168e242" authorized_as=nobody authorized_via= caller_addr="10.244.0.1:24714" method=AttestAgent node_attestor_type=k8s_psat request_id=2213b68f-889d-42f5-b93e-47d7f71f60e6 service=agent.v1.Agent subsystem_name=api
time="2023-07-09T20:16:36Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=5ae39f53-be75-4331-b0a9-c6424eff2f67 serial_num=134583869554082741372914276644777735286 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:19:09Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=f6b507ca-e407-41b1-a50c-72d03c0e5a86 serial_num=274273112340796017457525047861804262516 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:21:51Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=006bbfc9-29c3-40a4-86cf-0f9b619a062f serial_num=161114409812572787252179367528088440062 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:25:00Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=6b1e06f4-bc86-4cca-8a8b-e26813420e50 serial_num=308959287486341623984535810580801604004 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:28:45Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=b9f57687-f86f-4187-b0dd-5ade78ad4718 serial_num=209113354354863514649845506184842233746 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:33:43Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=7585fd84-888b-4ea0-8009-66d554bf682b serial_num=73387018949878136766962404638821297169 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:41:08Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=76b317a5-9da4-401d-9b0f-dbd1893dfc5d serial_num=226485357373726696769043468056246421737 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api

That might also be the reason for safe bootstrap failing.

more:

Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  36m                 default-scheduler  Successfully assigned spire-system/spire-server-6fb4f57c8-6dcpc to minikube
  Normal   Pulling    36m                 kubelet            Pulling image "ghcr.io/spiffe/spire-server:1.6.3"
  Normal   Pulled     36m                 kubelet            Successfully pulled image "ghcr.io/spiffe/spire-server:1.6.3" in 2.192539709s (3.846152012s including waiting)
  Normal   Created    36m                 kubelet            Created container spire-server
  Normal   Started    36m                 kubelet            Started container spire-server
  Normal   Pulling    36m                 kubelet            Pulling image "ghcr.io/spiffe/spire-controller-manager:nightly"
  Normal   Pulled     36m                 kubelet            Successfully pulled image "ghcr.io/spiffe/spire-controller-manager:nightly" in 2.192111448s (2.963491192s including waiting)
  Normal   Created    26m (x5 over 36m)   kubelet            Created container spire-controller-manager
  Normal   Started    26m (x5 over 36m)   kubelet            Started container spire-controller-manager
  Normal   Pulled     26m (x4 over 34m)   kubelet            Container image "ghcr.io/spiffe/spire-controller-manager:nightly" already present on machine
  Warning  BackOff    23s (x75 over 32m)  kubelet            Back-off restarting failed container spire-controller-manager in pod spire-server-6fb4f57c8-6dcpc_spire-system(ed1688e0-1e49-4beb-9585-dbcedebd4af3)

can be related to spire-controller-manager too.

v0lkan commented 1 year ago

I’ll check if there is an update version for SPIRE server.

v0lkan commented 1 year ago

FYI @abhishek44sharma this is related to the issue that you are having:

https://github.com/spiffe/spire-controller-manager/issues/177

To fix this,

1) we need to add the CRD to the yaml files 2) and we need to fix the version of the controller manager to something other than nightly so that similar breaking changes do not impact Aegis Safe.

abhishek44sharma commented 1 year ago

/assign

abhishek44sharma commented 1 year ago

This issue is fixed. For aegis-safe readiness probe failure event logged, since initially it takes time to acquire svid and update secret. Caused delay in readiness probe creation. Eventually readiness probe is coming up fine now. To avoid readiness probe failure event getting logged we can increase initialDelaySeconds time in aegis-safe deployment to avoid this event.

abhishek44sharma commented 1 year ago

This issue is fixed. For aegis-safe readiness probe failure event is logged, since initially it takes time to acquire svid and update secret. Causes delay in readiness probe creation. Eventually readiness probe is coming up fine now. To avoid readiness probe failure event getting logged, we can increase initialDelaySeconds time for readiness probe in aegis-safe deployment.