skupperproject / skupper

Skupper is an implementation of a Virtual Application Network, enabling rich hybrid cloud communication.
http://skupper.io
Apache License 2.0

[v2] Multiple crashes of the router pod before network stabilizes #1716

Open ted-ross opened 1 month ago

ted-ross commented 1 month ago

Describe the bug Using a scripted two-site demonstration setup (script included below), the sites initialize, but the router pods go into crash-loop backoff and restart twice before the network finally stabilizes.

How To Reproduce On a cluster, create namespaces demo-dmz-a and demo-dmz-b and run the provided setup script. Watch the pods and sites to observe crashes prior to network stabilization.

Expected behavior I expect the network to stabilize in an orderly fashion without seeing crash indications.

Environment details

Additional context The error seen in the router log prior to crashing:

2024-10-11 16:02:24.121816 +0000 ROUTER (critical) Router start-up failed: Python: CError: Configuration: Failed to configure TLS caCertFile '/etc/skupper-router-certs/skupper-site-server/ca.crt' from sslProfile 'skupper-site-server'

The script used to reproduce:

NS_PREFIX=demo

for i in dmz-a dmz-b; do
echo "Create site $i in namespace ${NS_PREFIX}-$i"
cat >temp <<EOF
apiVersion: skupper.io/v1alpha1
kind: Site
metadata:
  name: $i
spec:
  routerMode: "interior"
  ha: false
---
apiVersion: skupper.io/v1alpha1
kind: RouterAccess
metadata:
  name: $i-peer
spec:
  generateTlsCredentials: true
  issuer: skupper-site-ca
  roles:
  - name: inter-router
    port: 55671
  tlsCredentials: skupper-site-server
---
apiVersion: skupper.io/v1alpha1
kind: AccessGrant
metadata:
  name: $i-grant
spec:
  redemptionsAllowed: 10
  expirationWindow: 1h
EOF
kubectl apply -n ${NS_PREFIX}-$i -f temp

echo "Generating access token token-$i.yaml"
kubectl wait --for=condition=ready accessgrant/$i-grant -n ${NS_PREFIX}-$i
kubectl get accessgrant $i-grant -n ${NS_PREFIX}-$i -o yaml > temp
URL=$(yq '.status.url' temp)
CODE=$(yq '.status.code' temp)
CA=$(yq -r '.status.ca' temp | awk '{ print "    " $0 }')
cat >token-$i.yaml <<EOF
apiVersion: skupper.io/v1alpha1
kind: AccessToken
metadata:
  name: $i-token
spec:
  url: ${URL}
  code: ${CODE}
  ca: |
${CA}
EOF

rm temp
echo
done

echo Link dmz-b to dmz-a
kubectl apply -f token-dmz-a.yaml -n ${NS_PREFIX}-dmz-b
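The only fiddly part of the token generation above is embedding the multi-line CA certificate under the `ca: |` literal block; the awk stage simply prefixes each PEM line with four spaces so it nests correctly. Isolated as a sketch (the function name `indent4` is mine, not part of the script):

```shell
# Prefix every input line with four spaces so a multi-line value (here, a
# PEM certificate) nests correctly under a YAML literal block scalar (ca: |).
indent4() { awk '{ print "    " $0 }'; }

printf '%s\n' '-----BEGIN CERTIFICATE-----' '-----END CERTIFICATE-----' | indent4
```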
ted-ross commented 1 month ago

For your convenience, here is the cleanup script to undo the reproducer above:

NS_PREFIX=demo

for i in dmz-a dmz-b; do
kubectl delete accessgrant $i-grant -n ${NS_PREFIX}-$i
kubectl delete routeraccess $i-peer -n ${NS_PREFIX}-$i
kubectl delete site $i -n ${NS_PREFIX}-$i
rm token-$i.yaml
done

kubectl delete accesstoken dmz-a-token -n ${NS_PREFIX}-dmz-b
ted-ross commented 1 month ago

Possible root cause:

The router has a new behavior post-3.0.0 (I was running the latest build in this test). sslProfiles now load the referenced certificate files immediately upon configuration. The old behavior was to load the certificates at connection startup for every new connection.

This means that before an sslProfile is created, all of its referenced files must already exist in the file system.
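For reference, the sslProfile as it appears in the generated router configuration looks roughly like the following (the caCertFile path is taken from the error log above; the certFile/privateKeyFile values are illustrative, not copied from a real config). With the post-3.0.0 behavior, all of these paths must resolve to existing files at the moment this entity is configured:

```json
["sslProfile", {
    "name": "skupper-site-server",
    "caCertFile": "/etc/skupper-router-certs/skupper-site-server/ca.crt",
    "certFile": "/etc/skupper-router-certs/skupper-site-server/tls.crt",
    "privateKeyFile": "/etc/skupper-router-certs/skupper-site-server/tls.key"
  }]
```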

In Skupper, the config-sync module stores the current configuration, including sslProfiles, in the router's config-map. When the router starts up, it reads the configuration mounted from that config-map as its initial configuration. The certificate files, however, are not mounted into the router container; they are copied at run time into a shared file system by the config-sync container.

This means that there is a race condition at pod-startup. If the router reads its initial configuration before config-sync can store the certificate files, the router will shut down due to the incomplete configuration.
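If this diagnosis is right, one mitigation would be for the router's entry point (or an init step) to block until the certificate files referenced by the initial configuration are present before launching the router. A minimal sketch in POSIX shell; the function name and the usage paths are hypothetical, not the image's actual entry point:

```shell
# wait_for_files DIR FILE...
# Block until every named file under DIR exists and is non-empty.
# Sketch of a startup guard to close the config-sync race; the real
# router image's entry point may differ.
wait_for_files() {
  dir=$1; shift
  for f in "$@"; do
    until [ -s "$dir/$f" ]; do
      echo "waiting for $dir/$f ..."
      sleep 1
    done
  done
}

# Hypothetical usage before starting the router:
# wait_for_files /etc/skupper-router-certs/skupper-site-server ca.crt tls.crt tls.key
# exec skrouterd -c /etc/skupper-router/skrouterd.json
```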