ted-ross opened 1 month ago
For your convenience... Here is the cleanup script to undo the above reproducer:
```shell
NS_PREFIX=demo

for i in dmz-a dmz-b; do
    kubectl delete accessgrant $i-grant -n ${NS_PREFIX}-$i
    kubectl delete routeraccess $i-peer -n ${NS_PREFIX}-$i
    kubectl delete site $i -n ${NS_PREFIX}-$i
    rm token-$i.yaml
done

kubectl delete accesstoken dmz-a-token -n ${NS_PREFIX}-dmz-b
```
Possible root cause:
The router has new behavior post-3.0.0 (I was running the latest in this test): sslProfiles now load their referenced certificate files immediately upon configuration. The old behavior was to load the certificates at connection startup for every new connection.
This means that before an sslProfile is created, all of its referenced files must already exist in the file system.
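For illustration, this is roughly what an sslProfile entry looks like in the router's JSON configuration; the profile name and file paths below are examples, not the exact ones Skupper generates. With the post-3.0.0 behavior, all three referenced files must exist on disk at the moment this entry is processed:

```json
[
  ["sslProfile", {
    "name": "example-profile",
    "certFile": "/etc/skupper-router-certs/example-profile/tls.crt",
    "privateKeyFile": "/etc/skupper-router-certs/example-profile/tls.key",
    "caCertFile": "/etc/skupper-router-certs/example-profile/ca.crt"
  }]
]
```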
In Skupper, the config-sync module stores the current configuration, including sslProfiles, in the router's config-map. When the router starts up, it reads the configuration mounted from that config-map as its initial configuration. The certificate files, however, are not mounted into the router container. They are copied at run-time into a shared file system by the config-sync container.
This means there is a race condition at pod startup. If the router reads its initial configuration before config-sync has stored the certificate files, the router will shut down due to the incomplete configuration.
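One way to close the race, sketched here as a shell function that an entrypoint wrapper could run before starting the router: block until every file referenced by the sslProfile exists, or give up after a timeout. The certificate file names and the idea of gating startup this way are assumptions for illustration, not the actual Skupper fix.

```shell
# Hypothetical startup gate: wait for the certificate files that
# config-sync copies into the shared file system at run-time.
# Usage: wait_for_certs <cert_dir> [timeout_seconds]
wait_for_certs() {
    cert_dir=$1
    timeout=${2:-60}
    elapsed=0
    # File names are illustrative; the real sslProfile may reference others.
    for f in tls.crt tls.key ca.crt; do
        while [ ! -f "$cert_dir/$f" ]; do
            # Give up once the timeout is exhausted.
            [ "$elapsed" -ge "$timeout" ] && return 1
            sleep 1
            elapsed=$((elapsed + 1))
        done
    done
    return 0
}
```

A wrapper would call `wait_for_certs /etc/skupper-router-certs/<profile> && exec skrouterd ...`, so the router never sees an sslProfile whose files are missing.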
Describe the bug
Using a scripted two-site demonstration setup (script included below), the sites initialize, but the router pods go into crash-loop backoff and restart twice before the network finally stabilizes.
How To Reproduce
On a cluster, create namespaces demo-dmz-a and demo-dmz-b and run the provided setup script. Watch the pods and sites to observe crashes prior to network stabilization.
Expected behavior
I expect the network to stabilize in an orderly fashion without seeing crash indications.
Environment details
Additional context
The error seen in the router log prior to crashing:
The script used to reproduce: