stellar / helm-charts

Helm charts for deploying SDF maintained software

Stabilize RPC rolling update behavior on k8s #79

Closed sreuland closed 3 months ago

sreuland commented 4 months ago

What problem does your feature solve?

RPC's use of a StatefulSet on k8s may leave potential for problems during rolling upgrades:

What would you like to see?

What alternatives are there?

2opremio commented 4 months ago

We should investigate the pvc-mapping implications of this, though

sreuland commented 4 months ago

We should investigate the pvc-mapping implications of this, though

yes, referring to the PVC lifecycle and how the deployment binds to the PVC? The assumed solution is to define a standalone PVC resource, using the same default storage class, first in the yml manifest, with the deployment defined after it. kubectl apply will then treat the PVC idempotently: it creates the PVC only the first time, when it detects that no PVC with that metadata.name exists; any subsequent apply where a PVC already exists under the same name skips creation, leaving the PVC and the data on its volume as-is. The deployment binds to the single PVC by name, and the deployment will fail if the PVC has not been created first or if the PVC is bound to any other pod at the time.

this task should include the time necessary to test applying the new manifest file via kubectl repeatedly in a dev namespace first, to confirm the outcome is as expected.
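The apply ordering described above could be sketched roughly like this; the names, size, and mount path are hypothetical placeholders, not values taken from the actual chart:

```yaml
# Hypothetical sketch: a standalone PVC defined before the Deployment.
# kubectl apply creates the PVC only when no PVC with this metadata.name
# exists yet; re-applying leaves the existing PVC and its data untouched.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: soroban-rpc-data          # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi              # hypothetical size
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: soroban-rpc
spec:
  replicas: 1
  selector:
    matchLabels: { app: soroban-rpc }
  template:
    metadata:
      labels: { app: soroban-rpc }
    spec:
      containers:
        - name: soroban-rpc
          image: docker.io/stellar/soroban-rpc:20.3.0
          volumeMounts:
            - name: data
              mountPath: /data    # hypothetical mount path
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: soroban-rpc-data   # binds to the standalone PVC by name
```

Because the PVC is a separate named resource rather than generated from a volumeClaimTemplate, repeated applies in the dev namespace should be a no-op for the PVC, which is exactly the behavior the test above is meant to confirm.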

sreuland commented 4 months ago

I have attempted to replicate the scenario where the PVC gets corrupted by a rolling update being triggered on the rpc statefulset.

I was not able to replicate PVC corruption after initiating multiple rolling updates against the rpc statefulset; however, I can consistently replicate an rpc service downtime window of about 2-3 minutes right after a rolling upgrade begins.

I disabled flux on the statefulset first, then updated the statefulset's container image ref, which triggers k8s to initiate a rolling upgrade on the statefulset. I then set up a watch on pod and event activity; this is configured for pubnet:

$ kubectl get pods -n soroban-rpc-pubnet-dev -o wide --watch
NAME                       READY   STATUS    RESTARTS   AGE   IP              NODE                           NOMINATED NODE   READINESS GATES
soroban-rpc-pubnet-dev-0   1/1     Running   0          60m   192.168.38.83   ip-172-22-20-98.ec2.internal   <none>           <none>
.... statefulset is updated, triggers rolling upgrade
soroban-rpc-pubnet-dev-0   1/1     Terminating   0          61m   192.168.38.83   ip-172-22-20-98.ec2.internal   <none>           <none>
soroban-rpc-pubnet-dev-0   0/1     Terminating   0          61m   192.168.38.83   ip-172-22-20-98.ec2.internal   <none>           <none>
soroban-rpc-pubnet-dev-0   0/1     Terminating   0          61m   192.168.38.83   ip-172-22-20-98.ec2.internal   <none>           <none>
soroban-rpc-pubnet-dev-0   0/1     Terminating   0          61m   192.168.38.83   ip-172-22-20-98.ec2.internal   <none>           <none>
soroban-rpc-pubnet-dev-0   0/1     Terminating   0          61m   192.168.38.83   ip-172-22-20-98.ec2.internal   <none>           <none>
soroban-rpc-pubnet-dev-0   0/1     Terminating   0          61m   192.168.38.83   ip-172-22-20-98.ec2.internal   <none>           <none>
soroban-rpc-pubnet-dev-0   0/1     Pending       0          0s    <none>          <none>                         <none>           <none>
soroban-rpc-pubnet-dev-0   0/1     Pending       0          0s    <none>          ip-172-22-20-98.ec2.internal   <none>           <none>
soroban-rpc-pubnet-dev-0   0/1     ContainerCreating   0          0s    <none>          ip-172-22-20-98.ec2.internal   <none>           <none>
soroban-rpc-pubnet-dev-0   0/1     ContainerCreating   0          2s    <none>          ip-172-22-20-98.ec2.internal   <none>           <none>
soroban-rpc-pubnet-dev-0   0/1     Running             0          6s    192.168.38.98   ip-172-22-20-98.ec2.internal   <none>           <none>
... the above activity prints within a couple seconds, then a window of a couple minutes passes before this next print, signaling the rpc http service is up and readiness probe passed
soroban-rpc-pubnet-dev-0   1/1     Running             0          2m51s   192.168.38.98   ip-172-22-20-98.ec2.internal   <none>           <none>
$ kubectl get events -n soroban-rpc-pubnet-dev --watch
LAST SEEN   TYPE      REASON                   OBJECT                         MESSAGE
0s          Normal    SuccessfulDelete         statefulset/soroban-rpc-pubnet-dev   delete Pod soroban-rpc-pubnet-dev-0 in StatefulSet soroban-rpc-pubnet-dev successful
0s          Normal    SuccessfulCreate         statefulset/soroban-rpc-pubnet-dev   create Pod soroban-rpc-pubnet-dev-0 in StatefulSet soroban-rpc-pubnet-dev successful
0s          Normal    Scheduled                pod/soroban-rpc-pubnet-dev-0         Successfully assigned soroban-rpc-pubnet-dev/soroban-rpc-pubnet-dev-0 to ip-172-22-20-205.ec2.internal
0s          Warning   FailedAttachVolume       pod/soroban-rpc-pubnet-dev-0         Multi-Attach error for volume "pvc-cc6b516c-00e7-45e9-a0de-fb4fbb3d6451" Volume is already exclusively attached to one node and can't be attached to another
0s          Normal    SuccessfulAttachVolume   pod/soroban-rpc-pubnet-dev-0         AttachVolume.Attach succeeded for volume "pvc-cc6b516c-00e7-45e9-a0de-fb4fbb3d6451"
0s          Normal    Pulling                  pod/soroban-rpc-pubnet-dev-0         Pulling image "docker.io/stellar/soroban-rpc:20.3.0"
0s          Normal    Pulled                   pod/soroban-rpc-pubnet-dev-0         Successfully pulled image "docker.io/stellar/soroban-rpc:20.3.0" in 3.21401564s (3.214026454s including waiting)
0s          Normal    Created                  pod/soroban-rpc-pubnet-dev-0         Created container soroban-rpc
0s          Normal    Started                  pod/soroban-rpc-pubnet-dev-0         Started container soroban-rpc
0s          Warning   Unhealthy                pod/soroban-rpc-pubnet-dev-0         Readiness probe failed:   
... readiness probe failures repeat, stops once rpc http service passes readiness probe a couple minutes later

What happens in the 2-3 minute window between when the newly spun up statefulset pod first reports readiness 0/1 and when it reports 1/1 is rpc http downtime. I confirmed this by attempting to access the rpc http service on the container and getting connection refused. The new pod is running on the cluster, and the rpc container is running as a process in the pod (verified by exec'ing into the pod/container and seeing the rpc process running), but rpc has not started the http service yet and emits no logs during this initialization time, even with LOG_LEVEL=DEBUG. rpc hasn't started the stellar-core captive process yet either, as core does not appear in the container process list.

This state lasts about 2-3 minutes, during which readiness probe checks for getHealth against the pod keep failing, but the statefulset controller already terminated the older pod earlier as part of the rolling update, as seen in the captured events; only the newer pod remains, in this non-ready state. After 2-3 minutes, the stellar-core process appears in the container, the first log output from rpc appears as "starting Soroban JSON RPC server", and the http process appears on port 8000 and responds to getHealth requests.
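For reference, the probe shape that produces the repeated Unhealthy events above might look something like the following; the path and timings here are hypothetical illustrations, not values copied from the chart. The point is that the pod stays 0/1 for as long as the probe fails, regardless of the probe budget, because the rpc http listener simply is not up yet during initialization:

```yaml
# Hypothetical readiness probe sketch for the rpc container.
# Port 8000 matches where the rpc http service was observed to come up;
# the path and timing values are illustrative assumptions.
readinessProbe:
  httpGet:
    path: /                 # hypothetical; getHealth itself is a JSON-RPC method
    port: 8000
  periodSeconds: 10
  failureThreshold: 3       # pod reports 0/1 while probes fail; traffic is not routed to it
```

With the old pod already deleted by the statefulset controller, there is nothing ready to serve during this window, which is why the failure shows up as service downtime rather than a degraded rollout.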

I think there are two separate/orthogonal fixes:

  1. we continue this ticket, but change the scope to fixing the rpc deployment in general terms to avoid the downtime period during rollout; evaluate whether a Deployment with the Recreate update strategy retains the old pod until the new one passes its readinessProbe.
  2. add more verbose logging in rpc to emit status in the startup routine prior to the http server being started, i.e. log the work being done before "starting Soroban JSON RPC server" is emitted.
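On fix 1, one detail worth noting up front: the Recreate strategy deletes the old pod before creating the new one, so retaining the old pod until the replacement is Ready would instead come from RollingUpdate with a zero-unavailability budget. A minimal sketch of what would need evaluating, with hypothetical values rather than a confirmed chart change:

```yaml
# Hypothetical Deployment update strategy sketch to evaluate for fix 1.
# Recreate terminates existing pods first, so zero downtime would need:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep the old pod serving until the new one is Ready
      maxSurge: 1         # allow one extra pod to exist during the rollout
```

Note that with a single ReadWriteOnce PVC, the surge pod would likely hit the same Multi-Attach error captured in the events above, which is part of what the evaluation needs to cover.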

cc: @2opremio

sreuland commented 4 months ago

I ran some additional tests to verify the 2-3 minute lag behavior seen here on rpc; the behavior only occurs in a k8s pod/container as part of an update rollout. I verified that the same rpc version running outside k8s (quickstart) starts the http and metrics listeners, ingestion (captive core), and logging immediately after the rpc process is started on the o/s.

sreuland commented 3 months ago

we continue this ticket, but change the scope to fix the rpc deployment in general terms to avoid downtime period during rollout, evaluate if Deployment with Recreate update strategy retains the old pod until new passes readinessProbe.

identified the silent startup issue as unrelated to kubernetes; it was due to rpc internal initialization - https://github.com/stellar/soroban-rpc/issues/128

add more verbose logging in rpc to emit status in the startup routine prior to the http being started, i.e. add more verbose logging of stuff being done before starting Soroban JSON RPC server is logged.

done on https://github.com/stellar/soroban-rpc/pull/127

sreuland commented 3 months ago

after identification of https://github.com/stellar/soroban-rpc/issues/128 and https://github.com/stellar/soroban-rpc/pull/127, the remaining scope of this issue is now obsolete; further discussion with the team has narrowed rpc rolling update downtime to be addressed in https://github.com/stellar/helm-charts/issues/82.