stellar / helm-charts

Helm charts for deploying SDF maintained software

enable zero-downtime deployments for RPC #82

Closed mollykarcher closed 3 months ago

mollykarcher commented 3 months ago

What problem does your feature solve?

In its current form, RPC takes ~30 minutes to deploy new versions to pubnet (thread1, thread2) due to IOPS limits when initializing its in-memory data storage from disk.

What would you like to see?

A new RPC version rolls out, and there's no disruption in service. There is also no loss of transaction/event history upon rollout (that is, the db/history does not reset to nothing).

What alternatives are there?

sreuland commented 3 months ago

I think both options converge to option #2 as a blue/green setup, with two Deployments on the cluster, one for each color. It is not possible to update a single replica (pod) within one Deployment that has replicas set to more than 1: all replicas (pods) inherit the config set on the Deployment (the pod spec defined in the Deployment spec). This is maintained by the Deployment controller, which runs on the cluster and constantly monitors pod states to make sure they match the Deployment spec and its replica count.

sreuland commented 3 months ago

We discussed this more in the platform team meeting, and thanks @mollykarcher for wrangling ideas further on chat. Your summarized 'magic bullet' approach of using the existing StatefulSet with replicas=2 sounds like a viable option to achieve zero downtime during upgrades. It ensures that one ordinal pod is always healthy during an upgrade and routable (included as an Endpoint) on the k8s Service associated with the StatefulSet.
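To make the replicas=2 reasoning concrete, here is a minimal Python simulation of how a StatefulSet RollingUpdate proceeds: highest ordinal first, one pod at a time, waiting for Ready before moving on. The `Pod` class, the `rolling_update` helper, and the version labels are hypothetical and only model the ordering; this is not the chart's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    ordinal: int
    version: str
    ready: bool = True


def rolling_update(pods, new_version):
    """Simulate a StatefulSet RollingUpdate over the given pods.

    Pods are replaced one at a time, from the largest ordinal to the
    smallest. Returns the number of ready pods observed after each step.
    """
    ready_counts = []
    for pod in sorted(pods, key=lambda p: p.ordinal, reverse=True):
        # The old pod is terminated, so it drops out of the Service endpoints.
        pod.ready = False
        ready_counts.append(sum(p.ready for p in pods))
        # The replacement starts; the slow initialization happens here.
        pod.version = new_version
        # The controller waits for Ready before touching the next ordinal.
        pod.ready = True
        ready_counts.append(sum(p.ready for p in pods))
    return ready_counts
```

With two pods the ready count never drops below 1, while a single replica necessarily passes through 0 ready pods during the update.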


So, we should test replicas=2 out in dev to determine whether we can land on that to resolve this issue. One potential caveat of this horizontally scaled model: when both replicas are healthy and routed to by the Service, there may be potential for each instance to be slightly off on their ingested ledger/network states, potentially reporting different responses for same url requests at about the same time. We'd have to see how this looks at run time to know whether it is significant.

Another interesting option, if we want to explore a blue/green or canary approach further, is StatefulSet RollingUpdate partitioning, which seems to provide a basis for either of those.
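As a rough illustration of the partition semantics: with a RollingUpdate partition set, only pods whose ordinal is greater than or equal to the partition are updated, and lower ordinals keep the old spec. The helper name `pods_to_update` is hypothetical, and the sketch assumes ordinals start at 0 (the default).

```python
def pods_to_update(replicas, partition):
    """Return the ordinals a partitioned RollingUpdate would touch.

    Ordinals are listed in update order (largest first). Pods with an
    ordinal below the partition are left on the previous revision.
    """
    return [o for o in range(replicas - 1, -1, -1) if o >= partition]
```

For replicas=2, a partition of 1 would update only pod 1 (a canary), and lowering the partition to 0 afterwards would roll the change out to pod 0 as well.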

mollykarcher commented 3 months ago

...there may be potential for each instance to be slightly off on their ingested ledger/network states, potentially reporting different responses for same url requests at about the same time

I agree that this possibility exists, but let's not over-optimize before we know we have a problem. For now, we might want to just monitor and/or alert on any persistent diff in the LCL (last closed ledger) between the two instances. That could give us a sense of how likely this issue is.
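One way the "persistent diff" check could look, as a sketch only: given paired LCL samples from the two instances, flag a problem when the gap stays above a threshold for several consecutive samples. The function name, the `max_diff` and `min_consecutive` defaults, and the sampling itself are assumptions; how the LCL values are collected is out of scope here.

```python
def persistent_lag(samples, max_diff=2, min_consecutive=3):
    """Return True if the two instances diverge persistently.

    samples: iterable of (lcl_a, lcl_b) ledger sequence pairs, oldest first.
    A divergence counts as persistent when the absolute difference exceeds
    max_diff for at least min_consecutive samples in a row.
    """
    streak = 0
    for lcl_a, lcl_b in samples:
        if abs(lcl_a - lcl_b) > max_diff:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            # The instances caught up, so a transient gap is forgiven.
            streak = 0
    return False
```

Requiring consecutive samples keeps a brief gap (for example, one instance closing a ledger slightly earlier) from triggering an alert.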

We could also probably delay/lessen the effects of this simply by enabling sticky sessions/session affinity on the rpc ingress.
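To show why affinity would reduce the visible effect, here is a small sketch of the idea: the same client is consistently routed to the same instance, so it does not flip between two slightly different ledger states. This uses hash-based routing purely for illustration, which is not necessarily how the rpc ingress would implement it (ingress controllers commonly use cookies); `pick_backend` and the backend names are hypothetical.

```python
import hashlib


def pick_backend(client_id, backends):
    """Deterministically map a client identifier to one backend.

    A SHA-256 digest is used instead of the built-in hash() so the
    mapping is stable across processes and restarts.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

Note that when a pod is removed from the backend list during a rollout, clients mapped to it get re-routed, so affinity lessens the inconsistency rather than eliminating it.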

sreuland commented 3 months ago

results with replicas=2 on dev:

sreuland commented 3 months ago

two thirds of this effort are complete: the k8s resource changes are done on dev cluster here: https://github.com/stellar/kube/pull/2098

the helm-chart update to include the changes: https://github.com/stellar/helm-charts/pull/84

last step will be to merge same change to dev cluster when soroban rpc 21.0.0 is GA.