penumbra-zone / penumbra

Penumbra is a fully private proof-of-stake network and decentralized exchange for the Cosmos ecosystem.
https://penumbra.zone
Apache License 2.0

Replace current `testnet-preview` deployment with new k8s deployment #1659

Closed by hdevalence 1 year ago

hdevalence commented 1 year ago

Is your feature request related to a problem? Please describe.

We should try to move over to the new k8s deployment system built by Strangelove, and start with replacing testnet-preview. The goal of testnet-preview is that it should be an exact preview of what would be deployed if the current state of the main branch were tagged as a release. This ensures that there are no deployment surprises when tagging a release, and allows testing client protocols against the current state of the main branch.

The only difference between testnet-preview and testnet should be that when deploying testnet, we pass the `--preserve-chain-id` parameter to `pd testnet generate` to avoid randomizing the chain ID (since there should only be one deployment per tag).
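A minimal shell sketch of that distinction (the `PENUMBRA_ENV` variable and the surrounding script are hypothetical; only the `--preserve-chain-id` flag to `pd testnet generate` is from the actual CLI):

```shell
# Hypothetical deploy-script fragment: only the release deployment keeps a
# stable chain ID; preview deploys get a freshly randomized one.
PENUMBRA_ENV="${PENUMBRA_ENV:-testnet-preview}"

GENERATE_FLAGS=""
if [ "$PENUMBRA_ENV" = "testnet" ]; then
  # One deployment per tag, so the chain ID must not be randomized.
  GENERATE_FLAGS="--preserve-chain-id"
fi

# Echoed rather than executed, since pd is not assumed to be on PATH here.
echo "pd testnet generate $GENERATE_FLAGS"
```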


hdevalence commented 1 year ago

The current k8s deployment provides TLS access to the Tendermint RPC endpoint (load-balancing over fullnodes). We should provide an additional endpoint that gives TLS access to the pd GRPC endpoint.

We are not yet in a position to use a TLS endpoint from pcli, for kind of boring reasons (we hardcode "http" in a bunch of places, and assume a single host for both tendermint and pd, with endpoints on different ports). But exposing a TLS pd endpoint is important to do now, because we're trying to use grpc-web to access it, and without TLS that's not really possible, due to mixed-content rules.
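For reference, a sketch of what the extra endpoint might look like on the k8s side, assuming a GKE ingress terminates TLS in front of it. All names and labels here are hypothetical, and the 8080 gRPC port is an assumption, not taken from the actual manifests:

```yaml
# Hypothetical sketch: a second Service for pd's gRPC port, which the
# ingress can then route TLS traffic to, alongside the Tendermint RPC.
apiVersion: v1
kind: Service
metadata:
  name: penumbra-testnet-pd-grpc   # hypothetical name
  annotations:
    # On GKE, tell the load balancer to speak HTTP/2 to the backend,
    # which gRPC requires end to end.
    cloud.google.com/app-protocols: '{"grpc": "HTTP2"}'
spec:
  selector:
    app: penumbra-testnet-fn       # hypothetical label
  ports:
    - name: grpc
      port: 8080
      targetPort: 8080
```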

conorsch commented 1 year ago

Took a look at what's required here for cut-over. We currently run two discrete testnets: testnet.penumbra.zone, redeployed on each tagged release, and testnet-preview.penumbra.zone, redeployed from the current state of main.

Right now, the k8s deployment logic assumes there's only one testnet, and it destructively resets on updates. That's already a great match for how we manage testnet-preview, but we want to do both on k8s. I'll work on adding a few more knobs to the new deployment logic, so we can set HELM_RELEASE or similar and touch only the proper set of testnet resources during CI runs.
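A sketch of the kind of knob I have in mind (the variable name, chart path, and label are hypothetical, and the commands are echoed rather than executed):

```shell
# Hypothetical: scope every helm/kubectl invocation to one release name, so a
# preview deploy can never clobber the main testnet's resources.
HELM_RELEASE="${HELM_RELEASE:-penumbra-testnet-preview}"

# Echoed for illustration; a real CI run would execute these directly.
echo "helm upgrade --install $HELM_RELEASE ./helm/penumbra"
echo "kubectl delete pods -l app.kubernetes.io/instance=$HELM_RELEASE"
```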

conorsch commented 1 year ago

WIP branch coming together at https://github.com/penumbra-zone/penumbra/tree/1659-testnet-preview-via-k8s. Mostly that diff adds comments and docs, and refactors the test scripts to make more room for multiple environments. I haven't created a separate cluster, but the Terraform logic to do so is already present. Currently I'm working through failing health checks on the ingress backends:

```
kubectl get ingress penumbra-testnet-ingress -o json | jq '.metadata.annotations["ingress.kubernetes.io/backends"]' -r -C
{"k8s-be-30563--0a5eab405c618cec":"HEALTHY","k8s1-0a5eab40-default-penumbra-testnet-26657-ed3f3817":"UNHEALTHY","k8s1-0a5eab40-default-penumbra-testnet-8080-5680cfee":"UNHEALTHY"}
```
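A quick way to turn that annotation into a pass/fail signal for scripting, using the JSON above as sample input (grep instead of jq, just to keep the sketch dependency-free):

```shell
# Sample backend-health annotation, copied from the ingress output above.
BACKENDS='{"k8s-be-30563--0a5eab405c618cec":"HEALTHY","k8s1-0a5eab40-default-penumbra-testnet-26657-ed3f3817":"UNHEALTHY","k8s1-0a5eab40-default-penumbra-testnet-8080-5680cfee":"UNHEALTHY"}'

# Count unhealthy backends; a deploy gate could fail when this is nonzero.
UNHEALTHY=$(printf '%s' "$BACKENDS" | grep -o 'UNHEALTHY' | wc -l)
echo "$UNHEALTHY unhealthy backends"
```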

Once those problems are resolved, I'll move on to creating side-by-side environments, and touch up the script as necessary to make sure that subsequent deployments don't clobber unwanted resources.

conorsch commented 1 year ago

Cluster config for "testnet" setup is solid, will PR in some housekeeping changes with more docs, comments, and labels throughout. Encountered a problem when I tried to deploy "testnet-preview":

```
Error syncing to GCP: error running load balancer syncing routine: loadbalancer 86iwvh2x-default-penumbra-testnet-preview-ingr-tbg5eif4 does not exist: googleapi: Error 403: QUOTA_EXCEEDED - Quota 'IN_USE_ADDRESSES' exceeded. Limit: 8.0 globally.
```

So it appears we've exhausted our account limit on global reserved IPs. I'll see if we can raise that limit, but more likely we'll need to switch to a lower tier of reserved IP to sidestep that limit.

conorsch commented 1 year ago

Looked into the IP quota issue. There are actually two limits in play: STATIC_ADDRESSES (limit of 8) and IN_USE_ADDRESSES (also limit of 8). We're already careful to reserve only one static address per testnet, so it's not the STATIC_ADDRESSES limit we're hitting. Rather, it's IN_USE_ADDRESSES, since all external IPs in use by Services count toward that total, regardless of whether they're static. I've requested a quota bump (as we've had to do for other resource types, such as persistent storage, back in 2022-09), which should unblock us. They promise a response in <2d, but I expect more like <2h. :crossed_fingers:

For posterity, this was useful for getting a picture of the limits in play:

```
gcloud compute project-info describe | grep -A1 -B1 ADDR
```
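For scripting the same check, something like this works on the quota entries. The sample below is hand-written to mimic the shape of `project-info describe` output, so treat the exact field layout as an assumption:

```shell
# Hand-written sample mimicking the quotas list from
# `gcloud compute project-info describe` (YAML shape assumed).
QUOTAS='- limit: 8.0
  metric: STATIC_ADDRESSES
  usage: 6.0
- limit: 8.0
  metric: IN_USE_ADDRESSES
  usage: 8.0'

# Print any metric whose usage has reached its limit.
printf '%s\n' "$QUOTAS" | awk '
  /limit:/  { limit = $NF }
  /metric:/ { metric = $NF }
  /usage:/  { if ($NF + 0 >= limit + 0) print metric " exhausted (" $NF "/" limit ")" }'
```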

conorsch commented 1 year ago

> They promise response in <2d, but I expect more like <2h.

OK, it was actually <2m:

[screenshot: gcloud-fast-quota-bump]

:upside_down_face:

conorsch commented 1 year ago

Comparing the two testnet deployments for disparities, it looks like node_info.other.tx_index is set to "null" in k8s but "kv" in the testnet config template. Looks like maybe we want to set that to "kv".
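If so, it's a one-line change to Tendermint's config.toml. A sketch, run here against a temp file rather than a live node's config (the `[tx_index]` section and `indexer` key are standard Tendermint config, but verify against the deployed version):

```shell
# Demo on a temp file; on a node this would be ~/.tendermint/config/config.toml.
CONFIG="$(mktemp)"
printf '[tx_index]\nindexer = "null"\n' > "$CONFIG"

# Flip the indexer from "null" (disabled) to "kv", matching our template.
sed -i 's/^indexer = .*/indexer = "kv"/' "$CONFIG"
grep '^indexer' "$CONFIG"
```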

conorsch commented 1 year ago

> uses the latest container images (does it need to wait for them to be built?)

Not yet implemented on https://github.com/penumbra-zone/penumbra/pull/1719; relevant docs are here https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_run

conorsch commented 1 year ago

Currently working on sorting out the implicit workflow dependencies, and making them explicit. For instance, do we want to build container images if the tests fail? We do not! However, that's currently how things work: the container images get built regardless of the state of other workflows.

Similarly, we must strictly order the workflows so that 1) tests pass; then 2) container image is built; then 3) a deploy is made to the relevant environment. GitHub Actions will allow us to chain up to a maximum of three (3) workflows:

> You can't use workflow_run to chain together more than three levels of workflows.

Other potential footguns: by default, a failed dependency workflow will still trigger execution of the dependent workflow, so we have to manually inspect the previous run's conclusion, which I still find surprising. Additionally, it may not be possible to inspect whether a dependency workflow was triggered by a tag or by a branch change, which matters for us because that's how we gate testnet vs testnet-preview deploys. In the short term, I may opt to copy/paste several workflows and embed them as jobs, to get more fine-grained control over trigger events.
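The chaining itself looks roughly like this in a deploy workflow (a sketch with hypothetical workflow, job, and script names; the `workflow_run` trigger and the `conclusion` gate are from the GitHub Actions docs linked above):

```yaml
# Hypothetical deploy workflow: run only after the image-build workflow
# finishes, and only if it actually succeeded.
name: deploy-testnet-preview
on:
  workflow_run:
    workflows: ["build-container-image"]   # illustrative upstream workflow name
    types: [completed]
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    # workflow_run fires even when the upstream run failed, so gate on its
    # conclusion explicitly.
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - uses: actions/checkout@v3
      - run: ./deploy/ci.sh   # hypothetical deploy entrypoint
```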

Next testnet is due Monday, 2022-12-12, and I'd very much like to use the new setup. Today, 2022-12-08, I plan to cut over testnet-preview.penumbra.zone as, ahem, a "preview" of what's to come.

conorsch commented 1 year ago

> Today, 2022-12-08, I plan to cut over testnet-preview.penumbra.zone

This is done: testnet-preview.penumbra.zone now points to the new k8s deployment. Post-merge it was automatically updated.

terminal output monitoring rollout, for those interested:

```
NAME                                   READY   STATUS        RESTARTS   AGE
penumbra-testnet-fn-0-fvkfk            3/3     Running       0          25h
penumbra-testnet-preview-fn-0-j2hsp    3/3     Terminating   0          24h
penumbra-testnet-preview-val-0-7kghg   2/2     Terminating   0          24h
penumbra-testnet-preview-val-1-pjll9   2/2     Terminating   0          24h
penumbra-testnet-val-0-7fsq2           2/2     Running       0          25h
penumbra-testnet-val-1-lkvcj           2/2     Running       0          25h

❯ kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
penumbra-testnet-fn-0-fvkfk            3/3     Running   0          25h
penumbra-testnet-preview-fn-0-q7s8f    0/3     Pending   0          3s
penumbra-testnet-preview-val-0-qn5gc   0/2     Pending   0          3s
penumbra-testnet-preview-val-1-48mbn   0/2     Pending   0          3s
penumbra-testnet-val-0-7fsq2           2/2     Running   0          25h
penumbra-testnet-val-1-lkvcj           2/2     Running   0          25h

❯ kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
penumbra-testnet-fn-0-fvkfk            3/3     Running   0          25h
penumbra-testnet-preview-fn-0-q7s8f    3/3     Running   0          11m
penumbra-testnet-preview-val-0-qn5gc   2/2     Running   0          11m
penumbra-testnet-preview-val-1-48mbn   2/2     Running   0          11m
penumbra-testnet-val-0-7fsq2           2/2     Running   0          25h
penumbra-testnet-val-1-lkvcj           2/2     Running   0          25h
```

Still more work to do on the workflow dependencies for Monday's deployment; I'll pick that back up tomorrow.

conorsch commented 1 year ago

Calling this done for now. Here's a recent automatic deploy of testnet-preview to the k8s cluster: https://github.com/penumbra-zone/penumbra/actions/runs/3661278131. Come Monday, we'll need to update the A record for testnet.penumbra.zone to point at the relevant IP:

```
❯ terraform output
testnet_preview_reserved_ip = "34.117.153.161" # already done
testnet_reserved_ip = "34.111.241.130" # this one still needs to be updated
```

We'll do that as part of the testnet deploy; I've already lowered the TTL from 30m to 5m in prep for the cut-over.

conorsch commented 1 year ago

Was not able to use the new cluster setup for testnet 038 today (#1743). In the interest of :shipit:, I fell back to reusing the legacy droplet, and configured pd and tendermint manually based on the 038 code. It was necessary to stand up the services "manually," because the deprecated workflows were removed in #1730; a bit prematurely, in retrospect.

The root cause of the botched cluster deployment was my oversight last week of mistakenly deploying the testnet tag to the preview environment (#1744). This was fixed this morning in https://github.com/penumbra-zone/penumbra/commit/17a3267f63e1765d40cd5d3d071292b8bd4f7fbe, but the late discovery of the misconfiguration means we did not have an adequate "preview" environment in which to observe the most recent cluster config, and I suspect we missed some recent breaking changes as a result.

As a result, the current state of our deployments is a bit brittle right now.

Starting tomorrow, I'll focus on unbreaking testnet-preview, since that's our canary in the coal mine. Once preview is happy again, I'll resume deploys of testnet-on-k8s, and provide updates here.

conorsch commented 1 year ago

This is done: testnet-preview is now served via k8s, and has been since 2022-12-12, via 5b42c45d3440b5d56723f392bded8e8cd7fa12cd. I'll open another issue tracking the transition of testnet (cf. testnet-preview) to k8s.