Closed: hdevalence closed this issue 1 year ago
The current k8s deployment provides TLS access to the Tendermint RPC endpoint (load-balancing over fullnodes). We should provide an additional endpoint that gives TLS access to the pd gRPC endpoint.

We are not in a position to use a TLS endpoint from pcli, for somewhat boring reasons (we hardcode "http" in a bunch of places, and assume that a single host serves both tendermint and pd, with endpoints on different ports). But exposing a TLS pd endpoint is important to do now, because we're trying to use grpc-web to access it, and without TLS that's not really possible, due to mixed-content rules.
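As a rough sketch of what the extra endpoint could look like on the k8s side: a Service for pd's gRPC port (8080, per the existing backends) fronted by a TLS-terminating Ingress. The hostname, labels, and TLS secret below are illustrative assumptions, not the actual manifests.

```yaml
# Sketch only: expose pd's gRPC port behind a TLS-terminating Ingress.
# All names here are hypothetical; the real charts may differ.
apiVersion: v1
kind: Service
metadata:
  name: penumbra-testnet-pd
  annotations:
    # GKE: speak HTTP/2 to the backend so gRPC works end-to-end.
    cloud.google.com/app-protocols: '{"grpc":"HTTP2"}'
spec:
  selector:
    app: penumbra-testnet
  ports:
    - name: grpc
      port: 8080
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: penumbra-testnet-pd-ingress
spec:
  tls:
    - hosts: ["grpc.testnet.penumbra.zone"]   # hypothetical hostname
      secretName: penumbra-testnet-pd-tls     # hypothetical cert secret
  rules:
    - host: grpc.testnet.penumbra.zone
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: penumbra-testnet-pd
                port:
                  number: 8080
```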
Took a look at what's required here for cut-over. We currently run two discrete testnets: testnet.penumbra.zone and testnet-preview.penumbra.zone.
Right now, the k8s deployment logic assumes there's only one testnet, and it destructively resets on updates. That's already a great match for how we manage testnet-preview, but we want to do both on k8s. I'll work on adding a few more knobs to the new deployment logic, so we can set HELM_RELEASE or similar and touch only the proper set of testnet resources during CI runs.
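As a sketch of the HELM_RELEASE idea (the variable and naming scheme here are assumptions, not the actual deploy script), deriving all resource names from one release variable keeps a testnet-preview CI run from touching testnet resources:

```shell
# Hypothetical sketch: scope derived resource names by HELM_RELEASE,
# defaulting to the primary testnet when unset.
HELM_RELEASE="${HELM_RELEASE:-penumbra-testnet}"
INGRESS_NAME="${HELM_RELEASE}-ingress"
echo "release=${HELM_RELEASE} ingress=${INGRESS_NAME}"
```

A CI run for preview would then export `HELM_RELEASE=penumbra-testnet-preview` before invoking the deploy script.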
WIP branch coming together at https://github.com/penumbra-zone/penumbra/tree/1659-testnet-preview-via-k8s. Mostly that diff is adding comments, docs, and some refactoring of the test scripts to make more space for multiple environments. I haven't created a separate cluster, but the Terraform logic is already present to do so. Currently working on:
* fullnode.* A record (DNS records are managed out of band) that relates to the NodePort added in https://github.com/penumbra-zone/penumbra/pull/1660

Checking the ingress backend health:

```
kubectl get ingress penumbra-testnet-ingress -o json | jq '.metadata.annotations["ingress.kubernetes.io/backends"]' -r -C
```

```
{"k8s-be-30563--0a5eab405c618cec":"HEALTHY","k8s1-0a5eab40-default-penumbra-testnet-26657-ed3f3817":"UNHEALTHY","k8s1-0a5eab40-default-penumbra-testnet-8080-5680cfee":"UNHEALTHY"}
```
Once those problems are resolved, I'll move on to creating side-by-side environments, and touch up the script as necessary to make sure that subsequent deployments don't clobber unwanted resources.
Cluster config for "testnet" setup is solid, will PR in some housekeeping changes with more docs, comments, and labels throughout. Encountered a problem when I tried to deploy "testnet-preview":
```
Error syncing to GCP: error running load balancer syncing routine: loadbalancer 86iwvh2x-default-penumbra-testnet-preview-ingr-tbg5eif4 does not exist: googleapi: Error 403: QUOTA_EXCEEDED - Quota 'IN_USE_ADDRESSES' exceeded. Limit: 8.0 globally.
```
So it appears we've exhausted our account limit on global reserved IPs. I'll see if we can raise that limit, but more likely we'll need to switch to a lower tier of reserved IP to sidestep that limit.
Looked into the IP quota issue. There are actually two limits in play: STATIC_ADDRESSES (limit of 8) and IN_USE_ADDRESSES (also limit of 8). We're already careful to reserve only one static address per testnet, so it's not the STATIC_ADDRESSES limit we're hitting. Rather, it's IN_USE_ADDRESSES, since all external IPs in use by Services count toward that total, regardless of whether they're static or not. Requested a quota bump (as we've had to do for other resource types, such as persistent storage, back in 2022-09), which should unblock. They promise a response in <2d, but I expect more like <2h. :crossed_fingers:
For posterity, `gcloud compute project-info describe | grep -A1 -B1 ADDR` was useful for getting a picture of the limits in play.
> They promise a response in <2d, but I expect more like <2h.

OK, it was actually <2m. :upside_down_face:
Comparing the two testnet deployments for disparities, it looks like node_info.other.tx_index is set to "null" in k8s but "kv" in the testnet config template. Looks like we want to set that to "kv".
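For reference, the relevant knob lives in Tendermint's config.toml; this is a sketch of the stock 0.34-era setting, not our actual template:

```toml
# Tendermint config.toml fragment (sketch): "kv" enables the key-value
# transaction indexer; "null" disables tx indexing entirely, which is
# what node_info.other.tx_index was reporting on the k8s nodes.
[tx_index]
indexer = "kv"
```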
> uses the latest container images (does it need to wait for them to be built?)

Not yet implemented in https://github.com/penumbra-zone/penumbra/pull/1719; relevant docs are here: https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_run
Currently working on sorting out the implicit workflow dependencies, and making them explicit. For instance, do we want to build container images if the tests fail? We do not! However, that's currently how things work: the container images get built regardless of the state of other workflows.
Similarly, we must strictly order the workflows so that 1) tests pass; then 2) container image is built; then 3) a deploy is made to the relevant environment. GitHub Actions will allow us to chain up to a maximum of three (3) workflows:
> You can't use `workflow_run` to chain together more than three levels of workflows.
Other potential footguns: by default, a failed dependency workflow will still trigger execution of the dependent workflow, so we need to manually inspect whether the previous run failed, which I still find surprising. Additionally, it may not be possible to inspect whether a dependency workflow was triggered by a tag or a branch change (important for us, because that's how we gate testnet vs. testnet-preview deploys). In the short term, I may opt to copy/paste several workflows and embed them as jobs, to take advantage of more fine-grained control of trigger events.
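The gating described above can be sketched as a workflow_run-triggered workflow that checks the upstream run's recorded conclusion explicitly (workflow and job names here are assumptions):

```yaml
# Sketch: run the deploy only after the image-build workflow completes,
# and bail out unless that run actually succeeded. Without the `if`
# guard, a failed upstream run would still trigger this workflow.
name: deploy-testnet-preview
on:
  workflow_run:
    workflows: ["Build container image"]   # hypothetical upstream name
    types: [completed]
jobs:
  deploy:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploy testnet-preview here"
```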
Next testnet is due Monday, 2022-12-12, and I'd very much like to use the new setup. Today, 2022-12-08, I plan to cut over testnet-preview.penumbra.zone as, ahem, a "preview" of what's to come.
> Today, 2022-12-08, I plan to cut over testnet-preview.penumbra.zone
This is done: testnet-preview.penumbra.zone now points to the new k8s deployment. Post-merge it was automatically updated.
Still more work to do on the workflow dependencies for Monday's deployment; I'll pick that back up tomorrow.
Calling this done for now. Here's a recent automatic deploy of testnet-preview to the k8s cluster: https://github.com/penumbra-zone/penumbra/actions/runs/3661278131. Come Monday, we'll need to update the A record for testnet.penumbra.zone to point at the relevant IP:

```
❯ terraform output
testnet_preview_reserved_ip = "34.117.153.161" # already done
testnet_reserved_ip = "34.111.241.130" # this one still needs to be updated
```
We'll do that as part of the testnet deploy. Already lowered the TTL 30m -> 5m in prep for the cut-over.
Was not able to use the new cluster setup for testnet 038 today (#1743). In the interest of :shipit:, I fell back to reusing the legacy droplet, and configured pd and tendermint manually based on the 038 code. It was necessary to stand up the services "manually" because the deprecated workflows were removed in #1730; a bit prematurely, in retrospect.
The root cause of the botched cluster deployment was my mistake last week of deploying the testnet tag to the preview environment (#1744). This was fixed this morning in https://github.com/penumbra-zone/penumbra/commit/17a3267f63e1765d40cd5d3d071292b8bd4f7fbe, but the late discovery of the misconfiguration meant we did not have an adequate "preview" environment in which to observe the most recent cluster config. As such, I suspect we missed identifying some recent breaking changes.
As a result, the current state of our deployments is a bit brittle right now. To wit:
Starting tomorrow, I'll focus on unbreaking testnet-preview, since that's our canary in the coal mine. Once preview is happy again, I'll resume deploys of testnet-on-k8s, and provide updates here.
This is done: testnet-preview is now served via k8s, and has been since 2022-12-12, via 5b42c45d3440b5d56723f392bded8e8cd7fa12cd. I'll open another issue tracking the transition of testnet (cf. testnet-preview) to k8s.
**Is your feature request related to a problem? Please describe.**
We should try to move over to the new k8s deployment system built by Strangelove, and start with replacing testnet-preview. The goal of testnet-preview is that it should be an exact preview of what would be deployed if the current state of the main branch were tagged as a release. This ensures that there are no deployment surprises when tagging a release, and allows testing client protocols against the current state of the main branch.

The only difference between testnet-preview and testnet should be that when deploying testnet, we pass the --preserve-chain-id parameter to pd testnet generate to avoid randomizing the chain ID (since there should only be one deployment per tag).

**Describe the solution you'd like**
* deploys main, and uses the latest container images (does it need to wait for them to be built?) - https://github.com/penumbra-zone/penumbra/pull/1730/status
* endpoints between e.g. http://testnet-preview.penumbra.zone:26657/status & http://fullnode.testnet-preview.penumbra.zone:26657/status