sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

deploy-sourcegraph: restricted integration test fails with Kubernetes 1.16+ #14728

Closed bobheadxi closed 3 years ago

bobheadxi commented 3 years ago

Builds for https://github.com/sourcegraph/deploy-sourcegraph/pull/1067 (e.g. https://buildkite.com/sourcegraph/deploy-sourcegraph/builds/4557) and other deploy-sourcegraph builds were failing for a variety of reasons. First:

ERROR: (gcloud.container.clusters.delete) Some requests did not succeed:
 - args: ["Operation [<Operation\n clusterConditions: [<StatusCondition\n message: 'Failed to delete cluster'>]\n detail: 'Failed to delete cluster'\n endTime: '2020-10-08T16:03:27.65647411Z'\n name: 'operation-1602172835577-6227ef6b'\n nodepoolConditions: []\n operationType: OperationTypeValueValuesEnum(DELETE_CLUSTER, 2)\n selfLink: 'https://container.googleapis.com/v1/projects/232313352803/zones/us-central1-a/operations/operation-1602172835577-6227ef6b'\n startTime: '2020-10-08T16:00:35.577431094Z'\n status: StatusValueValuesEnum(DONE, 3)\n statusMessage: 'Failed to delete cluster'\n targetLink: 'https://container.googleapis.com/v1/projects/232313352803/zones/us-central1-a/clusters/ds-test-restricted-202943b6'\n zone: 'us-central1-a'>] finished with error: Failed to delete cluster"]
   exit_code: 1

Then, a few days later:

+ kubectl create role -n ns-sourcegraph nonroot:unprivileged --verb=use --resource=podsecuritypolicy --resource-name=nonroot-policy
error: can not perform 'use' on 'podsecuritypolicies' in group 'policy'
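For reference, the Role that the failing command tries to create can also be written as a manifest. The sketch below is not necessarily what the eventual fix did: `render_role` is a hypothetical helper, and the `policy` API group is taken from the error message. Applying a manifest with `kubectl apply` should sidestep the client-side check in `kubectl create role` that prints this error, though whether that suits this test is an assumption.

```shell
#!/bin/sh
# Values copied from the failing command in the build log.
NAMESPACE="ns-sourcegraph"
ROLE_NAME="nonroot:unprivileged"
POLICY_NAME="nonroot-policy"

# render_role is a hypothetical helper: it only prints the manifest,
# it does not contact a cluster.
render_role() {
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${ROLE_NAME}
  namespace: ${NAMESPACE}
rules:
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  resourceNames: ["${POLICY_NAME}"]
  verbs: ["use"]
EOF
}

render_role
# To apply against a real cluster (not run here):
#   render_role | kubectl apply -f -
```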

The latter error occurs early in the test and replaced the former, so I didn't investigate the first error further. I traced the second error to the Kubernetes version used by the test changing to 1.16+ (versions were not pinned), but I was unable to resolve it. It is possibly related to:

A workaround to unblock the 3.21 release was introduced in https://github.com/sourcegraph/deploy-sourcegraph/pull/1068 by simply pinning the Kubernetes version to the one used in the last passing build, 1.15.12-gke.20.
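As a rough illustration of what pinning looks like, here is a minimal sketch using the `--cluster-version` flag of `gcloud container clusters create`. The cluster name and the `create_cluster_cmd` helper are hypothetical, the zone is the one shown in the build log, and the actual change in the PR may be structured differently. The helper only prints the command, so it can be inspected without touching GKE.

```shell
#!/bin/sh
# Version from the last passing build, per the workaround described above.
K8S_VERSION="1.15.12-gke.20"
ZONE="us-central1-a"                        # zone that appears in the build log
CLUSTER_NAME="ds-test-restricted-example"   # hypothetical cluster name

# create_cluster_cmd is a hypothetical helper that prints the pinned
# cluster-creation command instead of executing it.
create_cluster_cmd() {
  printf '%s\n' "gcloud container clusters create ${CLUSTER_NAME} --zone ${ZONE} --cluster-version ${K8S_VERSION}"
}

create_cluster_cmd
```

Keeping the version in a single variable means the eventual upgrade is a one-line change.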

This is probably not a long-term solution since we will likely need to upgrade eventually, so follow-up items include:

Stretch goals might include running the restricted test across multiple Kubernetes versions via Pulumi, the same way the "fresh" test does.

bobheadxi commented 3 years ago

@uwedeportivo assigning you for now since you might have some context given it seems you wrote the handbook page on it (https://github.com/sourcegraph/about/pull/534), but let me know if that's not the case!

daxmc99 commented 3 years ago

Fixed here: https://github.com/sourcegraph/deploy-sourcegraph/commit/11f118af7708d342f73825ae7cf8dfa005e95614#diff-15e0ca283574f31c001f74562a52a0108305e3df7e86c9181a21b8401b6a9273R53-R54

But we should evaluate a CI solution that doesn't require spinning up a GKE cluster to test. Our CI is a bit brittle on deploy-sourcegraph right now.