Closed by @mcwumbly 1 year ago
cc @vmware-tanzu/tce-releng
We tried something that looks like it might work: https://github.com/tylerschultz/tce/tree/e2e-external-dns-ci-gh-action
We also pushed that directly to this repo for testing the action here: https://github.com/vmware-tanzu/tce/tree/dave/delete-me
But it's currently hanging when creating the standalone cluster: https://github.com/vmware-tanzu/tce/runs/3255323470?check_suite_focus=true
cluster state is unchanged 55
cluster control plane is still being initialized, retrying
We have seen that succeed locally in the past, though, so perhaps there is something to sort out with getting it to run on the GitHub Actions runner. (We also see similar, though not exactly the same, failures here: https://github.com/vmware-tanzu/tce/actions/workflows/e2e-tce-docker-standalone-cluster.yaml)
cc @tylerschultz
Today, we spent our time trying to find a faster loop for reproducing the failure in CI:
Our branch has been updated: https://github.com/vmware-tanzu/tce/tree/dave/delete-me
It has the following changes (via shameless hacking):
With those changes, the build time drops from ~35 min to ~5 min, at which point a standalone cluster creation is attempted (and we expect to see it hang).
Here's the latest run: https://github.com/vmware-tanzu/tce/runs/3267160114?check_suite_focus=true
The CAPD logs show that the `kubeadm init` command fails on the cluster being created, though it's not clear why:
...
Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[... the two kubelet-check lines above repeat four more times ...]

    Unfortunately, an error has occurred:
        timed out waiting for the condition

    This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

    If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'

    Additionally, a control plane component may have crashed or exited when started by the container runtime.
    To troubleshoot, list all containers using your preferred container runtimes CLI.

    Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
        - 'crictl --runtime-endpoint /var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint /var/run/containerd/containerd.sock logs CONTAINERID'
(visit the workflow-run link above for more complete logs)
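For reference, the failing `[kubelet-check]` lines above are just polling the kubelet's local healthz endpoint. A small sketch of doing the same check by hand from inside the node container (`check_kubelet` is a hypothetical helper, not part of kubeadm or TCE):

```shell
#!/usr/bin/env bash
# Poll the kubelet healthz endpoint the same way kubeadm's kubelet-check
# does. Prints the endpoint body ("ok" when healthy) or a diagnostic when
# the connection is refused, as in the logs above.
check_kubelet() {
  local host="${1:-localhost}"
  curl -sSL --max-time 5 "http://${host}:10248/healthz" \
    || echo "kubelet not healthy on ${host}:10248"
}
```

Running this on the failing control-plane container should print the diagnostic line, matching the `connection refused` errors kubeadm reports.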
cc @karuppiah7890 as this may be helpful for comparing notes and debugging the other e2e tests in github actions as well...
Thanks a lot @mcwumbly!! I usually end up not building multiple times and instead use a stable release. Today I decided that, to be able to use the latest release, I'll just build once, host it in my fork repo, use https://github.com/gruntwork-io/fetch/ to fetch it, and then install it using install.sh. The artifact download did end up taking some time on every run because it was around 165MB.
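A sketch of that "build once, host in a fork, fetch in CI" approach using gruntwork-io/fetch. The repo, tag, and asset names are taken from the release linked below; `release_url` and `install_tce` are hypothetical helpers, and the exact layout inside the tarball is an assumption:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Coordinates of the prebuilt release asset (from the fork release below).
REPO_URL="https://github.com/karuppiah7890/tce"
TAG="v0.7.0-rc.1-karuppiah"
ASSET="tce-linux-amd64-v0.7.0-dev.2-karuppiah.tar.gz"

# Direct download URL for the asset, following GitHub's release URL scheme.
release_url() {
  echo "${REPO_URL}/releases/download/${TAG}/${ASSET}"
}

# Fetch the asset with gruntwork-io/fetch, unpack it, and run the bundled
# installer (assumes the tarball ships an install.sh at its top level).
install_tce() {
  local dest="${1:-/tmp/tce-release}"
  fetch --repo="$REPO_URL" --tag="$TAG" --release-asset="$ASSET" "$dest"
  tar -xzf "${dest}/${ASSET}" -C "$dest"
  (cd "$dest"/tce-linux-amd64-* && ./install.sh)
}
```

This skips the ~35 min build in every CI run at the cost of the ~165MB download mentioned above.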
I also created a debug script some weeks ago, it's here - https://github.com/karuppiah7890/tce/blob/e2e-docker-ga-trial/test/docker/debug-tce-install.sh
This is very very cool @mcwumbly ! 😁
With the latest Tanzu CLI being used in TCE, I'm not able to just install the Tanzu CLI plus the TCE standalone-cluster plugin. I just tried and it didn't work out. I'm going to dig into this tomorrow or later, I guess. I've noticed it a couple of times now as part of TCE installation issues and set it aside to get to other cluster-creation issues:
could not write file: open /home/runner/.local/share/tanzu-cli/tanzu-plugin-pinniped-auth: not a directory
Used this release - https://github.com/karuppiah7890/tce/releases/tag/v0.7.0-rc.1-karuppiah , https://github.com/karuppiah7890/tce/releases/download/v0.7.0-rc.1-karuppiah/tce-linux-amd64-v0.7.0-dev.2-karuppiah.tar.gz tar ball
I think the error comes up regardless of which Tanzu command is executed. I was trying the `create` and `delete` commands that are in the `standalone-cluster` plugin, and then the `tanzu plugin repo add` command.
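The "not a directory" error above usually means some component of the target path exists as a regular file rather than a directory. A quick diagnostic sketch for the path in that error (`explain_path` is a hypothetical helper):

```shell
#!/usr/bin/env bash
# Classify a filesystem path: a proper directory, something that exists but
# is not a directory (the likely cause of "not a directory" errors when a
# tool tries to write files under it), or missing entirely.
explain_path() {
  local p="$1"
  if [ -d "$p" ]; then
    echo "directory"
  elif [ -e "$p" ]; then
    echo "exists but is not a directory"
  else
    echo "missing"
  fi
}

# The path from the error message above:
explain_path "$HOME/.local/share/tanzu-cli"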
@karuppiah7890 - have you ever been able to get a standalone cluster to successfully create in a GitHub actions runner? Is that something that used to work and may have broken at some point? Or is it something we have yet to get working?
It has worked before, but I think it was a long time ago. Given it's an E2E test and there are multiple components involved (providers, and the k8s cluster itself with lots of components), many things have gone wrong over time.
Here's a green pipeline running in my fork - https://github.com/karuppiah7890/tce/runs/3044224661?check_suite_focus=true . There are a few greens in the TCE repo too.
But I must admit that the number of reds is far higher than the greens. Also, we now run the AWS E2E tests on every commit, and it's all red all over the place 🙈 I'm wondering if it could be made a nightly until all the issues and errors are resolved and it's stable enough.
Actually, the very few greens that were present in the actions list are gone now because of a change to the workflow config YAML file name, so as of now you cannot find a single green pipeline in the main TCE repo. GitHub Actions uses the workflow config YAML file name in the URL and uses that to track pipelines :/ Like this: https://github.com/vmware-tanzu/tce/actions/workflows/e2e-all-tests.yaml?query=is%3Asuccess
Now fetching the kubelet logs on the control plane node, I see it crashing repeatedly with this error:
2021-08-09T15:04:07.7581488Z Aug 09 15:03:23 my-cluster-control-plane-bl75w kubelet[682]: E0809 15:03:23.572437 682 node_container_manager_linux.go:57] "Failed to create cgroup" err="Unit kubepods.slice already exists." cgroupName=[kubepods]
2021-08-09T15:04:07.7585006Z Aug 09 15:03:23 my-cluster-control-plane-bl75w kubelet[682]: E0809 15:03:23.572572 682 kubelet.go:1384] "Failed to start ContainerManager" err="Unit kubepods.slice already exists."
2021-08-09T15:04:07.7587293Z Aug 09 15:03:23 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
2021-08-09T15:04:07.7589085Z Aug 09 15:03:23 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Failed with result 'exit-code'.
2021-08-09T15:04:07.7590831Z Aug 09 15:03:24 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 13.
longer snippet:
2021-08-09T15:04:07.7579296Z Aug 09 15:03:23 my-cluster-control-plane-bl75w kubelet[682]: I0809 15:03:23.566943 682 policy_none.go:44] "None policy: Start"
2021-08-09T15:04:07.7581488Z Aug 09 15:03:23 my-cluster-control-plane-bl75w kubelet[682]: E0809 15:03:23.572437 682 node_container_manager_linux.go:57] "Failed to create cgroup" err="Unit kubepods.slice already exists." cgroupName=[kubepods]
2021-08-09T15:04:07.7585006Z Aug 09 15:03:23 my-cluster-control-plane-bl75w kubelet[682]: E0809 15:03:23.572572 682 kubelet.go:1384] "Failed to start ContainerManager" err="Unit kubepods.slice already exists."
2021-08-09T15:04:07.7587293Z Aug 09 15:03:23 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
2021-08-09T15:04:07.7589085Z Aug 09 15:03:23 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Failed with result 'exit-code'.
2021-08-09T15:04:07.7590831Z Aug 09 15:03:24 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 13.
2021-08-09T15:04:07.7592505Z Aug 09 15:03:24 my-cluster-control-plane-bl75w systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
2021-08-09T15:04:07.7594074Z Aug 09 15:03:24 my-cluster-control-plane-bl75w systemd[1]: Started kubelet: The Kubernetes Node Agent.
2021-08-09T15:04:07.7596753Z Aug 09 15:03:24 my-cluster-control-plane-bl75w kubelet[731]: Flag --eviction-hard has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2021-08-09T15:04:07.7600625Z Aug 09 15:03:24 my-cluster-control-plane-bl75w kubelet[731]: Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2021-08-09T15:04:07.7604106Z Aug 09 15:03:24 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:24.980591 731 server.go:197] "Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead"
2021-08-09T15:04:07.7607277Z Aug 09 15:03:24 my-cluster-control-plane-bl75w kubelet[731]: Flag --eviction-hard has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2021-08-09T15:04:07.7610817Z Aug 09 15:03:24 my-cluster-control-plane-bl75w kubelet[731]: Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2021-08-09T15:04:07.7613907Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.008518 731 server.go:440] "Kubelet version" kubeletVersion="v1.21.2+vmware.1-360497810732255795"
2021-08-09T15:04:07.7615857Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.008900 731 server.go:851] "Client rotation is on, will bootstrap in background"
2021-08-09T15:04:07.7617960Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.011510 731 certificate_store.go:130] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
2021-08-09T15:04:07.7620522Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.013206 731 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
2021-08-09T15:04:07.7623773Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.168871 731 server.go:660] "--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /"
2021-08-09T15:04:07.7626014Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169137 731 container_manager_linux.go:278] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
2021-08-09T15:04:07.7634212Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169206 731 container_manager_linux.go:283] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:remote CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:systemd KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
2021-08-09T15:04:07.7643805Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169240 731 topology_manager.go:120] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
2021-08-09T15:04:07.7646156Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169255 731 container_manager_linux.go:314] "Initializing Topology Manager" policy="none" scope="container"
2021-08-09T15:04:07.7648309Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169263 731 container_manager_linux.go:319] "Creating device plugin manager" devicePluginEnabled=true
2021-08-09T15:04:07.7651101Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169348 731 util_unix.go:103] "Using this format as endpoint is deprecated, please consider using full url format." deprecatedFormat="/var/run/containerd/containerd.sock" fullURLFormat="unix:///var/run/containerd/containerd.sock"
2021-08-09T15:04:07.7653459Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169420 731 remote_runtime.go:62] parsed scheme: ""
2021-08-09T15:04:07.7655244Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169428 731 remote_runtime.go:62] scheme "" not registered, fallback to default scheme
2021-08-09T15:04:07.7657446Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169458 731 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}
2021-08-09T15:04:07.7659553Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169467 731 clientconn.go:948] ClientConn switching balancer to "pick_first"
2021-08-09T15:04:07.7662388Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169501 731 util_unix.go:103] "Using this format as endpoint is deprecated, please consider using full url format." deprecatedFormat="/var/run/containerd/containerd.sock" fullURLFormat="unix:///var/run/containerd/containerd.sock"
2021-08-09T15:04:07.7665256Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169513 731 remote_image.go:50] parsed scheme: ""
2021-08-09T15:04:07.7667159Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169519 731 remote_image.go:50] scheme "" not registered, fallback to default scheme
2021-08-09T15:04:07.7669350Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169532 731 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}
2021-08-09T15:04:07.7671467Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169540 731 clientconn.go:948] ClientConn switching balancer to "pick_first"
2021-08-09T15:04:07.7674064Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169601 731 kubelet.go:404] "Attempting to sync node with API server"
2021-08-09T15:04:07.7676057Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169623 731 kubelet.go:272] "Adding static pod path" path="/etc/kubernetes/manifests"
2021-08-09T15:04:07.7678239Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169644 731 kubelet.go:283] "Adding apiserver pod source"
2021-08-09T15:04:07.7680680Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169656 731 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
2021-08-09T15:04:07.7683340Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169811 731 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
2021-08-09T15:04:07.7686902Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.184351 731 kuberuntime_manager.go:222] "Container runtime initialized" containerRuntime="containerd" version="v1.3.3-14-g449e9269" apiVersion="v1alpha2"
2021-08-09T15:04:07.7689891Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.374273 731 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
2021-08-09T15:04:07.7692285Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: For verbose messaging see aws.Config.CredentialsChainVerboseErrors
2021-08-09T15:04:07.7694261Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.375137 731 server.go:1190] "Started kubelet"
2021-08-09T15:04:07.7697627Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.376990 731 cri_stats_provider.go:369] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
2021-08-09T15:04:07.7700792Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.377110 731 kubelet.go:1306] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
2021-08-09T15:04:07.7713002Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.377612 731 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"my-cluster-control-plane-bl75w.1699ab970480c17a", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"my-cluster-control-plane-bl75w", UID:"my-cluster-control-plane-bl75w", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"my-cluster-control-plane-bl75w"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc03c6ecf565b9f7a, ext:587487459, loc:(*time.Location)(0x74bc600)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc03c6ecf565b9f7a, ext:587487459, loc:(*time.Location)(0x74bc600)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://172.18.0.3:6443/api/v1/namespaces/default/events": EOF'(may retry after sleeping)
2021-08-09T15:04:07.7722145Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.388215 731 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
2021-08-09T15:04:07.7723967Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.388441 731 server.go:149] "Starting to listen" address="0.0.0.0" port=10250
2021-08-09T15:04:07.7725680Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.395151 731 server.go:405] "Adding debug handlers to kubelet server"
2021-08-09T15:04:07.7727431Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.401007 731 volume_manager.go:271] "Starting Kubelet Volume Manager"
2021-08-09T15:04:07.7729490Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.403479 731 desired_state_of_world_populator.go:141] "Desired state populator starts to run"
2021-08-09T15:04:07.7732934Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.408557 731 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
2021-08-09T15:04:07.7735425Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.408900 731 client.go:86] parsed scheme: "unix"
2021-08-09T15:04:07.7738776Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.408984 731 client.go:86] scheme "unix" not registered, fallback to default scheme
2021-08-09T15:04:07.7742112Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.409214 731 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}
2021-08-09T15:04:07.7744370Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.409302 731 clientconn.go:948] ClientConn switching balancer to "pick_first"
2021-08-09T15:04:07.7749230Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.471386 731 manager.go:1123] Failed to create existing container: /actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7756690Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.473914 731 manager.go:1123] Failed to create existing container: /actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7761462Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.477843 731 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv4
2021-08-09T15:04:07.7763627Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.504720 731 kubelet.go:2291] "Error getting node" err="node \"my-cluster-control-plane-bl75w\" not found"
2021-08-09T15:04:07.7765766Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.586326 731 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv6
2021-08-09T15:04:07.7767676Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.586572 731 status_manager.go:157] "Starting to sync pod status with apiserver"
2021-08-09T15:04:07.7769424Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.586798 731 kubelet.go:1846] "Starting kubelet main sync loop"
2021-08-09T15:04:07.7771623Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.586930 731 kubelet.go:1870] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
2021-08-09T15:04:07.7774436Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.600832 731 kubelet_node_status.go:71] "Attempting to register node" node="my-cluster-control-plane-bl75w"
2021-08-09T15:04:07.7777125Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.601753 731 kubelet_node_status.go:93] "Unable to register node with API server" err="Post \"https://172.18.0.3:6443/api/v1/nodes\": EOF" node="my-cluster-control-plane-bl75w"
2021-08-09T15:04:07.7779932Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.606489 731 kubelet.go:2291] "Error getting node" err="node \"my-cluster-control-plane-bl75w\" not found"
2021-08-09T15:04:07.7785380Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.659313 731 manager.go:1123] Failed to create existing container: /actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7793848Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.660726 731 manager.go:1123] Failed to create existing container: /actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7801707Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: W0809 15:03:25.682751 731 manager.go:1176] Failed to process watch event {EventType:0 Name:/actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13 WatchSource:0}: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7810388Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.687644 731 kubelet.go:1870] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
2021-08-09T15:04:07.7814972Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: W0809 15:03:25.688437 731 manager.go:1176] Failed to process watch event {EventType:0 Name:/actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13 WatchSource:0}: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7819157Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.707457 731 kubelet.go:2291] "Error getting node" err="node \"my-cluster-control-plane-bl75w\" not found"
2021-08-09T15:04:07.7821325Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724037 731 cpu_manager.go:199] "Starting CPU manager" policy="none"
2021-08-09T15:04:07.7823153Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724065 731 cpu_manager.go:200] "Reconciling" reconcilePeriod="10s"
2021-08-09T15:04:07.7824925Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724100 731 state_mem.go:36] "Initialized new in-memory state store"
2021-08-09T15:04:07.7826629Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724328 731 state_mem.go:88] "Updated default CPUSet" cpuSet=""
2021-08-09T15:04:07.7828391Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724365 731 state_mem.go:96] "Updated CPUSet assignments" assignments=map[]
2021-08-09T15:04:07.7830298Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724373 731 policy_none.go:44] "None policy: Start"
2021-08-09T15:04:07.7832320Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.729766 731 node_container_manager_linux.go:57] "Failed to create cgroup" err="Unit kubepods.slice already exists." cgroupName=[kubepods]
2021-08-09T15:04:07.7835385Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.729792 731 kubelet.go:1384] "Failed to start ContainerManager" err="Unit kubepods.slice already exists."
2021-08-09T15:04:07.7837440Z Aug 09 15:03:25 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
2021-08-09T15:04:07.7839170Z Aug 09 15:03:25 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Failed with result 'exit-code'.
2021-08-09T15:04:07.7840904Z Aug 09 15:03:26 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 14.
...
2021-08-09T15:04:08.0128001Z Aug 09 15:03:42 my-cluster-control-plane-bl75w kubelet[1286]: E0809 15:03:42.072174 1286 node_container_manager_linux.go:57] "Failed to create cgroup" err="Unit kubepods.slice already exists." cgroupName=[kubepods]
2021-08-09T15:04:08.0130264Z Aug 09 15:03:42 my-cluster-control-plane-bl75w kubelet[1286]: E0809 15:03:42.072201 1286 kubelet.go:1384] "Failed to start ContainerManager" err="Unit kubepods.slice already exists."
2021-08-09T15:04:08.0132662Z Aug 09 15:03:42 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
2021-08-09T15:04:08.0134404Z Aug 09 15:03:42 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Failed with result 'exit-code'.
2021-08-09T15:04:08.0136336Z Aug 09 15:03:43 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 23.
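Since CAPD nodes are just Docker containers, logs like the ones above can be pulled straight out of the node. A sketch (node name taken from the logs above; `dump_kubelet_logs` and `crash_signature` are hypothetical helpers):

```shell
#!/usr/bin/env bash
# Pull kubelet logs from a CAPD control-plane node. With CAPD the "node" is
# a Docker container running systemd, so journalctl works inside it.
dump_kubelet_logs() {
  local node="${1:-my-cluster-control-plane-bl75w}"
  docker exec "$node" journalctl -u kubelet --no-pager
}

# Count occurrences of the crash signature seen above in a saved log file.
crash_signature() {
  grep -c 'Unit kubepods.slice already exists' "$1"
}
```

Saving the `dump_kubelet_logs` output and running `crash_signature` over it gives a quick sense of how many restart cycles hit the same cgroup error.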
Same, I noticed it here too after using the latest main and your debug code - https://github.com/karuppiah7890/tce/runs/3291132133?check_suite_focus=true
I noticed that there's already a discussion going on about fixing this in our internal Slack; it seems to be a Kubernetes-level thing, a runc thing:
https://github.com/kubernetes/kubernetes/issues/102676
Clicked "subscribe" on the k/k issue.
Until that lands and we're able to consume it, is there a way to override the Kubernetes release we're using and set it to 1.20.x to work around this?
I see KUBERNETES_RELEASE and KUBERNETES_VERSION params in the ~/.config/tanzu/tkg/providers/config_default.yaml file, in the "you shouldn't be touching this" section, but it's not clear to me whether one of those is supposed to allow this sort of thing in a break-glass scenario.
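For reference, the relevant section might look roughly like this. This is a hypothetical excerpt; the exact default values vary by TKG release, and whether overriding them here is supported for the docker provider is exactly the open question:

```yaml
# Hypothetical excerpt of ~/.config/tanzu/tkg/providers/config_default.yaml
# ("you shouldn't be touching this" section). Values are placeholders.
KUBERNETES_RELEASE: ""   # e.g. a TKr name pinning a specific k8s version
KUBERNETES_VERSION: ""   # e.g. v1.20.8+vmware.1 for a downgrade attempt
```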
I was also thinking about overriding the Kubernetes version. I think we support it because there are TKrs for three Kubernetes versions for a given TKG release. I have yet to look into the override in more detail, though; I'll be doing that tomorrow morning.
I wasn't able to install a particular version of Kubernetes on top of Docker. Were you able to do it, @mcwumbly? I just filed an issue: #1264
No, I haven't had a chance to try doing that myself.
We were able to hack in a downgrade of Kubernetes (to v1.20.8) and verified that a cluster does get past the kubelet cgroups error. This suggests to us that running a cluster in a container may work once the cgroups fix is cherry-picked back to 1.21.
Feature Request
I expect the tests added in #753 to be executed regularly against the latest code so we can be sure that the behavior they specify still works (namely, that the external-dns package works) and that the test framework itself works.
We should also take the contributor experience into consideration and, if possible, make it easier (not harder) to run these tests locally. By running them in CI, contributors will also have executable documentation to reference when we want to run them ourselves. But there is a risk of introducing additional overhead if they are automated in a way that isn't easy for a human to execute on demand. Something to keep in mind...
Additional context
While reviewing #753, I was confronted with the fact that the tests are not part of CI. This was apparent because:
This likely depends on https://github.com/vmware-tanzu/tanzu-framework/issues/206 and #1109 and https://github.com/vmware-tanzu/tce/issues/1324 being resolved first.