Closed by @mcwumbly 1 year ago
cc @vmware-tanzu/tce-releng
We tried something that looks like it might work: https://github.com/tylerschultz/tce/tree/e2e-external-dns-ci-gh-action
We also pushed that directly to this repo for testing the action here: https://github.com/vmware-tanzu/tce/tree/dave/delete-me
But it's currently hanging when creating the standalone cluster: https://github.com/vmware-tanzu/tce/runs/3255323470?check_suite_focus=true
cluster state is unchanged 55
cluster control plane is still being initialized, retrying
We have seen that succeed locally in the past, though, so perhaps there is something to sort out with getting it to run on the GitHub Actions runner. (We also see similar, though not exactly the same, failures here: https://github.com/vmware-tanzu/tce/actions/workflows/e2e-tce-docker-standalone-cluster.yaml)
cc @tylerschultz
Today, we spent our time trying to find a faster loop for reproducing the failure in CI:
Our branch has been updated: https://github.com/vmware-tanzu/tce/tree/dave/delete-me
It has the following changes (via shameless hacking):
With those changes, the build time drops from ~35 min to ~5 min, at which point a standalone cluster creation is attempted (and we expect to see it hang).
Here's the latest run: https://github.com/vmware-tanzu/tce/runs/3267160114?check_suite_focus=true
The CAPD logs show that the `kubeadm init` command fails on the cluster being created, though it's not clear why:
...
Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[... the two kubelet-check lines above repeat four more times ...]

    Unfortunately, an error has occurred:
        timed out waiting for the condition

    This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

    If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'

    Additionally, a control plane component may have crashed or exited when started by the container runtime.
    To troubleshoot, list all containers using your preferred container runtimes CLI.

    Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
        - 'crictl --runtime-endpoint /var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint /var/run/containerd/containerd.sock logs CONTAINERID'
(visit the workflow-run link above for more complete logs)
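For reference, the failing `[kubelet-check]` lines above are just polling the kubelet's local healthz endpoint. A small sketch of doing the same check by hand from inside the node container (`check_kubelet` is a hypothetical helper, not part of kubeadm or TCE):

```shell
#!/usr/bin/env bash
# Poll the kubelet healthz endpoint the same way kubeadm's kubelet-check
# does. Prints the endpoint body ("ok" when healthy) or a diagnostic when
# the connection is refused, as in the logs above.
check_kubelet() {
  local host="${1:-localhost}"
  curl -sSL --max-time 5 "http://${host}:10248/healthz" \
    || echo "kubelet not healthy on ${host}:10248"
}
```

Running this on the failing control-plane container should print the diagnostic line, matching the `connection refused` errors kubeadm reports.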
cc @karuppiah7890 as this may be helpful for comparing notes and debugging the other e2e tests in github actions as well...
Thanks a lot @mcwumbly!! I usually end up not building multiple times and instead use a stable release. Today I decided that, to be able to use the latest release, I'll just build once, host it in my fork repo, use https://github.com/gruntwork-io/fetch/ to fetch it, and then install it using install.sh. The artifact download did end up taking some time on every run because it was around 165MB.
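A sketch of that "build once, host in a fork, fetch in CI" approach using gruntwork-io/fetch. The repo, tag, and asset names are taken from the release linked below; `release_url` and `install_tce` are hypothetical helpers, and the exact layout inside the tarball is an assumption:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Coordinates of the prebuilt release asset (from the fork release below).
REPO_URL="https://github.com/karuppiah7890/tce"
TAG="v0.7.0-rc.1-karuppiah"
ASSET="tce-linux-amd64-v0.7.0-dev.2-karuppiah.tar.gz"

# Direct download URL for the asset, following GitHub's release URL scheme.
release_url() {
  echo "${REPO_URL}/releases/download/${TAG}/${ASSET}"
}

# Fetch the asset with gruntwork-io/fetch, unpack it, and run the bundled
# installer (assumes the tarball ships an install.sh at its top level).
install_tce() {
  local dest="${1:-/tmp/tce-release}"
  fetch --repo="$REPO_URL" --tag="$TAG" --release-asset="$ASSET" "$dest"
  tar -xzf "${dest}/${ASSET}" -C "$dest"
  (cd "$dest"/tce-linux-amd64-* && ./install.sh)
}
```

This skips the ~35 min build in every CI run at the cost of the ~165MB download mentioned above.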
I also created a debug script some weeks ago, it's here - https://github.com/karuppiah7890/tce/blob/e2e-docker-ga-trial/test/docker/debug-tce-install.sh
This is very very cool @mcwumbly ! 😁
With the latest Tanzu CLI being used in TCE, I'm not able to just install the Tanzu CLI plus the TCE standalone-cluster plugin. I just tried and it didn't work out. I'm going to dig into this tomorrow or later, I guess. I've noticed it a couple of times now as part of TCE installation issues and set it aside to get to other cluster-creation issues:
could not write file: open /home/runner/.local/share/tanzu-cli/tanzu-plugin-pinniped-auth: not a directory
Used this release - https://github.com/karuppiah7890/tce/releases/tag/v0.7.0-rc.1-karuppiah , https://github.com/karuppiah7890/tce/releases/download/v0.7.0-rc.1-karuppiah/tce-linux-amd64-v0.7.0-dev.2-karuppiah.tar.gz tar ball
I think the error comes up regardless of which Tanzu command is executed. I was trying the `create` and `delete` commands that are in the `standalone-cluster` plugin, and then the `tanzu plugin repo add` command.
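The "not a directory" error above usually means some component of the target path exists as a regular file rather than a directory. A quick diagnostic sketch for the path in that error (`explain_path` is a hypothetical helper):

```shell
#!/usr/bin/env bash
# Classify a filesystem path: a proper directory, something that exists but
# is not a directory (the likely cause of "not a directory" errors when a
# tool tries to write files under it), or missing entirely.
explain_path() {
  local p="$1"
  if [ -d "$p" ]; then
    echo "directory"
  elif [ -e "$p" ]; then
    echo "exists but is not a directory"
  else
    echo "missing"
  fi
}

# The path from the error message above:
explain_path "$HOME/.local/share/tanzu-cli"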
@karuppiah7890 - have you ever been able to get a standalone cluster to successfully create in a GitHub actions runner? Is that something that used to work and may have broken at some point? Or is it something we have yet to get working?
It has worked before, but I think it was a long time ago. Given it's an E2E test and there are multiple components involved (providers, and the k8s cluster itself with lots of components), many things have gone wrong over time.
Here's a green pipeline running in my fork - https://github.com/karuppiah7890/tce/runs/3044224661?check_suite_focus=true . There are a few greens in the TCE repo too.
But I must admit that the number of reds is far higher than the greens. Also, we now run the AWS E2E tests on every commit, and it's all red all over the place 🙈 I'm wondering if it could be made a nightly until all the issues and errors are resolved and it's stable enough.
Actually, the very few greens that were present in the actions list are gone now because of a change to the workflow config YAML file name, so as of now you cannot find a single green pipeline in the main TCE repo. GitHub Actions uses the workflow config YAML file name in the URL and uses that to track pipelines :/ Like this: https://github.com/vmware-tanzu/tce/actions/workflows/e2e-all-tests.yaml?query=is%3Asuccess
Now fetching the kubelet logs on the control plane node, I see it crashing repeatedly with this error:
2021-08-09T15:04:07.7581488Z Aug 09 15:03:23 my-cluster-control-plane-bl75w kubelet[682]: E0809 15:03:23.572437 682 node_container_manager_linux.go:57] "Failed to create cgroup" err="Unit kubepods.slice already exists." cgroupName=[kubepods]
2021-08-09T15:04:07.7585006Z Aug 09 15:03:23 my-cluster-control-plane-bl75w kubelet[682]: E0809 15:03:23.572572 682 kubelet.go:1384] "Failed to start ContainerManager" err="Unit kubepods.slice already exists."
2021-08-09T15:04:07.7587293Z Aug 09 15:03:23 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
2021-08-09T15:04:07.7589085Z Aug 09 15:03:23 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Failed with result 'exit-code'.
2021-08-09T15:04:07.7590831Z Aug 09 15:03:24 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 13.
longer snippet:
2021-08-09T15:04:07.7579296Z Aug 09 15:03:23 my-cluster-control-plane-bl75w kubelet[682]: I0809 15:03:23.566943 682 policy_none.go:44] "None policy: Start"
2021-08-09T15:04:07.7581488Z Aug 09 15:03:23 my-cluster-control-plane-bl75w kubelet[682]: E0809 15:03:23.572437 682 node_container_manager_linux.go:57] "Failed to create cgroup" err="Unit kubepods.slice already exists." cgroupName=[kubepods]
2021-08-09T15:04:07.7585006Z Aug 09 15:03:23 my-cluster-control-plane-bl75w kubelet[682]: E0809 15:03:23.572572 682 kubelet.go:1384] "Failed to start ContainerManager" err="Unit kubepods.slice already exists."
2021-08-09T15:04:07.7587293Z Aug 09 15:03:23 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
2021-08-09T15:04:07.7589085Z Aug 09 15:03:23 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Failed with result 'exit-code'.
2021-08-09T15:04:07.7590831Z Aug 09 15:03:24 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 13.
2021-08-09T15:04:07.7592505Z Aug 09 15:03:24 my-cluster-control-plane-bl75w systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
2021-08-09T15:04:07.7594074Z Aug 09 15:03:24 my-cluster-control-plane-bl75w systemd[1]: Started kubelet: The Kubernetes Node Agent.
2021-08-09T15:04:07.7596753Z Aug 09 15:03:24 my-cluster-control-plane-bl75w kubelet[731]: Flag --eviction-hard has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2021-08-09T15:04:07.7600625Z Aug 09 15:03:24 my-cluster-control-plane-bl75w kubelet[731]: Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2021-08-09T15:04:07.7604106Z Aug 09 15:03:24 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:24.980591 731 server.go:197] "Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead"
2021-08-09T15:04:07.7607277Z Aug 09 15:03:24 my-cluster-control-plane-bl75w kubelet[731]: Flag --eviction-hard has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2021-08-09T15:04:07.7610817Z Aug 09 15:03:24 my-cluster-control-plane-bl75w kubelet[731]: Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2021-08-09T15:04:07.7613907Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.008518 731 server.go:440] "Kubelet version" kubeletVersion="v1.21.2+vmware.1-360497810732255795"
2021-08-09T15:04:07.7615857Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.008900 731 server.go:851] "Client rotation is on, will bootstrap in background"
2021-08-09T15:04:07.7617960Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.011510 731 certificate_store.go:130] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
2021-08-09T15:04:07.7620522Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.013206 731 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
2021-08-09T15:04:07.7623773Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.168871 731 server.go:660] "--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /"
2021-08-09T15:04:07.7626014Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169137 731 container_manager_linux.go:278] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
2021-08-09T15:04:07.7634212Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169206 731 container_manager_linux.go:283] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:remote CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:systemd KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
2021-08-09T15:04:07.7643805Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169240 731 topology_manager.go:120] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
2021-08-09T15:04:07.7646156Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169255 731 container_manager_linux.go:314] "Initializing Topology Manager" policy="none" scope="container"
2021-08-09T15:04:07.7648309Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169263 731 container_manager_linux.go:319] "Creating device plugin manager" devicePluginEnabled=true
2021-08-09T15:04:07.7651101Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169348 731 util_unix.go:103] "Using this format as endpoint is deprecated, please consider using full url format." deprecatedFormat="/var/run/containerd/containerd.sock" fullURLFormat="unix:///var/run/containerd/containerd.sock"
2021-08-09T15:04:07.7653459Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169420 731 remote_runtime.go:62] parsed scheme: ""
2021-08-09T15:04:07.7655244Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169428 731 remote_runtime.go:62] scheme "" not registered, fallback to default scheme
2021-08-09T15:04:07.7657446Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169458 731 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}
2021-08-09T15:04:07.7659553Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169467 731 clientconn.go:948] ClientConn switching balancer to "pick_first"
2021-08-09T15:04:07.7662388Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169501 731 util_unix.go:103] "Using this format as endpoint is deprecated, please consider using full url format." deprecatedFormat="/var/run/containerd/containerd.sock" fullURLFormat="unix:///var/run/containerd/containerd.sock"
2021-08-09T15:04:07.7665256Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169513 731 remote_image.go:50] parsed scheme: ""
2021-08-09T15:04:07.7667159Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169519 731 remote_image.go:50] scheme "" not registered, fallback to default scheme
2021-08-09T15:04:07.7669350Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169532 731 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}
2021-08-09T15:04:07.7671467Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169540 731 clientconn.go:948] ClientConn switching balancer to "pick_first"
2021-08-09T15:04:07.7674064Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169601 731 kubelet.go:404] "Attempting to sync node with API server"
2021-08-09T15:04:07.7676057Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169623 731 kubelet.go:272] "Adding static pod path" path="/etc/kubernetes/manifests"
2021-08-09T15:04:07.7678239Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169644 731 kubelet.go:283] "Adding apiserver pod source"
2021-08-09T15:04:07.7680680Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169656 731 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
2021-08-09T15:04:07.7683340Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.169811 731 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
2021-08-09T15:04:07.7686902Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.184351 731 kuberuntime_manager.go:222] "Container runtime initialized" containerRuntime="containerd" version="v1.3.3-14-g449e9269" apiVersion="v1alpha2"
2021-08-09T15:04:07.7689891Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.374273 731 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
2021-08-09T15:04:07.7692285Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: For verbose messaging see aws.Config.CredentialsChainVerboseErrors
2021-08-09T15:04:07.7694261Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.375137 731 server.go:1190] "Started kubelet"
2021-08-09T15:04:07.7697627Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.376990 731 cri_stats_provider.go:369] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
2021-08-09T15:04:07.7700792Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.377110 731 kubelet.go:1306] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
2021-08-09T15:04:07.7713002Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.377612 731 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"my-cluster-control-plane-bl75w.1699ab970480c17a", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"my-cluster-control-plane-bl75w", UID:"my-cluster-control-plane-bl75w", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"my-cluster-control-plane-bl75w"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc03c6ecf565b9f7a, ext:587487459, loc:(*time.Location)(0x74bc600)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc03c6ecf565b9f7a, ext:587487459, loc:(*time.Location)(0x74bc600)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://172.18.0.3:6443/api/v1/namespaces/default/events": EOF'(may retry after sleeping)
2021-08-09T15:04:07.7722145Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.388215 731 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
2021-08-09T15:04:07.7723967Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.388441 731 server.go:149] "Starting to listen" address="0.0.0.0" port=10250
2021-08-09T15:04:07.7725680Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.395151 731 server.go:405] "Adding debug handlers to kubelet server"
2021-08-09T15:04:07.7727431Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.401007 731 volume_manager.go:271] "Starting Kubelet Volume Manager"
2021-08-09T15:04:07.7729490Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.403479 731 desired_state_of_world_populator.go:141] "Desired state populator starts to run"
2021-08-09T15:04:07.7732934Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.408557 731 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
2021-08-09T15:04:07.7735425Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.408900 731 client.go:86] parsed scheme: "unix"
2021-08-09T15:04:07.7738776Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.408984 731 client.go:86] scheme "unix" not registered, fallback to default scheme
2021-08-09T15:04:07.7742112Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.409214 731 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}
2021-08-09T15:04:07.7744370Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.409302 731 clientconn.go:948] ClientConn switching balancer to "pick_first"
2021-08-09T15:04:07.7749230Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.471386 731 manager.go:1123] Failed to create existing container: /actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7756690Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.473914 731 manager.go:1123] Failed to create existing container: /actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7761462Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.477843 731 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv4
2021-08-09T15:04:07.7763627Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.504720 731 kubelet.go:2291] "Error getting node" err="node \"my-cluster-control-plane-bl75w\" not found"
2021-08-09T15:04:07.7765766Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.586326 731 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv6
2021-08-09T15:04:07.7767676Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.586572 731 status_manager.go:157] "Starting to sync pod status with apiserver"
2021-08-09T15:04:07.7769424Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.586798 731 kubelet.go:1846] "Starting kubelet main sync loop"
2021-08-09T15:04:07.7771623Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.586930 731 kubelet.go:1870] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
2021-08-09T15:04:07.7774436Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.600832 731 kubelet_node_status.go:71] "Attempting to register node" node="my-cluster-control-plane-bl75w"
2021-08-09T15:04:07.7777125Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.601753 731 kubelet_node_status.go:93] "Unable to register node with API server" err="Post \"https://172.18.0.3:6443/api/v1/nodes\": EOF" node="my-cluster-control-plane-bl75w"
2021-08-09T15:04:07.7779932Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.606489 731 kubelet.go:2291] "Error getting node" err="node \"my-cluster-control-plane-bl75w\" not found"
2021-08-09T15:04:07.7785380Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.659313 731 manager.go:1123] Failed to create existing container: /actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7793848Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.660726 731 manager.go:1123] Failed to create existing container: /actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7801707Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: W0809 15:03:25.682751 731 manager.go:1176] Failed to process watch event {EventType:0 Name:/actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13 WatchSource:0}: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7810388Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.687644 731 kubelet.go:1870] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
2021-08-09T15:04:07.7814972Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: W0809 15:03:25.688437 731 manager.go:1176] Failed to process watch event {EventType:0 Name:/actions_job/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13 WatchSource:0}: failed to identify the read-write layer ID for container "51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13". - open /var/lib/docker/image/overlay2/layerdb/mounts/51e2c0405a8dcf5b06dd11f419411661a2efe3b97847f9926462a0064f72fe13/mount-id: no such file or directory
2021-08-09T15:04:07.7819157Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.707457 731 kubelet.go:2291] "Error getting node" err="node \"my-cluster-control-plane-bl75w\" not found"
2021-08-09T15:04:07.7821325Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724037 731 cpu_manager.go:199] "Starting CPU manager" policy="none"
2021-08-09T15:04:07.7823153Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724065 731 cpu_manager.go:200] "Reconciling" reconcilePeriod="10s"
2021-08-09T15:04:07.7824925Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724100 731 state_mem.go:36] "Initialized new in-memory state store"
2021-08-09T15:04:07.7826629Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724328 731 state_mem.go:88] "Updated default CPUSet" cpuSet=""
2021-08-09T15:04:07.7828391Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724365 731 state_mem.go:96] "Updated CPUSet assignments" assignments=map[]
2021-08-09T15:04:07.7830298Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: I0809 15:03:25.724373 731 policy_none.go:44] "None policy: Start"
2021-08-09T15:04:07.7832320Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.729766 731 node_container_manager_linux.go:57] "Failed to create cgroup" err="Unit kubepods.slice already exists." cgroupName=[kubepods]
2021-08-09T15:04:07.7835385Z Aug 09 15:03:25 my-cluster-control-plane-bl75w kubelet[731]: E0809 15:03:25.729792 731 kubelet.go:1384] "Failed to start ContainerManager" err="Unit kubepods.slice already exists."
2021-08-09T15:04:07.7837440Z Aug 09 15:03:25 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
2021-08-09T15:04:07.7839170Z Aug 09 15:03:25 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Failed with result 'exit-code'.
2021-08-09T15:04:07.7840904Z Aug 09 15:03:26 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 14.
...
2021-08-09T15:04:08.0128001Z Aug 09 15:03:42 my-cluster-control-plane-bl75w kubelet[1286]: E0809 15:03:42.072174 1286 node_container_manager_linux.go:57] "Failed to create cgroup" err="Unit kubepods.slice already exists." cgroupName=[kubepods]
2021-08-09T15:04:08.0130264Z Aug 09 15:03:42 my-cluster-control-plane-bl75w kubelet[1286]: E0809 15:03:42.072201 1286 kubelet.go:1384] "Failed to start ContainerManager" err="Unit kubepods.slice already exists."
2021-08-09T15:04:08.0132662Z Aug 09 15:03:42 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
2021-08-09T15:04:08.0134404Z Aug 09 15:03:42 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Failed with result 'exit-code'.
2021-08-09T15:04:08.0136336Z Aug 09 15:03:43 my-cluster-control-plane-bl75w systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 23.
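Since CAPD nodes are just Docker containers, logs like the ones above can be pulled straight out of the node. A sketch (node name taken from the logs above; `dump_kubelet_logs` and `crash_signature` are hypothetical helpers):

```shell
#!/usr/bin/env bash
# Pull kubelet logs from a CAPD control-plane node. With CAPD the "node" is
# a Docker container running systemd, so journalctl works inside it.
dump_kubelet_logs() {
  local node="${1:-my-cluster-control-plane-bl75w}"
  docker exec "$node" journalctl -u kubelet --no-pager
}

# Count occurrences of the crash signature seen above in a saved log file.
crash_signature() {
  grep -c 'Unit kubepods.slice already exists' "$1"
}
```

Saving the `dump_kubelet_logs` output and running `crash_signature` over it gives a quick sense of how many restart cycles hit the same cgroup error.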
Same, I noticed it here too after using the latest main and your debug code - https://github.com/karuppiah7890/tce/runs/3291132133?check_suite_focus=true
I noticed that there's already a discussion going on about fixing this in our internal Slack; it seems to be a Kubernetes-level thing, a runc thing:
https://github.com/kubernetes/kubernetes/issues/102676
Clicked "subscribe" on the k/k issue.
Until that lands and we're able to consume it, is there a way to override the Kubernetes release we're using and set it to 1.20.x to work around this?
I see KUBERNETES_RELEASE and KUBERNETES_VERSION params in the ~/.config/tanzu/tkg/providers/config_default.yaml file, in the "you shouldn't be touching this" section, but it's not clear to me whether one of those is supposed to allow this sort of thing in a break-glass scenario.
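For reference, the relevant section might look roughly like this. This is a hypothetical excerpt; the exact default values vary by TKG release, and whether overriding them here is supported for the docker provider is exactly the open question:

```yaml
# Hypothetical excerpt of ~/.config/tanzu/tkg/providers/config_default.yaml
# ("you shouldn't be touching this" section). Values are placeholders.
KUBERNETES_RELEASE: ""   # e.g. a TKr name pinning a specific k8s version
KUBERNETES_VERSION: ""   # e.g. v1.20.8+vmware.1 for a downgrade attempt
```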
I was also thinking about overriding the Kubernetes version. I think we support it because there are TKrs for three Kubernetes versions for a given TKG release. I have yet to look into the override in more detail, though; I'll be doing that tomorrow morning.
I wasn't able to install a particular version of Kubernetes on top of Docker. Were you able to do it, @mcwumbly? I just filed an issue: #1264
No, I haven't had a chance to try doing that myself.
We were able to hack in a downgrade of Kubernetes (to v1.20.8) and verified that a cluster does get past the kubelet cgroups error. This suggests to us that running a cluster in a container may work once the cgroups fix is cherry-picked back to 1.21.
Feature Request
I expect the tests added in #753 to be executed regularly against the latest code so we can be sure that the behavior they specify still works (namely, that the external-dns package works) and that the test framework itself works.
We should also take the contributor experience into consideration and, if possible, make it easier (not harder) to run these tests locally. By running them in CI, contributors will also have executable documentation to reference when we want to run them ourselves. But there is a risk of introducing additional overhead if they are automated in a way that isn't easy for a human to execute on demand. Something to keep in mind...
Additional context
While reviewing #753, I was confronted with the fact that the tests are not part of CI. This was apparent because:
This likely depends on https://github.com/vmware-tanzu/tanzu-framework/issues/206 and #1109 and https://github.com/vmware-tanzu/tce/issues/1324 being resolved first.