vmware-tanzu / community-edition

VMware Tanzu Community Edition is no longer an actively maintained project. Code is available for historical purposes only.
https://tanzucommunityedition.io/
Apache License 2.0

Containers don't restart after CAPD install #832

Closed thesteve0 closed 1 year ago

thesteve0 commented 3 years ago

Bug Report

After a local standalone install, if the computer is shut down or restarted, the containers for the kube cluster don't come back up.

Expected Behavior

By default, the TCE containers in the standalone Kube cluster should start every time the Docker engine starts. There should be a command or flag to turn off that default behavior.

Steps to Reproduce the Bug

Install TCE standalone, turn off the computer, start it again, and wait for Docker to boot. There will be no cluster.

Environment Details

jpmcb commented 3 years ago

Thanks for the feedback!

There was some outside discussion on this. It looks like Cluster API does provide the "restart" annotations, but we aren't using those in the core library. I think there's an opportunity to make this better in the future by bringing restart into core.

Will leave this open to keep track of the issue

karuppiah7890 commented 3 years ago

Is this the same as #770 ?

jpmcb commented 3 years ago

Did some digging on this (and thanks to @stmcginnis for all the context!) and there is a path forward

Docker containers can be restarted automatically, and kind has supported this for the last year.

The problem is with the Cluster API docker provider (CAPD) here. It needs to be added to the createNode method but https://github.com/kubernetes-sigs/cluster-api/pull/4413 will enable us to add additional properties to the configuration using the Docker SDK (instead of calling out to the docker CLI).

So, TLDR, this is a problem with CAPD and will be configurable soon. We can use this issue as a high level tracker for getting this to eventually work in TCE
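For containers that already exist, Docker's restart policy can be changed with `docker update --restart`. The following is a minimal sketch, not something TCE or CAPD does: `set_restart_policy` and the `DOCKER` override are hypothetical names introduced for illustration. A restart policy only brings the containers back after the engine starts; it does not address the IP address changes discussed later in this thread.

```shell
# Hypothetical helper: apply a Docker restart policy to existing containers.
# DOCKER may be set to "echo" for a dry run that only prints the commands.
DOCKER="${DOCKER:-docker}"

set_restart_policy() {
  policy="$1"
  shift
  for name in "$@"; do
    # `docker update --restart` changes the policy of an existing container.
    $DOCKER update --restart "$policy" "$name" || return 1
  done
}
```

One possible use would be `set_restart_policy unless-stopped $(docker ps -q --filter network=kind)`, assuming the cluster containers are attached to the `kind` network.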

stmcginnis commented 3 years ago

This should be fixed with https://github.com/kubernetes-sigs/cluster-api/pull/5021 merged to CAPD. We still need to update our dependency to CAPI v0.4.1 or later though. Since there are some major changes between our current v0.3.x dependency and v0.4.x, this may take some time.

thesteve0 commented 3 years ago

Depends on #1431

jbeda commented 3 years ago

Note that even after manually starting the clusters (docker start $(docker ps -qa)) there are somehow certificate errors when talking to workload clusters.

randomvariable commented 3 years ago

Certificate errors are likely to be related to IP changes, as per https://github.com/kubernetes-sigs/kind/issues/1689

randomvariable commented 2 years ago

Also you may see this as well https://github.com/kubernetes-sigs/cluster-api/issues/4874#issuecomment-916846773

nimbusscale commented 2 years ago

I suspect this is in the scope of this issue, but ideally, we should be able to stop and start Docker-based clusters as well. I'm personally not interested in having clusters start automatically when I reboot my laptop, but I'd like to be able to start them manually. Let me know if this needs to be tracked in a separate issue and I will create one.

randomvariable commented 2 years ago

Let me know if this needs to be tracked in a separate issue and I will create one.

I think it's the same underlying problem, but you might want to open an issue for enabling the use case via the CLI/UX. After looking into this some more, I think we discovered the root of the problem in https://github.com/etcd-io/etcd/issues/13340 . There's still some debate about the path forward, but we're dependent on some k8s & etcd changes upstream.

bradwinfield commented 2 years ago

Just to note, I restarted the docker containers (TCE management cluster) in the order that leaves the IP addresses for each container the same as before docker shutdown. I had to create a dummy container to take the .2 address that kind used during the build. This solves the certificate issue, but the cluster still does not respond to tanzu or kubectl commands.

randomvariable commented 2 years ago

Yeah, I'm pretty certain this is the etcd issues. Given the work being done on local clusters, I'm not sure how important this is anymore from a TCE perspective.

jpmcb commented 2 years ago

Leaving this open regardless of the standalone cluster overhaul: we should investigate how this works for the new standalone cluster model, adjust our kind provider as needed, and gather community feedback

For reference, here is the new standalone-cluster proposal that uses a different model with a much lighter-weight methodology. Please look at the proposal, try it out, and give your feedback there!

grtrout commented 2 years ago

I take it this is still an issue? I spent the last hour trying to figure out why my newly-created TCE cluster seems to be completely broken after I restarted my laptop. I get the reasons why this happens (sort of), but I thought one of the primary motivations behind TCE was to provide an experience akin to kind or minikube for local environments?

stmcginnis commented 2 years ago

I thought one of the primary motivations behind TCE was to provide an experience akin to kind or minikube for local environments?

Well, yes and no. With our standalone-cluster and unmanaged-cluster implementations, we are targeting the ability to deploy local clusters for development. But TCE as a whole, no.

@grtrout were you using the standalone-cluster command, or the new unmanaged-cluster command in the v0.10.0 release candidate?

There are issues with the way standalone-cluster works, so there is not a solution there. That command is being deprecated and replaced by unmanaged-cluster.

If you use the unmanaged-cluster command, this can work but there are a couple gotchas. When deploying the cluster you will need to specify calico for the cni. The version of antrea we use by default is not able to handle the restart. On restart, the containers usually get assigned new IP addresses that cause problems there.

So the command tanzu unmanaged-cluster create --cni=calico foo should get a working cluster that (in most cases) should be able to survive a reboot. There may be some other conditions that cause problems with this, but so far in my experience it has been working fine.

grtrout commented 2 years ago

Hey @stmcginnis, thanks for the quick reply. I've used Tanzu/TKG at a previous job and since I want to continue keeping up with the Tanzu ecosystem at my new job, I thought using TCE in my local environment might be smart. I'm also thinking about using it in a homelab, but that's a different thing...

In any case, yes, I just followed along with the "getting started" docs and created a standalone cluster. I had to hack a few things to get it working (e.g., toggling the deprecatedCgroupv1 value to true, restarting the process 3 or 4 times, etc.), but ultimately it was up and running and functional... until I restarted my laptop.

It looks like I should check out the unmanaged-cluster, but until now I was not aware of that. I would rather use Calico over Antrea anyway, so I'm good with that. I'll try this out later today. Thanks again!

butch7903 commented 2 years ago

I thought one of the primary motivations behind TCE was to provide an experience akin to kind or minikube for local environments?

Well, yes and no. With our standalone-cluster and unmanaged-cluster implementations, we are targeting the ability to deploy local clusters for development. But TCE as a whole, no.

@grtrout were you using the standalone-cluster command, or the new unmanaged-cluster command in the v0.10.0 release candidate?

There are issues with the way standalone-cluster works, so there is not a solution there. That command is being deprecated and replaced by unmanaged-cluster.

If you use the unmanaged-cluster command, this can work but there are a couple gotchas. When deploying the cluster you will need to specify calico for the cni. The version of antrea we use by default is not able to handle the restart. On restart, the containers usually get assigned new IP addresses that cause problems there.

So the command tanzu unmanaged-cluster create --cni=calico foo should get a working cluster that (in most cases) should be able to survive a reboot. There may be some other conditions that cause problems with this, but so far in my experience it has been working fine.

Looks like tanzu unmanaged-cluster create --cni=calico foo is the fix. The issue I am seeing now is that the kapp-controller crashes every so often after a reboot. Any ideas why that would be, or how we could troubleshoot it further?

stmcginnis commented 2 years ago

@seemiller any tips on troubleshooting kapp-controller? Or someone we can pull in from the carvel project to take a look?

joshrosso commented 2 years ago

@butch7903 when you say it's crashing, can you help us understand:

butch7903 commented 2 years ago

Yes, it is in CrashLoopBackOff. The crashes begin right after reboot and continue indefinitely. So far it is at 211 restarts.

[images]

joshrosso commented 2 years ago

Looks like the events are cut off, as I don't see the Error in here.


Also, if you want to join us in Slack, we can probably help you troubleshoot there.

Thanks!

butch7903 commented 2 years ago

Looks like I should have tested a bit more. I blew that tanzu unmanaged-cluster away and simply rebuilt it, waited for all of it to completely come up, and then rebooted. The kapp controller is no longer restarting, so all I can think of is that I must have rebooted before it had time to complete the setup of the kapp controller, which left it in a bad state on every later reboot.

thesteve0 commented 2 years ago

Since this is referring to standalone rather than unmanaged perhaps we should close this one.

jpmcb commented 2 years ago

Since this is referring to standalone rather than unmanaged perhaps we should close this one.

Yes. unmanaged-cluster doesn't suffer from this original issue. If users are using antrea CNI at the time of this writing, they may still experience issues when restarting their clusters.

Reference https://github.com/vmware-tanzu/community-edition/issues/3564 for further details

jorgemoralespou commented 2 years ago

@jpmcb This issue is still relevant as it affects managed-clusters with CAPD. https://kubernetes.slack.com/archives/C02GY94A8KT/p1648732902925359

joshrosso commented 2 years ago

@jpmcb This issue is still relevant as it affects managed-clusters with CAPD. https://kubernetes.slack.com/archives/C02GY94A8KT/p1648732902925359

Agreed. Reopening.

To be entirely transparent, I don’t foresee CAPD-based restart support being in our near-term future.

jorgemoralespou commented 2 years ago

Then I would make it more prominent at the top of the doc page and not at the very bottom https://tanzucommunityedition.io/docs/v0.11/docker-install-mgmt/

RussellHamker commented 2 years ago

It appears that the issue with a management cluster reboot is that when docker comes back up, the containers do not start in the same order or do not get the same IP addresses.

My build process for a management cluster today in Docker:

    # Write the management cluster configuration
    cat <<EOF > tce-mgmt.yaml
    CLUSTER_CIDR: 100.96.0.0/11
    CLUSTER_NAME: tce-mgmt
    CLUSTER_PLAN: dev
    ENABLE_MHC: "false"
    IDENTITY_MANAGEMENT_TYPE: none
    INFRASTRUCTURE_PROVIDER: docker
    LDAP_BIND_DN: ""
    LDAP_BIND_PASSWORD: ""
    LDAP_GROUP_SEARCH_BASE_DN: ""
    LDAP_GROUP_SEARCH_FILTER: ""
    LDAP_GROUP_SEARCH_GROUP_ATTRIBUTE: ""
    LDAP_GROUP_SEARCH_NAME_ATTRIBUTE: cn
    LDAP_GROUP_SEARCH_USER_ATTRIBUTE: DN
    LDAP_HOST: ""
    LDAP_ROOT_CA_DATA_B64: ""
    LDAP_USER_SEARCH_BASE_DN: ""
    LDAP_USER_SEARCH_FILTER: ""
    LDAP_USER_SEARCH_NAME_ATTRIBUTE: ""
    LDAP_USER_SEARCH_USERNAME: userPrincipalName
    OIDC_IDENTITY_PROVIDER_CLIENT_ID: ""
    OIDC_IDENTITY_PROVIDER_CLIENT_SECRET: ""
    OIDC_IDENTITY_PROVIDER_GROUPS_CLAIM: ""
    OIDC_IDENTITY_PROVIDER_ISSUER_URL: ""
    OIDC_IDENTITY_PROVIDER_NAME: ""
    OIDC_IDENTITY_PROVIDER_SCOPES: ""
    OIDC_IDENTITY_PROVIDER_USERNAME_CLAIM: ""
    OS_ARCH: ""
    OS_NAME: ""
    OS_VERSION: ""
    SERVICE_CIDR: 100.64.0.0/13
    TKG_HTTP_PROXY_ENABLED: "false"
    EOF

    # Create the cluster and fetch its admin kubeconfig
    tanzu management-cluster create -f tce-mgmt.yaml --cni=calico
    tanzu management-cluster kubeconfig get tce-mgmt --admin

    # Check container IPs on the kind network, then check the cluster
    docker network inspect kind
    kubectl config use-context tce-mgmt-admin@tce-mgmt
    kubectl get nodes
    kubectl get po -A

Prior to reboot: [image]

Post reboot: [image]

Is there a way we could force the tanzu mgmt docker containers to retain their IPs possibly? If so, this might be an easy fix.

stmcginnis commented 2 years ago

the containers do not start in the same order or do not get the same IP addresses

If I remember right, this was the case when I looked closer, and it matches what I've heard from other upstream projects like kind. There really needs to be some sort of IPAM integration in docker that would allow for assigning persistent IP addresses to containers, so that multi-node clusters can reliably survive docker engine restarts. Without that, it's possible you can restart and get your full cluster back, but it's not very likely.

RussellHamker commented 2 years ago

Looks like they already have that... https://www.cloudsavvyit.com/14508/how-to-assign-a-static-ip-to-a-docker-container/

RussellHamker commented 2 years ago

VMware TCE just needs to update their tanzu mgmt creation to include the specific IPs, or offer flags that let us set the IPs/network for the 3 distinct docker containers.

stmcginnis commented 2 years ago

Looks like they already have that...

Yeah, the capability exists to set an address. The missing piece is an IPAM to manage what to set.
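The capability mentioned here is exposed by the Docker CLI as `docker network connect --ip`. A minimal sketch follows, assuming a user-defined network with a configured subnet (which `--ip` requires). `pin_ip` and the `DOCKER` override are hypothetical names, and this has not been verified against a CAPD cluster. It only shows how one address would be set, not how the set of addresses would be chosen or tracked.

```shell
# Hypothetical helper: pin a container to a fixed address on a network.
# DOCKER may be set to "echo" for a dry run that only prints the commands.
DOCKER="${DOCKER:-docker}"

pin_ip() {
  network="$1"
  name="$2"
  ip="$3"
  # Detach, then re-attach the container with an explicit IPv4 address.
  $DOCKER network disconnect "$network" "$name" || return 1
  $DOCKER network connect --ip "$ip" "$network" "$name"
}
```

With `DOCKER=echo`, a call such as `pin_ip kind some-container 172.18.0.3` prints the two commands without running them; the container name and address here are placeholders.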

RussellHamker commented 2 years ago

I don't think IPAM is needed; you just need to set the IPs and network during the build process somehow. Where is the tanzu management yaml file stored after cluster creation?

RussellHamker commented 2 years ago

My temporary workaround for this, until I can find a better method, is this:

Restart process

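    CLUSTER="tce-mgmt"
    # Stop everything, then start the load balancer before the other containers
    docker stop $(docker ps -a -q)
    docker start "$CLUSTER-lb"
    docker start $(docker ps -a -q)
    # Check the assigned IPs and the cluster state
    docker network inspect kind
    kubectl config use-context "$CLUSTER-admin@$CLUSTER"
    kubectl get nodes
    kubectl get po -A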

This causes the loadbalancer to always grab the .3 IP out of the IP pool instead of something else. The better thing long term will be to set the IP somehow for each container so that they come back up with the right IP on reboot. I will further look into this.
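The idea of restarting containers in an order that reproduces their earlier addresses can be sketched more generally. This is an illustration under assumptions, not a tested fix: it assumes the containers sit on the Docker network named `kind`, that Docker hands out addresses in start order, and that the addresses were recorded while the cluster was still healthy. `save_ips`, `order_by_ip`, `start_in_ip_order`, `IP_FILE`, and the `DOCKER` override are hypothetical names.

```shell
# Hypothetical helpers for restarting containers in their previous IP order.
# Assumes the containers are attached to the Docker network named "kind".
DOCKER="${DOCKER:-docker}"
IP_FILE="${IP_FILE:-./kind-ips.txt}"

# Run while the cluster is healthy: writes "IP NAME" lines to $IP_FILE.
save_ips() {
  ids="$($DOCKER ps -q --filter network=kind)"
  [ -n "$ids" ] || return 1
  # .Name is reported with a leading slash, which sed strips.
  $DOCKER inspect -f '{{.NetworkSettings.Networks.kind.IPAddress}} {{.Name}}' $ids \
    | sed 's# /# #' > "$IP_FILE"
}

# Reads "IP NAME" lines on stdin and prints the names in ascending IP order.
order_by_ip() {
  sort -t. -k1,1n -k2,2n -k3,3n -k4,4n | awk '{print $2}'
}

# Run after a reboot: stops the recorded containers, then starts them in IP order.
start_in_ip_order() {
  names="$(order_by_ip < "$IP_FILE")"
  $DOCKER stop $names
  for name in $names; do
    $DOCKER start "$name"
  done
}
```

The intended flow would be `save_ips` before shutting down and `start_in_ip_order` after the reboot. Gaps in the address sequence are not handled, and other comments in this thread suggest that matching IPs alone may not be enough to bring the cluster back.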

opsline-jvarelas commented 1 year ago

Hi @RussellHamker, I've been trying to execute your workaround without success. Do you remember the exact order for all containers? I had management and workload clusters running, but after the reboot they are not working.

Alternatively, is there another way to run Tanzu management and workload clusters on a Linux server? I found the Docker option only in the TCE version. Thanks in advance.