data manager will not start in vSphere 8.0 with Tanzu #521

ChrisJLittle commented 1 year ago

Describe the bug

After deploying and configuring the Velero data manager (OVA, vSphere 8 with Tanzu), the velero-datamgr.service fails to start because it can't find the velero-token secret in the velero namespace.

To Reproduce

Started with a vSphere 8.0b with Tanzu (NSX-T 4.1) environment with Workload Management enabled. Installed Velero Operator 1.3.0 as a Supervisor Service.

This by itself was problematic, as the velero-vsphere-operator and velero-vsphere-operator-webhook deployments were set to tolerate "master" nodes, but vSphere 8 with Tanzu supervisor control plane nodes are tainted with "control-plane". I was able to work around this by editing the toleration on both deployments, after which all pods came up.
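The toleration change described above could be scripted roughly as follows. This is a minimal sketch, not the exact edit that was made: it appends a toleration instead of rewriting the existing "master" entry, it assumes the taint uses the standard `node-role.kubernetes.io/control-plane` key, and the function names are hypothetical. The namespace argument is also an assumption, since the operator's namespace is not stated here.

```shell
#!/bin/sh
# Sketch only. Assumes the control plane taint key is the standard
# node-role.kubernetes.io/control-plane; function names are hypothetical.

# Print a JSON Patch that appends one toleration to a Deployment's pod template.
# No "effect" is set, so the toleration matches every effect for this key.
build_toleration_patch() {
  printf '%s' '[{"op":"add","path":"/spec/template/spec/tolerations/-","value":{"key":"node-role.kubernetes.io/control-plane","operator":"Exists"}}]'
}

# Usage: add_control_plane_toleration <namespace> <deployment>
# Requires that the deployment already has a tolerations list to append to.
add_control_plane_toleration() {
  kubectl -n "$1" patch deployment "$2" --type=json -p "$(build_toleration_patch)"
}
```

Calling `add_control_plane_toleration <operator-namespace> velero-vsphere-operator`, and again for `velero-vsphere-operator-webhook`, would apply it to both deployments. Appending keeps the original toleration in place, so the same manifest still schedules on clusters that use the older taint.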

Created velero supervisor namespace and assigned permissions and storage. Created velero-vsphere-plugin-config configmap in velero namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: velero-vsphere-plugin-config
data:
  cluster_flavor: SUPERVISOR
  vsphere_secret_name: velero-vsphere-config-secret
  vsphere_secret_namespace: velero

Installed the velero-vsphere 1.4.2 binary. Ran velero-vsphere install:

velero-vsphere install  --namespace velero --version v1.9.2 --image velero/velero:v1.9.2 --provider aws --plugins velero/velero-plugin-for-aws:v1.6.1,vsphereveleroplugin/velero-plugin-for-vsphere:v1.4.2 --bucket velero --secret-file /home/ubuntu/Velero/s3-credentials --snapshot-location-config region=minio --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=

The backup-driver deployment had the same issue as the two previously noted deployments and had to be edited so that its tolerations were for "control-plane" and not "master".

Everything seemed to be up and running as expected at this point. Deployed the 1.4.2 data manager OVA to vSphere. Configured the advanced parameters as needed for my environment:

guestinfo.cnsdp.vcUser, guestinfo.cnsdp.vcAddress, guestinfo.cnsdp.vcPasswd, guestinfo.cnsdp.wcpControlPlaneIP

Powered on the data manager VM.

I tested a backup that included a PVC and noticed that the upload never left the new state.

I logged in to the data manager VM and saw that velero-datamgr.service had crashed with the following output:

Mar 21 23:10:33 photon-cnsdp velero-vsphere-plugin-datamgr.sh[504]: [1B blob data]
Mar 21 23:10:33 photon-cnsdp velero-vsphere-plugin-datamgr.sh[504]: If the context you wish to use is not in this list, you may need to try
Mar 21 23:10:33 photon-cnsdp velero-vsphere-plugin-datamgr.sh[504]: logging in again later, or contact your cluster administrator.
Mar 21 23:10:33 photon-cnsdp velero-vsphere-plugin-datamgr.sh[504]: [1B blob data]
Mar 21 23:10:33 photon-cnsdp velero-vsphere-plugin-datamgr.sh[504]: To change context, use `kubectl config use-context <workload n
Mar 21 23:10:33 photon-cnsdp velero-vsphere-plugin-datamgr.sh[504]: [1B blob data]
Mar 21 23:10:33 photon-cnsdp velero-vsphere-plugin-datamgr.sh[504]: Switched to context "vi-user".
Mar 21 23:10:35 photon-cnsdp velero-vsphere-plugin-datamgr.sh[504]: Failed to get single valid velero service account
Mar 21 23:10:35 photon-cnsdp systemd[1]: velero-datamgr.service: Main process exited, code=exited, status=
Mar 21 23:10:35 photon-cnsdp systemd[1]: velero-datamgr.service: Failed with result 'exit-code'.

An examination of the /bin/velero-vsphere-plugin-datamgr.sh script showed that it expects a secret whose name contains "velero-token" in the velero namespace. The following secrets are present in the velero namespace:

NAME                                          TYPE                             DATA   AGE
cloud-credentials                             Opaque                           1      120m
velero-default-image-pull-secret              kubernetes.io/dockerconfigjson   1      123m
velero-default-image-push-secret              kubernetes.io/dockerconfigjson   1      123m
velero-restic-credentials                     Opaque                           1      118m
velero-vsphere-operator-object-store-secret   Opaque                           1      120m
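To make the failing condition concrete, here is a rough approximation of the lookup described above. It is a sketch under assumptions, not the code from /bin/velero-vsphere-plugin-datamgr.sh: the function names are hypothetical, and the "exactly one match" rule is inferred from the "Failed to get single valid velero service account" log line.

```shell
#!/bin/sh
# Sketch only: approximates the check described in this issue. It is NOT the
# actual data manager script, and the function names are hypothetical.

# Read `kubectl get secrets` table output on stdin and print the names of
# service-account-token secrets whose name contains "velero-token".
find_velero_token() {
  awk 'NR > 1 && $1 ~ /velero-token/ && $2 == "kubernetes.io/service-account-token" { print $1 }'
}

# Usage: has_single_velero_token [namespace]
# Succeeds only when exactly one matching secret exists (default: velero).
has_single_velero_token() {
  count=$(kubectl -n "${1:-velero}" get secrets | find_velero_token | wc -l | tr -d ' ')
  [ "$count" -eq 1 ]
}
```

Fed the listing above, `find_velero_token` prints nothing, because no secret of type kubernetes.io/service-account-token exists in the namespace.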

Expected behavior

I'm not sure whether the issue is with the data manager or with velero-vsphere (and/or the operator), but either a velero-token secret should be present or the data manager should be looking for something else.

Troubleshooting Information

I checked the same process on vSphere 7 U3 with Tanzu and it works as expected. In that environment, the velero operator version is 1.1, the velero-vsphere version is 1.1, the data manager OVA version is 1.1, the velero version is 1.5.1, and the velero-plugin-for-aws version is 1.1.

The following were the secrets present in the velero namespace in 7.0U3:

NAME                                          TYPE                                  DATA   AGE
cloud-credentials                             Opaque                                1      509d
default-token-r6d2n                           kubernetes.io/service-account-token   3      509d
velero-restic-credentials                     Opaque                                1      509d
velero-token-25rvr                            kubernetes.io/service-account-token   3      509d
velero-vsphere-operator-object-store-secret   Opaque                                1      509d
xing-yang commented 1 year ago

Can you take a look at step 3 here: https://github.com/vmware-tanzu/velero-plugin-for-vsphere/blob/main/docs/velero-vsphere-operator-user-manual.md#installing-velero-on-supervisor-cluster? Did you create a configmap?

ChrisJLittle commented 1 year ago

apiVersion: v1
kind: ConfigMap
metadata:
  name: velero-vsphere-plugin-config
data:
  cluster_flavor: SUPERVISOR
  vsphere_secret_name: velero-vsphere-config-secret
  vsphere_secret_namespace: velero

It was created in the velero namespace just prior to running velero-vsphere install.

Looking at what I have, I see two extra parameters, vsphere_secret_name and vsphere_secret_namespace, and I'm not sure where I got them from (maybe an older sample file?). Are these causing the problem?

deepakkinni commented 1 year ago

They shouldn't be there. Try without the extra params.

ChrisJLittle commented 1 year ago

I ran velero-vsphere uninstall, deleted and recreated the velero namespace, updated the configmap file and re-created the configmap, then reran velero-vsphere install. There is still no velero-token secret in the velero namespace. Is there anything that would need to be done with the operator?

deepakkinni commented 1 year ago

Can you share the logs? See https://github.com/vmware-tanzu/velero-plugin-for-vsphere/blob/main/docs/troubleshooting.md#project-pacific

Looking for:

  1. velero logs
  2. operator logs
ChrisJLittle commented 1 year ago

I re-did the configuration on a system where it had not been previously configured with the suspect configmap and got the same results. I'm attaching the velero and operator logs.

backup-driver.log velero.log velero-vsphere-operator.log

ChrisJLittle commented 1 year ago

And if it would help, I could grant you direct access to the environment.

ChrisJLittle commented 1 year ago

As a very quick workaround, I was able to create the missing token/secret and the container in the data manager VM came up.


apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  name: velero-token
  annotations:
    kubernetes.io/service-account.name: "velero"

kubectl -n velero apply -f velero-token.yaml

kubectl -n velero describe secret velero-token

Name:         velero-token
Namespace:    velero
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: velero
              kubernetes.io/service-account.uid: a868e1c1-f13b-40af-aecc-ca16e493388b

Type:  kubernetes.io/service-account-token

token:      eyJhbGciOiJSUzI1NiIsImtpZCI6Ii05RDc2OFdwLXM2QVlfM2hIdnQ5b2NoYVlSZE4tZ2RIVnlSV0pVS0FDd2sifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJ2ZWxlcm8iLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlY3JldC5uYW1lIjoidmVsZXJvLXRva2VuIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6InZlbGVybyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE4NjhlMWMxLWYxM2ItNDBhZi1hZWNjLWNhMTZlNDkzMzg4YiIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDp2ZWxlcm86dmVsZXJvIn0.IrHDLxNM_DIyx1By0nzRPBBJqv6HHgxdCpJqFKH3e9kv3pO9CUf2hlvpCXpRVlo8u33i24Z209N0P0nb1tiNgquxBbsJkJ3d4r31_6w38HHtLYEPjJc9Ct1DyR6i2gRWwT-RXfGPzffhIxTnrwdyCNhPhQQeZUp5ufwjJFuoa69M_IYKWm4LB6_HjN8TjkzHXldHsjow8ztYDV9I_izgxAgt-SLpiuo79Pk3PLNjXtp8P-DRyfIsoJ7yC5ZhPmjWwJpbWoHE5YnoCjZjJv0f81na-V1HMYeSLgDN0CscxPe0EepW_WyDd2vkepEDTGwSJWJ4IqMzPvxMWwik0aHnRA
ca.crt:     1099 bytes
namespace:  6 bytes
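The workaround above can be wrapped in a small script, sketched here under assumptions. Kubernetes 1.24 and later no longer auto-generate token secrets for service accounts, which may be why the secret is absent in this environment, though nothing in this thread confirms the cause. The function names are hypothetical, and the manifest fields mirror the one applied above.

```shell
#!/bin/sh
# Sketch only: wraps the manual workaround from this comment.
# Function names are hypothetical.

# Usage: render_token_secret [service-account]
# Print a service-account-token Secret manifest named "<service-account>-token".
render_token_secret() {
  sa="${1:-velero}"
  cat <<EOF
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  name: ${sa}-token
  annotations:
    kubernetes.io/service-account.name: "${sa}"
EOF
}

# Usage: create_token_secret [namespace] [service-account]
# Apply the manifest, then check that the secret holds a token. The token is
# filled in asynchronously, so this check may need a retry right after apply.
create_token_secret() {
  ns="${1:-velero}"
  sa="${2:-velero}"
  render_token_secret "$sa" | kubectl -n "$ns" apply -f -
  kubectl -n "$ns" get secret "${sa}-token" -o jsonpath='{.data.token}' | grep -q .
}
```

Running `create_token_secret velero velero` would reproduce the steps shown in this comment. Keeping the manifest rendering separate from the kubectl call makes it easy to inspect the YAML before applying it.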

On the data manager VM:

docker ps

CONTAINER ID        IMAGE                                                COMMAND                  CREATED             STATUS              PORTS               NAMES
fde6ec1adb2a        vsphereveleroplugin/data-manager-for-plugin:v1.4.1   "/datamgr server --u…"   4 minutes ago       Up 4 minutes                            velero-datamgr

I'll test a stateful backup later, but this is obviously much farther along than I was getting previously.

ChrisJLittle commented 1 year ago

Stateful backup of a vSphere pod/pvc was successful.