vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

E2E migration cases don't support k8s cluster switch correctly #8292

Open blackpiglet opened 1 month ago

blackpiglet commented 1 month ago

What steps did you take and what happened:

Prepare two k8s clusters and run the Velero E2E test cases (including the migration cases) against them, for example with this command:

CLOUD_PROVIDER=azure \
VELERO_SERVER_DEBUG_MODE=true \
DEFAULT_CLUSTER=nightly-test-1728875035788-azure-default-6-default \
STANDBY_CLUSTER=nightly-test-1728875035788-azure-standby-6-standby \
DEFAULT_CLUSTER_NAME=nightly-test-1728875035788-azure-default-6 \
STANDBY_CLUSTER_NAME=nightly-test-1728875035788-azure-standby-6 \
PLUGINS=gcr.io/velero-gcp/velero-plugin-for-microsoft-azure:main \
CREDS_FILE=/velero/workspace/E2E-debug/azure-credential BSL_CONFIG=resourceGroup=velero-nightly,storageAccount=veleronightly,subscriptionId=2261f3e7-d159-48fe-95a3-0e6a96e11159 \
BSL_BUCKET=velero-e2e-testing-1728875035788 \
ADDITIONAL_BSL_PLUGINS=gcr.io/velero-gcp/velero-plugin-for-aws:main \
ADDITIONAL_OBJECT_STORE_PROVIDER=aws ADDITIONAL_BSL_CONFIG=region=minio,s3ForcePathStyle=true,s3Url=http://minio.minio.svc:9000/ \
ADDITIONAL_BSL_BUCKET=velero-e2e-testing ADDITIONAL_BSL_PREFIX=additional \
ADDITIONAL_CREDS_FILE=/velero/workspace/E2E-debug/minio-credential-additional \
VELERO_IMAGE=gcr.io/velero-gcp/velero:main \
RESTORE_HELPER_IMAGE=gcr.io/velero-gcp/velero-restore-helper:main VERSION=main \
STANDBY_CLUSTER_CLOUD_PROVIDER=azure \
STANDBY_CLUSTER_OBJECT_STORE_PROVIDER=aws \
STANDBY_CLUSTER_PLUGINS=gcr.io/velero-gcp/velero-plugin-for-microsoft-azure:main \
DISABLE_INFORMER_CACHE=true \
VERSION=main \
REGISTRY_CREDENTIAL_FILE=/root/.docker/config.json \
GINKGO_LABELS='(!LongTime)' \
KIBISHII_DIRECTORY=/velero/workspace/E2E-debug/e2e/distributed-data-generator/kubernetes/yaml/ \
make test-e2e

The E2E run failed intermittently. The error always appeared after a migration case had run.

What did you expect to happen: The E2E should run successfully.

  [FAILED] in [It] - /velero/workspace/E2E-debug/e2e/velero/test/e2e/backups/deletion.go:76 @ 10/14/24 03:36:10.91

Test case failed and fail fast is enabled. Skip resource clean up.

• [FAILED] [21.079 seconds]

Velero tests of snapshot backup deletion when kibishii is the sample workload [It] Deleted backups are deleted from object storage and backups deleted from object storage can be deleted locally [Backups, Deletion, Snapshot, SkipVanillaZfs]

/velero/workspace/E2E-debug/e2e/velero/test/e2e/backups/deletion.go:75

  [FAILED] Failed to run backup deletion test

  Expected success, but got an error:

      <*errors.withStack | 0xc0008302b8>: 

      Failed to install and prepare data for kibishii backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0: Failed to install Kibishii workload: failed to install kibishii, stderr=# Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.

      Error from server (NotFound): error when creating "github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure": namespaces "backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0" not found

      Error from server (NotFound): error when creating "github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure": namespaces "backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0" not found

      Error from server (NotFound): error when creating "github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure": namespaces "backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0" not found

      Error from server (NotFound): error when creating "github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure": namespaces "backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0" not found

      Error from server (NotFound): error when creating "github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure": namespaces "backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0" not found

      Error from server (NotFound): error when creating "github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure": namespaces "backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0" not found

      Error from server (NotFound): error when creating "github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure": namespaces "backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0" not found

      Error from server (NotFound): error when creating "github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure": namespaces "backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0" not found

      Error from server (NotFound): error when creating "github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure": namespaces "backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0" not found

      : exit status 1

      {

          error: <*errors.withMessage | 0xc00090a380>{

              cause: <*errors.withStack | 0xc000830258>{

                  error: <*errors.withMessage | 0xc00090a360>{

                      cause: <*errors.withStack | 0xc000830228>{

                          error: <*errors.withMessage | 0xc00090a340>{

                              cause: <*exec.ExitError | 0xc00090a320>{

                                  ProcessState: {

                                      pid: 23290,

                                      status: 256,

                                      rusage: {

                                          Utime: {Sec: ..., Usec: ...},

                                          Stime: {Sec: ..., Usec: ...},

                                          Maxrss: 176904,

                                          Ixrss: 0,

                                          Idrss: 0,

                                          Isrss: 0,

                                          Minflt: 41783,

                                          Majflt: 0,

                                          Nswap: 0,

                                          Inblock: 0,

                                          Oublock: 133816,

                                          Msgsnd: 0,

                                          Msgrcv: 0,

                                          Nsignals: 0,

                                          Nvcsw: 11355,

                                          Nivcsw: 5676,

                                      },

                                  },

                                  Stderr: nil,

                              },

                              msg: "failed to install kibishii, stderr=# Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.\nError from server (NotFound): error when creating \"github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure\": namespaces \"backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0\" not found\nError from server (NotFound): error when creating \"github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure\": namespaces \"backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0\" not found\nError from server (NotFound): error when creating \"github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure\": namespaces \"backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0\" not found\nError from server (NotFound): error when creating \"github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure\": namespaces \"backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0\" not found\nError from server (NotFound): error when creating \"github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure\": namespaces \"backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0\" not found\nError from server (NotFound): error when creating \"github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure\": namespaces \"backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0\" not found\nError from server (NotFound): error when creating \"github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure\": namespaces \"backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0\" not found\nError from server (NotFound): error when creating \"github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure\": namespaces \"backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0\" not found\nError from server 
(NotFound): error when creating \"github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/azure\": namespaces \"backup-deletion-1-a924f5cd-2fb0-4958-9167-6a1546f648a0\" not found\n",

                          },

                          stack: [0x1e935dd, 0x1e949e5, 0x1e98650, 0x1e97aa5, 0x89a393, 0x8ae54d, 0x47b261],

                      },

                      msg: "Failed to install Kibishii worklo...

  Gomega truncated this representation as it exceeds 'format.MaxLength'.

  Consider having the object provide a custom 'GomegaStringer' representation

  or adjust the parameters in Gomega's 'format' package.

The following information will help us better understand what's going on: The error occurs because the current E2E test cases communicate with the Kubernetes API server in more than one way.

The migration cases switch between the k8s clusters with the kubectl CLI, which modifies the kubeconfig. Every CLI command that depends on ~/.kube/config picks up the switch, but the client-go clients do not.

The test case failed because kubectl had switched to the standby cluster to install Velero, while client-go created the backup target namespaces on the active cluster. As a result, the subsequent steps on the standby cluster could not find those namespaces.
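
To make the mismatch concrete, here is a toy model in Go. It is only a sketch and does not use the real client-go or kubectl code: `kubeconfig`, `cluster`, `pinnedClient`, and `cliApply` are hypothetical stand-ins, and it assumes that the API client stays bound to the cluster it was created for while the CLI resolves the current context on every call.

```go
package main

import "fmt"

// kubeconfig is a toy stand-in for ~/.kube/config; only the current context matters here.
type kubeconfig struct {
	currentContext string
}

// cluster is a toy API server that only tracks namespaces.
type cluster struct {
	namespaces map[string]bool
}

// pinnedClient models an API client bound to one cluster when it is constructed
// (an assumption made for this illustration).
type pinnedClient struct {
	target *cluster
}

func newPinnedClient(cfg *kubeconfig, clusters map[string]*cluster) *pinnedClient {
	return &pinnedClient{target: clusters[cfg.currentContext]}
}

func (c *pinnedClient) createNamespace(name string) {
	c.target.namespaces[name] = true
}

// cliApply models a CLI call: it looks up the current context on every invocation.
func cliApply(cfg *kubeconfig, clusters map[string]*cluster, namespace string) error {
	if !clusters[cfg.currentContext].namespaces[namespace] {
		return fmt.Errorf("namespaces %q not found", namespace)
	}
	return nil
}

// simulate replays the sequence described above with hypothetical names.
func simulate() error {
	clusters := map[string]*cluster{
		"default": {namespaces: map[string]bool{}},
		"standby": {namespaces: map[string]bool{}},
	}
	cfg := &kubeconfig{currentContext: "default"}
	client := newPinnedClient(cfg, clusters)

	cfg.currentContext = "standby"              // the migration case switches the context
	client.createNamespace("backup-deletion-1") // still lands on the "default" cluster
	return cliApply(cfg, clusters, "backup-deletion-1")
}

func main() {
	fmt.Println(simulate()) // prints: namespaces "backup-deletion-1" not found
}
```

In this model the namespace exists, but only on the cluster that the CLI is no longer pointing at, which matches the NotFound errors in the log above.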


blackpiglet commented 1 month ago

#8293 is a temporary workaround for this error. In the long term, we need a solution that aligns every method of communicating with the Kubernetes API server, so that all of them target the same k8s cluster.