vmware-tanzu / velero-plugin-for-vsphere

Plugin to support Velero on vSphere

Data upload progress hangs up in New phase #77

loktionovam opened this issue 4 years ago (Open)

loktionovam commented 4 years ago

I am trying to use this plugin, and the data upload progress hangs in the New phase without any errors.

My current installation:

How to reproduce:

kubectl get pvc -n artifacts 
NAME                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-harbor-harbor-redis-0               Bound    pvc-7ad38f5e-b249-4036-946e-71ef993e2989   1Gi        RWO            vsphere-fast   18h
data-harbor-harbor-trivy-0               Bound    pvc-74ed3888-6059-4c50-8ea1-42eca02aea4e   5Gi        RWO            vsphere-fast   18h
database-data-harbor-harbor-database-0   Bound    pvc-0501e0bb-b131-4c78-954e-862bb37768fb   2Gi        RWO            vsphere-fast   18h
harbor-harbor-chartmuseum                Bound    pvc-58177c93-e63a-4c6a-87d8-d45047afded2   5Gi        RWO            vsphere-fast   18h
harbor-harbor-jobservice                 Bound    pvc-04d9e434-d51b-47ef-8831-87b5a200eaf3   1Gi        RWO            vsphere-fast   18h
harbor-harbor-registry                   Bound    pvc-9e4a68c0-6f1a-4ca5-957b-bc85a7ac429a   5Gi        RWO            vsphere-fast   18h
export VELERO_NAMESPACE=backup-system

velero plugin add vsphereveleroplugin/velero-plugin-for-vsphere:1.0.0
An error occurred: Deployment.apps "velero" is invalid: spec.template.spec.initContainers[2].name: Duplicate value: "velero-plugin-for-vsphere"
velero plugin get                                                    
NAME                                 KIND
velero.io/crd-remap-version          BackupItemAction
velero.io/pod                        BackupItemAction
velero.io/pv                         BackupItemAction
velero.io/service-account            BackupItemAction
velero.io/aws                        ObjectStore
velero.io/add-pv-from-pvc            RestoreItemAction
velero.io/add-pvc-from-pod           RestoreItemAction
velero.io/change-pvc-node-selector   RestoreItemAction
velero.io/change-storage-class       RestoreItemAction
velero.io/cluster-role-bindings      RestoreItemAction
velero.io/crd-preserve-fields        RestoreItemAction
velero.io/job                        RestoreItemAction
velero.io/pod                        RestoreItemAction
velero.io/restic                     RestoreItemAction
velero.io/role-bindings              RestoreItemAction
velero.io/service                    RestoreItemAction
velero.io/service-account            RestoreItemAction
velero.io/aws                        VolumeSnapshotter
velero.io/vsphere                    VolumeSnapshotter
velero snapshot-location get 
NAME          PROVIDER
default       aws
vsl-vsphere   velero.io/vsphere
velero backup create my-backup1 --include-namespaces=artifacts --snapshot-volumes --volume-snapshot-locations vsl-vsphere -n backup-system
kubectl get -n backup-system uploads.veleroplugin.io -o yaml 
apiVersion: v1
items:
- apiVersion: veleroplugin.io/v1
  kind: Upload
  metadata:
    creationTimestamp: "2020-06-11T06:00:39Z"
    generation: 1
    name: upload-22ca9d7b-bc07-4fb2-a89b-f095d3372da4
    namespace: backup-system
    resourceVersion: "6537241"
    selfLink: /apis/veleroplugin.io/v1/namespaces/backup-system/uploads/upload-22ca9d7b-bc07-4fb2-a89b-f095d3372da4
    uid: 478e90e7-544f-40dd-b6a9-d5c8734647a1
  spec:
    backupTimestamp: "2020-06-11T06:00:39Z"
    snapshotID: ivd:6b26e118-4442-4e2b-8158-4543e1d894a0:22ca9d7b-bc07-4fb2-a89b-f095d3372da4
  status:
    nextRetryTimestamp: "2020-06-11T06:00:39Z"
    phase: New
    progress: {}
- apiVersion: veleroplugin.io/v1
  kind: Upload
  metadata:
    creationTimestamp: "2020-06-11T06:00:44Z"
    generation: 1
    name: upload-8c3f2e48-090b-4f06-90f4-e11e01a141f6
    namespace: backup-system
    resourceVersion: "6537266"
    selfLink: /apis/veleroplugin.io/v1/namespaces/backup-system/uploads/upload-8c3f2e48-090b-4f06-90f4-e11e01a141f6
    uid: 4ae49f43-55bd-4b6b-8ec3-0e8f18104667
  spec:
    backupTimestamp: "2020-06-11T06:00:45Z"
    snapshotID: ivd:1fb69632-cb72-4248-9d53-25cf9b4c6660:8c3f2e48-090b-4f06-90f4-e11e01a141f6
  status:
    nextRetryTimestamp: "2020-06-11T06:00:45Z"
    phase: New
    progress: {}
- apiVersion: veleroplugin.io/v1
  kind: Upload
  metadata:
    creationTimestamp: "2020-06-11T06:00:32Z"
    generation: 1
    name: upload-a14136f2-9843-44a3-b39e-652f710dcc0f
    namespace: backup-system
    resourceVersion: "6537211"
    selfLink: /apis/veleroplugin.io/v1/namespaces/backup-system/uploads/upload-a14136f2-9843-44a3-b39e-652f710dcc0f
    uid: f8b616cf-f531-4b96-a14f-4e375c881de6
  spec:
    backupTimestamp: "2020-06-11T06:00:33Z"
    snapshotID: ivd:09ddf908-1f18-4d77-910d-e1958a649052:a14136f2-9843-44a3-b39e-652f710dcc0f
  status:
    nextRetryTimestamp: "2020-06-11T06:00:33Z"
    phase: New
    progress: {}
- apiVersion: veleroplugin.io/v1
  kind: Upload
  metadata:
    creationTimestamp: "2020-06-11T06:00:21Z"
    generation: 1
    name: upload-b04dcafe-e257-4d36-8f05-d92ae754b0d3
    namespace: backup-system
    resourceVersion: "6537165"
    selfLink: /apis/veleroplugin.io/v1/namespaces/backup-system/uploads/upload-b04dcafe-e257-4d36-8f05-d92ae754b0d3
    uid: 8183c12d-ce0f-40e9-a43a-0fcfe3c9d332
  spec:
    backupTimestamp: "2020-06-11T06:00:22Z"
    snapshotID: ivd:a4f617bb-8446-491e-8352-386d4408c02b:b04dcafe-e257-4d36-8f05-d92ae754b0d3
  status:
    nextRetryTimestamp: "2020-06-11T06:00:22Z"
    phase: New
    progress: {}
- apiVersion: veleroplugin.io/v1
  kind: Upload
  metadata:
    creationTimestamp: "2020-06-11T06:00:49Z"
    generation: 1
    name: upload-b063d387-fea2-4548-9054-e7b4226609b9
    namespace: backup-system
    resourceVersion: "6537287"
    selfLink: /apis/veleroplugin.io/v1/namespaces/backup-system/uploads/upload-b063d387-fea2-4548-9054-e7b4226609b9
    uid: 7dd002ef-3887-4a57-898d-6ff2d20c1e8e
  spec:
    backupTimestamp: "2020-06-11T06:00:50Z"
    snapshotID: ivd:9e6e1200-b9d2-432c-b6e1-8c0016ef2181:b063d387-fea2-4548-9054-e7b4226609b9
  status:
    nextRetryTimestamp: "2020-06-11T06:00:50Z"
    phase: New
    progress: {}
- apiVersion: veleroplugin.io/v1
  kind: Upload
  metadata:
    creationTimestamp: "2020-06-11T06:00:26Z"
    generation: 1
    name: upload-ec2eec23-5ace-4f8d-9985-3b7bf01bf5fb
    namespace: backup-system
    resourceVersion: "6537191"
    selfLink: /apis/veleroplugin.io/v1/namespaces/backup-system/uploads/upload-ec2eec23-5ace-4f8d-9985-3b7bf01bf5fb
    uid: 07153f79-4d37-4f8b-b9ef-863aac71ca1e
  spec:
    backupTimestamp: "2020-06-11T06:00:27Z"
    snapshotID: ivd:cfec9a46-ac8a-437c-bbe9-91ba8c715e4c:ec2eec23-5ace-4f8d-9985-3b7bf01bf5fb
  status:
    nextRetryTimestamp: "2020-06-11T06:00:27Z"
    phase: New
    progress: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

velero.log

lintongj commented 4 years ago

The New phase means that either the upload controllers for the Upload CRD were not brought up properly or they are not working as expected. Based on the information provided above, I cannot confirm the root cause.

Once velero-plugin-for-vsphere is installed, a DaemonSet of pods, "data-manager-XXX", is brought up in the same namespace as the velero pod.

Hi @loktionovam, would you please verify this and share the logs of the "data-manager-XXX" pods?
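
For what it's worth, a rough way to surface those pods and grab their logs would be something like this (a sketch — the namespace is assumed to be the one Velero runs in, and the exact pod name prefix may differ):

# Look for the plugin's DaemonSet and its pods in the Velero namespace.
kubectl -n backup-system get daemonset
kubectl -n backup-system get pods | grep -E 'data-manager|datamgr'
# Collect logs from one of the data manager pods (placeholder name).
kubectl -n backup-system logs <one-of-the-data-manager-pods>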

loktionovam commented 4 years ago

> The New phase means that either the upload controllers for the Upload CRD were not brought up properly or they are not working as expected. Based on the information provided above, I cannot confirm the root cause.
>
> Once velero-plugin-for-vsphere is installed, a DaemonSet of pods, "data-manager-XXX", is brought up in the same namespace as the velero pod.
>
> Hi @loktionovam, would you please verify this and share the logs of the "data-manager-XXX" pods?

Hi @lintongj, I can't find any data-manager-XXX pods. When I enable the plugin:

velero plugin add vsphereveleroplugin/velero-plugin-for-vsphere:1.0.0

the main velero pod restarts, vsphereveleroplugin/velero-plugin-for-vsphere:1.0.0 appears in its initContainers, and that is all.

kubectl get pods -n velero
NAME                      READY   STATUS    RESTARTS   AGE
velero-596dd56ff9-ckwlw   1/1     Running   0          114m
  initContainers:
  - image: velero/velero-plugin-for-aws:v1.0.1
    imagePullPolicy: IfNotPresent
    name: velero-plugin-for-aws
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /target
      name: plugins
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: velero-server-token-8g297
      readOnly: true
  - image: vsphereveleroplugin/velero-plugin-for-vsphere:1.0.0
    imagePullPolicy: IfNotPresent
    name: velero-plugin-for-vsphere
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /target
      name: plugins
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: velero-server-token-8g297
      readOnly: true
lintongj commented 4 years ago

Why would you run export VELERO_NAMESPACE=backup-system while the velero pod in your case is running in the velero namespace?

What is the backup-system namespace used for? Would you please share the output of kubectl -n backup-system get all?

Also, would you please share your velero deployment: kubectl -n velero get deploy/velero -o yaml

loktionovam commented 4 years ago

Sorry, there was some configuration drift from when I tried to install velero into its default namespace, velero (with no luck). I have now reverted the configuration to what is described in this issue:

kubectl get pods -n backup-system
NAME                     READY   STATUS    RESTARTS   AGE
minio-6c685bd979-4c2bv   1/1     Running   0          29h
velero-794555cbb-nzghc   1/1     Running   0          5m44s
kubectl -n backup-system  get deploy/velero -o yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    meta.helm.sh/release-name: velero
    meta.helm.sh/release-namespace: backup-system
  creationTimestamp: "2020-06-11T18:02:36Z"
  generation: 2
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    helm.sh/chart: velero-2.12.0
  name: velero
  namespace: backup-system
  resourceVersion: "6716531"
  selfLink: /apis/apps/v1/namespaces/backup-system/deployments/velero
  uid: c9f76c19-3033-4ecf-bb01-81c5237bf32e
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: velero
      app.kubernetes.io/name: velero
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8085"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: velero
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: velero
        helm.sh/chart: velero-2.12.0
        name: velero
    spec:
      containers:
      - args:
        - server
        command:
        - /velero
        env:
        - name: VELERO_SCRATCH_DIR
          value: /scratch
        - name: VELERO_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: LD_LIBRARY_PATH
          value: /plugins
        - name: AWS_SHARED_CREDENTIALS_FILE
          value: /credentials/cloud
        image: velero/velero:v1.4.0
        imagePullPolicy: IfNotPresent
        name: velero
        ports:
        - containerPort: 8085
          name: monitoring
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /plugins
          name: plugins
        - mountPath: /credentials
          name: cloud-credentials
        - mountPath: /scratch
          name: scratch
      dnsPolicy: ClusterFirst
      initContainers:
      - image: velero/velero-plugin-for-aws:v1.0.1
        imagePullPolicy: IfNotPresent
        name: velero-plugin-for-aws
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /target
          name: plugins
      - image: vsphereveleroplugin/velero-plugin-for-vsphere:1.0.0
        imagePullPolicy: IfNotPresent
        name: velero-plugin-for-vsphere
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /target
          name: plugins
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: velero-server
      serviceAccountName: velero-server
      terminationGracePeriodSeconds: 30
      volumes:
      - name: cloud-credentials
        secret:
          defaultMode: 420
          secretName: velero
      - emptyDir: {}
        name: plugins
      - emptyDir: {}
        name: scratch
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2020-06-11T18:02:39Z"
    lastUpdateTime: "2020-06-11T18:02:39Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2020-06-11T18:02:36Z"
    lastUpdateTime: "2020-06-11T18:04:43Z"
    message: ReplicaSet "velero-794555cbb" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 2
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
loktionovam commented 4 years ago

I think I have found the root cause:

kubectl  get events 
LAST SEEN   TYPE      REASON              OBJECT                                 MESSAGE
12m         Warning   FailedCreate        daemonset/datamgr-for-vsphere-plugin   Error creating: pods "datamgr-for-vsphere-plugin-" is forbidden: error looking up service account backup-system/velero: serviceaccount "velero" not found

I deployed velero via the official helm chart:

kubectl get sa
NAME                             SECRETS   AGE
default                          1         29h
minio                            1         29h
minio-update-prometheus-secret   1         29h
velero-server                    1         18m
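
A quick way to confirm which ServiceAccount the DaemonSet is trying to use (a sketch, using the DaemonSet name from the event above):

kubectl -n backup-system get daemonset datamgr-for-vsphere-plugin \
  -o jsonpath='{.spec.template.spec.serviceAccountName}'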
lintongj commented 4 years ago

I wonder whether you explicitly set the ServiceAccountName to "velero-server" while installing velero via the helm chart. By default, it is "velero", according to https://github.com/vmware-tanzu/velero/blob/a5346c1a87c91788aeb3e2e03be7f42ebc23d95c/pkg/install/deployment.go#L150. Would you please share what you did to install velero via the helm chart, so that we can reproduce the issue and incorporate it into our test cases?

Meanwhile, it does expose a bug in velero-plugin-for-vsphere: we hardcoded the ServiceAccountName to the default one. In the default case this works as expected. However, if users explicitly configure a customized ServiceAccount/ServiceAccountName for the velero pod instead of the default one, the pods in daemonset/datamgr-for-vsphere-plugin cannot be brought up as expected.

This is an issue we need to resolve after the 1.0.1 release (release 1.0.1 is coming soon). Until the fix is merged and released, users are strongly recommended to either keep the default ServiceAccount/ServiceAccountName in the velero pod or explicitly create a ServiceAccount named "velero" if there is none.
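
For reference, a minimal sketch of that workaround, assuming Velero runs in the backup-system namespace (the new ServiceAccount may also need the same RBAC as velero-server; the binding name below is made up for illustration):

# Create the ServiceAccount the plugin's DaemonSet currently expects.
kubectl -n backup-system create serviceaccount velero

# Optionally grant it the same cluster-wide permissions the Velero server has
# (adjust the ClusterRole to match your RBAC setup).
kubectl create clusterrolebinding velero-datamgr \
  --clusterrole=cluster-admin \
  --serviceaccount=backup-system:velero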

loktionovam commented 4 years ago

I didn't set the service account. This is my helm chart configuration:

credentials:
    useSecret: true
    secretContents:
        cloud: |
            [default]
            aws_access_key_id = access_key_here
            aws_secret_access_key = secret_access_key_here

configuration:
  provider: aws
  backupStorageLocation:
    name: default
    bucket: velero
    provider: aws
    config:
      region: minio
      s3ForcePathStyle: true
      s3Url: http://minio.backup-system.svc.devel.pro:9000

snapshotsEnabled: true

deployRestic: false

initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.0.1
    volumeMounts:
      - mountPath: /target
        name: plugins

The service account name is templated via the velero.serverServiceAccount helper function here:

https://github.com/vmware-tanzu/helm-charts/blob/709787fabb658891ec56ed29e50f9d1a49178b42/charts/velero/templates/serviceaccount-server.yaml#L5

{{- if .Values.serviceAccount.server.create }}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "velero.serverServiceAccount" . }}

velero.serverServiceAccount helper function code:

https://github.com/vmware-tanzu/helm-charts/blob/709787fabb658891ec56ed29e50f9d1a49178b42/charts/velero/templates/_helpers.tpl#L37

{{- define "velero.serverServiceAccount" -}}
{{- if .Values.serviceAccount.server.create -}}
    {{ default (printf "%s-%s" (include "velero.fullname" .) "server") .Values.serviceAccount.server.name }}
{{- else -}}
    {{ default "default" .Values.serviceAccount.server.name }}
{{- end -}}
{{- end -}}

Helm chart default values related to the serviceAccount:

https://github.com/vmware-tanzu/helm-charts/blob/709787fabb658891ec56ed29e50f9d1a49178b42/charts/velero/values.yaml#L174

serviceAccount:
  server:
    create: true
    name:
    annotations:

So with the default serviceAccount values from the helm chart, the service account name is rendered via {{ default (printf "%s-%s" (include "velero.fullname" .) "server") .Values.serviceAccount.server.name }} and becomes velero-server.
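
As a chart-side stopgap, the name could be pinned to what the plugin currently expects, for example (a sketch — the chart repo alias, release name, and namespace are assumptions):

# Rename the server ServiceAccount to "velero" and keep the rest of the values.
helm upgrade velero vmware-tanzu/velero \
  -n backup-system \
  --reuse-values \
  --set serviceAccount.server.name=velero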

loktionovam commented 4 years ago

My suggestion is to add a "how it works" section (something along the lines of https://blogs.vmware.com/opensource/2020/04/17/velero-plug-in-for-vsphere/) to the README, where daemonset/datamgr-for-vsphere-plugin would be described.

lintongj commented 4 years ago

@loktionovam Thanks for exploring the helm chart configuration. On one hand, we will add this guidance to our documentation (FAQ.md) for the released versions. On the other hand, we will fix this issue in our next release.