vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.61k stars 1.39k forks source link

Mark InProgress DataDownload/Upload as failed when status patch fails upon requeuing #7864

Open kaovilai opened 3 months ago

kaovilai commented 3 months ago

Extension of the solution described in #7863

Lyndon-Li commented 3 months ago

Could you add more details about this issue? E.g., when status patch fails for which CR should we mark InProgress DataDownload/Upload as failed?

kaovilai commented 3 months ago

If we decide to "remember" that it was completed in velero pod, next reconcile can retry patching to completion.

If there isn't a memory of that data download completing then it should be marked as failed.

Lyndon-Li commented 3 months ago

We cannot do this in reconciler, reasons:

  1. node-agent is a deamonset, so the dataupload/datadown controller is with multiple instances. When one instances mark a DUCR/DDCR as InProgress, other instances will receive a notification for the DUCR/DDCR with InProgress state
  2. There is a periodical enqueue mechansim for the dataupload/datadown controller, so in normal cases, the controller also receives notifications for the DUCR/DDCR with InProgress state from time to time
  3. Operations like Cancel need to update an InProgress DUCR/DDCR, so in these cases, the controller also receives notifications for the DUCR/DDCR with InProgress state

Therefore, it is hard to tell the the failed-to-patch case in the reconciler itself, I think there may be more corner cases than I can list above.

kaovilai commented 3 months ago

So retrying when patching may work better here than failing at reconcile.