vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Fail early DM pod checks have false positives #7898

Closed sseago closed 3 days ago

sseago commented 1 week ago

What steps did you take and what happened: In Velero 1.13, we added code to fail early if the data mover pods are in an unrecoverable state: https://github.com/vmware-tanzu/velero/pull/7052. In 1.14, some additional changes were made here: https://github.com/vmware-tanzu/velero/pull/7437

In our OADP testing with Velero 1.14, we found that in some cases DDs or DUs were being canceled almost immediately. Digging into the code added in the above two PRs, there are a few situations where Velero considers a pod unrecoverable:

1) pod phase of Failed or Unknown
2) pod phase of Pending with an "unschedulable" condition
3) container status with ImagePullBackOff or ErrImageNeverPull
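For illustration, here is a rough Go sketch of that kind of unrecoverable-state check. This is not Velero's actual code; the helper name and return shape are assumptions.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
)

// isPodUnrecoverable sketches the three situations above: a terminal
// pod phase, a Pending pod with an Unschedulable condition, and a
// container stuck on an image pull error.
func isPodUnrecoverable(pod *corev1.Pod) (bool, string) {
	// 1) Pod phase of Failed or Unknown.
	if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodUnknown {
		return true, "pod is in phase " + string(pod.Status.Phase)
	}
	// 2) Pod phase of Pending with an "unschedulable" condition.
	if pod.Status.Phase == corev1.PodPending {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodScheduled &&
				cond.Status == corev1.ConditionFalse &&
				cond.Reason == corev1.PodReasonUnschedulable {
				return true, "pod is unschedulable: " + cond.Message
			}
		}
	}
	// 3) Container status with ImagePullBackOff or ErrImageNeverPull.
	for _, cs := range pod.Status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil &&
			(w.Reason == "ImagePullBackOff" || w.Reason == "ErrImageNeverPull") {
			return true, "image pull error: " + w.Message
		}
	}
	return false, ""
}
```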

It was the second situation that was causing the failure for us. The pod will have an "unschedulable" condition until the PVC is bound to a PV. In many cases this all happens quickly enough that by the time Velero checks, the PVC is bound and the Pod is no longer unschedulable. But if the provisioner takes a few seconds to bind the PV and PVC together, the Pod may still have an "unschedulable" condition when the node agent first checks, which causes the DU or DD to be immediately canceled.

We are seeing this happen frequently with Ceph volumes.

We have a couple of options here.

1) The simplest option would be to remove "unschedulable" from the list of unrecoverable conditions. As we see here, "unschedulable" does not always indicate an unrecoverable state. The only downside here is that we lose fast fail for permanently-unschedulable pods.

2) Slightly more complicated -- we could wrap the unschedulable check in a PollUntilContextTimeout call with some hard-coded timeout (2 minutes, maybe), which would give us an almost-fast fail for unschedulable pods; see the sketch below. This assumes that no provisioner should take more than 2 minutes to provision the PV, which is probably safe.
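A minimal sketch of option 2, assuming an illustrative 5-second poll interval and a 2-minute timeout; the function and parameter names are hypothetical, not Velero's actual API.

```go
package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitOutUnschedulable polls until the pod sheds its Unschedulable
// condition or the timeout elapses; only a timeout would then be
// treated as an unrecoverable state.
func waitOutUnschedulable(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err // surface API errors immediately
			}
			for _, cond := range pod.Status.Conditions {
				if cond.Type == corev1.PodScheduled &&
					cond.Status == corev1.ConditionFalse &&
					cond.Reason == corev1.PodReasonUnschedulable {
					return false, nil // still unschedulable, keep polling
				}
			}
			return true, nil // scheduled (or no Unschedulable condition)
		})
}
```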

I will be posting a draft PR for option 1) -- I'm open to moving to 2) if the consensus is that it's a better solution.

Environment:


Lyndon-Li commented 1 week ago

@sseago

> if the provisioner takes a few seconds to bind the PV and PVC together, the Pod may still have an "unschedulable" condition

During data mover restore, I think this happens for almost every volume -- the workload pod and PVC are created first, but the workload PVC is not bound to any PV until the DDCR completes, which doesn't happen within seconds or minutes. However, we didn't see this problem during testing.

Or do you mean this is an intermediate moment, when the PV has been provisioned and it takes some seconds to bind the PVC and PV?

sseago commented 1 week ago

It is definitely an intermediate scenario. I think some provisioners are fast enough that the current code doesn't force cancellation of the DU/DD, but with Ceph as a CSI provisioner we are seeing frequent situations where DUs or DDs get canceled because the "unschedulable" pod state is treated as "unrecoverable" by Velero even though it isn't.

kaovilai commented 1 week ago

I personally think ignoring unschedulable makes the most sense. An admin (or autoscaler) can scale up more nodes etc. for scheduling to resolve. Timing out at 1 minute and marking the restore failed can be misleading: Velero did everything right, the infra was just not yet available for scheduling, but could be totally fine 30 minutes later.

Lyndon-Li commented 1 week ago

Some background on the pod unrecoverable check: this was introduced by the data mover backup node selection, as described here.

This is not a must-have for node selection; it just makes the backup fail earlier. But if we want to keep this benefit, one possible way is to check the message that accompanies the Unrecoverable status more closely:

message: '0/2 nodes are available: 1 node(s) didn''t match Pod''s node affinity/selector,
        1 node(s) had volume node affinity conflict. preemption: 0/2 nodes are available:
        2 Preemption is not helpful for scheduling..'

@sseago Could you share the message from the same place when you see the problem with Ceph? Let's see if we can differentiate based on the messages. E.g., can we just filter on "didn't match Pod's node affinity/selector" in the message?

sseago commented 1 week ago

@Lyndon-Li here is the error we saw with Ceph:

 message: 'found a dataupload openshift-adp/backup20-llp79 with expose error: Pod
  is unschedulable: 0/6 nodes are available: pod has unbound immediate PersistentVolumeClaims.
  preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling...
  mark it as cancel'

The "unbound immediate PersistentVolumeClaims" part is probably the most relevant part of the message. Unbound PVC is a temporary error.

Lyndon-Li commented 1 week ago

@sseago Then I think we can preserve the benefit by doing more checks on the message, though it is a little bit tricky (when the message is changed by Kubernetes, we also need to change ours), specifically:

What do you think?

cc @reasonerjt

sseago commented 1 week ago

@Lyndon-Li Looks like we have another bug report describing the same thing -- this time with node autoscaling, i.e. no nodes are available initially, and after a short wait new nodes are made available, but by then the DU has already been canceled: https://github.com/vmware-tanzu/velero/issues/7910

sseago commented 1 week ago

@Lyndon-Li given that even "no nodes available" isn't necessarily permanent, I'm wondering whether the message-parsing approach might be too error-prone, missing edge cases, etc.

Lyndon-Li commented 5 days ago

We discussed this issue further and realized that there is no reliable way to filter out unschedulable problems caused by users' data mover node selection. Therefore, we suggest simply removing the "UnReachable" container message check and letting the DU/DD be cancelled after the 30-minute timeout. This should be what https://github.com/vmware-tanzu/velero/pull/7899 is doing, plus the data mover node selection design change.

@sseago Let us know your thoughts.

sseago commented 4 days ago

@Lyndon-Li so it sounds like you're saying the change in #7899 is appropriate? We still fail early for ImagePullBackOff or Failed pods, but for unschedulable pods we just let the 30-minute timeout take effect? I'm marking that PR as ready for review now -- we can approve/merge it or continue to discuss changes needed there.
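A condensed sketch of the resulting behavior (hypothetical helper, not the actual #7899 diff): the Unschedulable branch is gone, so pending pods run into the normal 30-minute timeout instead of being canceled early.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
)

// isPodUnrecoverable, after the change: only terminal phases and image
// pull errors fail early; unschedulable pods are left to the timeout.
func isPodUnrecoverable(pod *corev1.Pod) bool {
	if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodUnknown {
		return true
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil &&
			(w.Reason == "ImagePullBackOff" || w.Reason == "ErrImageNeverPull") {
			return true
		}
	}
	// No Unschedulable check here: a pending pod now simply waits out
	// the 30-minute timeout rather than canceling the DU/DD.
	return false
}
```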

Lyndon-Li commented 4 days ago

Yes, #7899 has been merged. Meanwhile, I will submit a PR to change the node-selection design.

sseago commented 3 days ago

@Lyndon-Li reopening for 1.14.1 -- let me know if it's better to create a new issue instead.

kaovilai commented 3 days ago

IMO, cherry-picks don't have to reference a still-open issue; one could have cherry-picks that close an already-closed issue for a different release branch.

Lyndon-Li commented 3 days ago

> @Lyndon-Li reopening for 1.14.1 -- let me know if it's better to create a new issue instead.

No problem, we can use the same issue to track everything (it was auto-closed when merging #7899).

Lyndon-Li commented 3 days ago

Closing this issue, as everything being tracked has been completed.