vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

datamovement should wait for node autoscaling #7910

Closed: monotek closed this issue 3 days ago

monotek commented 1 week ago

What steps did you take and what happened:

A lot of our data movement uploads are canceled with the reason below:

```
k -n velero get datauploads.velero.io -o yaml regional-backup-20240621030048-4qgww

...
status:
  completionTimestamp: "2024-06-21T03:34:18Z"
  message: 'found a dataupload velero/regional-backup-20240621030048-4qgww with expose
    error: Pod is unschedulable: 0/19 nodes are available: 1 node(s) had untolerated
    taint {node.kubernetes.io/network-unavailable: }, 3 node(s) had untolerated taint
    {sku: gpu}, 6 node(s) had untolerated taint {kubernetes.azure.com/scalesetpriority:
    spot}, 9 node(s) exceed max volume count. preemption: 0/19 nodes are available:
    10 Preemption is not helpful for scheduling, 9 No preemption victims found for
    incoming pod... mark it as cancel'
  phase: Canceled
```

What did you expect to happen:

As we have node autoscaling enabled, temporarily unschedulable pods should not be a problem: the Azure cluster autoscaler adds another node after a short time (< ~5 min).

The Velero data upload pod should therefore stay in Pending for a configurable timeout, so that it can be scheduled once the cluster has scaled up.
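
A minimal sketch of the requested behaviour, assuming a client-go based exposer; the helper name `waitForBackupPodScheduled`, the 10-second poll interval, and the timeout value are illustrative assumptions, and this is not Velero's actual exposer code nor the fix applied in #7898:

```go
package exposer

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForBackupPodScheduled polls the data-mover backup pod until it is
// scheduled or the timeout expires, instead of canceling the DataUpload on the
// first "Unschedulable" condition. This gives the cluster autoscaler time
// (typically a few minutes on AKS) to add a node.
func waitForBackupPodScheduled(ctx context.Context, client kubernetes.Interface, namespace, podName string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionTrue {
				// A node became available (possibly after scale-up); proceed with the upload.
				return true, nil
			}
		}
		// Still pending/unschedulable: keep waiting so the autoscaler can react.
		return false, nil
	})
}
```

Calling something like this with a timeout of around 10 minutes before giving up on the expose step would cover the "< ~5 min" scale-up window described above, while still failing reasonably quickly for pods that can never be scheduled.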

Environment:

- Velero features (use `velero client config get features`): 

```
velero client config get features
features:
```

- Kubernetes version (use `kubectl version`):

```
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5
```


- Kubernetes installer & version: AKS
- Cloud provider or hardware configuration: Azure (Standard_D16ds_v5)
- OS (e.g. from `/etc/os-release`): Ubuntu

**Vote on this issue!**

This is an invitation to the Velero community to vote on issues, you can see the project's [top voted issues listed here](https://github.com/vmware-tanzu/velero/issues?q=is%3Aissue+is%3Aopen+sort%3Areactions-%2B1-desc).  
Use the "reaction smiley face" up to the right of this comment to vote.

- :+1: for "I would like to see this bug fixed as soon as possible"
- :-1: for "There are more important bugs to focus on right now"
sseago commented 1 week ago

Looks like another example of the bug identified here: https://github.com/vmware-tanzu/velero/issues/7898

Lyndon-Li commented 3 days ago

Fixed by the change described in #7898. Closing.