What steps did you take and what happened:
A lot of our data movement uploads are canceled, with the reason below:
k -n velero get datauploads.velero.io -o yaml regional-backup-20240621030048-4qgww
...
status:
  completionTimestamp: "2024-06-21T03:34:18Z"
  message: 'found a dataupload velero/regional-backup-20240621030048-4qgww with expose
    error: Pod is unschedulable: 0/19 nodes are available: 1 node(s) had untolerated
    taint {node.kubernetes.io/network-unavailable: }, 3 node(s) had untolerated taint
    {sku: gpu}, 6 node(s) had untolerated taint {kubernetes.azure.com/scalesetpriority:
    spot}, 9 node(s) exceed max volume count. preemption: 0/19 nodes are available:
    10 Preemption is not helpful for scheduling, 9 No preemption victims found for
    incoming pod... mark it as cancel'
  phase: Canceled
What did you expect to happen:
As we have node autoscaling enabled, temporarily unschedulable pods should not be a problem: the Azure cluster autoscaler adds another node after a short time (< ~5 min).
So in this case the Velero data upload pod should stay in the Pending state for a configurable timeout, so that it can be scheduled once the cluster has scaled up.
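For illustration, the requested behavior could be exposed as a node-agent configuration field. The field name below is hypothetical (it does not exist in Velero today); it only sketches what a configurable pending timeout for the data mover (expose) pod might look like:

```yaml
# Hypothetical sketch -- `dataMoverPodPendingTimeout` is NOT an existing
# Velero option; it illustrates the configurable timeout requested above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-agent-config
  namespace: velero
data:
  # How long an expose (data mover) pod may remain Pending before the
  # DataUpload is marked Canceled. A value comfortably above the cluster
  # autoscaler's scale-up latency (~5 min on AKS here) would let the pod
  # be scheduled onto a newly provisioned node instead of being canceled.
  dataMoverPodPendingTimeout: "10m"
```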
Environment:
- Velero version (use `velero version`):
- Velero features (use `velero client config get features`): none (empty `features:` output)
- Kubernetes version (use `kubectl version`):
  Client Version: v1.30.2
  Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
  Server Version: v1.28.5
- Kubernetes installer & version: AKS
- Cloud provider or hardware configuration: Azure, Standard_D16ds_v5
- OS (e.g. from `/etc/os-release`): Ubuntu
**Vote on this issue!**
This is an invitation to the Velero community to vote on issues, you can see the project's [top voted issues listed here](https://github.com/vmware-tanzu/velero/issues?q=is%3Aissue+is%3Aopen+sort%3Areactions-%2B1-desc).
Use the "reaction smiley face" up to the right of this comment to vote.
- :+1: for "I would like to see this bug fixed as soon as possible"
- :-1: for "There are more important bugs to focus on right now"