vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.4k stars 1.37k forks

Job / Pod specs of jobs should be configurable #7911

Open monotek opened 1 week ago

monotek commented 1 week ago

Describe the problem/challenge you have

When Velero runs maintenance or backup jobs, the pod spec is not configurable. As we have a lot of nodes that use taints, these workloads can't be scheduled there.

Describe the solution you'd like

We would need to adjust the tolerations and affinities for such jobs. Therefore the whole job / pod template should be configurable (we would also like to be able to configure other things such as security contexts, job history limits, and so on).

Anything else you would like to add:

Environment:

- Velero features (use `velero client config get features`): `features:` (none listed)

- Kubernetes version (use `kubectl version`):

  Client Version: v1.30.2
  Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
  Server Version: v1.28.5


- Kubernetes installer & version: AKS
- Cloud provider or hardware configuration: Standard_D16ds_v5
- OS (e.g. from `/etc/os-release`): Ubuntu

**Vote on this issue!**

This is an invitation to the Velero community to vote on issues; you can see the project's [top voted issues listed here](https://github.com/vmware-tanzu/velero/issues?q=is%3Aissue+is%3Aopen+sort%3Areactions-%2B1-desc).  
Use the "reaction smiley face" up to the right of this comment to vote.

- :+1: for "The project would be better with this feature added"
- :-1: for "This feature will not enhance the project in a meaningful way"
ywk253100 commented 5 days ago

You can use the `--dry-run` option of the `velero install` command to generate the YAML of the Velero deployment, edit it to add the tolerations, then `kubectl apply` it. The maintenance job inherits the tolerations config from the Velero deployment.
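As a sketch of this workaround (the taint key and values below are hypothetical; the actual generated manifest depends on your install flags), the tolerations would go under the Deployment's pod template:

```yaml
# Excerpt of the Deployment generated by the dry-run install,
# with tolerations added under spec.template.spec (illustrative values).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: velero
  namespace: velero
spec:
  template:
    spec:
      tolerations:
        - key: "example.com/dedicated"   # hypothetical taint key
          operator: "Equal"
          value: "backup"
          effect: "NoSchedule"
```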

monotek commented 5 days ago

But we don't want to run the Velero deployment on the same nodes as the maintenance jobs. Imho this should be configurable separately.

For example, we use spot instances in some environments, which can be restarted at any time by Azure. They are ideal for short-running jobs, but we don't want to have system-critical services on them...

Also, things like the job history limit can't be configured via the deployment.

ywk253100 commented 5 days ago

Configuring the node selector for the maintenance job is a valid use case tracked by https://github.com/vmware-tanzu/velero/issues/7758.

monotek commented 4 days ago

We have a lot of nodes, so we're working with topologySpreadConstraints and node affinities to spread services evenly over our nodes. Imho the whole pod and job spec should be configurable. Single parts like the somewhat static nodeSelector are not enough for us.

blackpiglet commented 3 days ago

Please correct me if I'm wrong. TopologySpreadConstraints may not work for the maintenance Job pod. The reason is that the maintenance Job doesn't need to specify the Parallelism parameter of the JobSpec, so there is always only one pod created for the Job, which is expected behavior.

So NodeAffinities is enough for this case. What's your opinion?

monotek commented 3 days ago

True, for a Job, node affinity together with tolerations would be used.
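As an illustration of what a configurable maintenance Job pod template could look like (the scale-set-priority label/taint is what AKS applies to spot nodes, used here as an assumed example; the container is omitted):

```yaml
# Hypothetical maintenance Job pod template, pinning the Job to spot
# nodes via node affinity and tolerating the matching taint.
apiVersion: batch/v1
kind: Job
metadata:
  name: repo-maintain-job
  namespace: velero
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "kubernetes.azure.com/scalesetpriority"  # AKS spot node label
                    operator: In
                    values: ["spot"]
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      containers:
        - name: maintenance
          image: velero/velero  # placeholder image
      restartPolicy: Never
```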

Currently we have 133 finished jobs (most of them repo-maintain-job) in the velero namespace. It would be nice if we could somehow influence that.

It might help if the jobs were created from a CronJob resource.
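For the finished-job pile-up specifically, standard Kubernetes fields would cover it if Velero exposed them: a CronJob's history limits, or `ttlSecondsAfterFinished` on plain Jobs. A sketch with illustrative values (schedule, limits, and container are assumptions, not actual Velero settings):

```yaml
# Hypothetical CronJob wrapping repo maintenance, with history limits
# so finished Jobs are garbage-collected automatically.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: repo-maintain
  namespace: velero
spec:
  schedule: "0 * * * *"            # hourly, for illustration
  successfulJobsHistoryLimit: 3    # keep only the last 3 successful Jobs
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600  # also available on plain Jobs
      template:
        spec:
          containers:
            - name: maintenance
              image: velero/velero  # placeholder image
          restartPolicy: Never
```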