sseago opened 4 years ago
@sseago this issue just came up again with a user.
It would be awesome if we could create an algorithm to limit the number of Stage pods per namespace to N = (number of nodes where restic is running).
As I understand the problem, exceeding N stage pods would just waste resources. Is there ever a reason you can think of to have more than N stage pods?
Alternative approach to my above suggestion for minimizing our stage pod resource consumption:
See how far we can drop the CPU and memory allocation needed for each stage pod. If each pod was consuming very minimal resources we might not need to worry about being clever with our assignment of PVCs to minimal stage pods.
I think the user issues result from the total requests/limits summed across stage pods exceeding the capacity available on src cluster nodes.
We're already attempting to minimize memory/CPU usage. By default we use:
defaultMemory = "128Mi"
defaultCPU = "100m"
unless the minimum values are higher than these, in which case we have to use more. Even lower numbers may be possible, but I don't know whether there's a point at which even the sleep pods will start failing on us.
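For illustration, here's a minimal Go sketch of that "use the default unless the pod needs more" rule. The helper name and the max-of-defaults logic are my assumptions for this sketch, not the actual mig-controller code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// stageRequests (hypothetical): use the 128Mi/100m defaults unless the
// source container already requests more, in which case keep the larger value.
func stageRequests(src corev1.ResourceList) corev1.ResourceList {
	defaults := corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("128Mi"),
		corev1.ResourceCPU:    resource.MustParse("100m"),
	}
	out := corev1.ResourceList{}
	for name, def := range defaults {
		if cur, ok := src[name]; ok && cur.Cmp(def) > 0 {
			out[name] = cur // source needs more than the default
		} else {
			out[name] = def
		}
	}
	return out
}

func main() {
	src := corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("256Mi")}
	got := stageRequests(src)
	mem, cpu := got[corev1.ResourceMemory], got[corev1.ResourceCPU]
	fmt.Println(mem.String(), cpu.String()) // 256Mi 100m
}
```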
We're currently running one stage pod per application pod (where we have pods/deployments), and one stage pod per PVC for disconnected PVCs. The problem stated in this issue is that we'll fail if we have an ROX PVC mounted in multiple pods, since we will need to convert it to RWO for restore. We might be able to do what you propose here if, instead of creating a stage pod for each running pod (or each PVC on a running pod) we come across, we generate an in-memory map from each PVC to the node its pod is currently on. Then, at the end, we generate all of the stage pods at once, creating one per node with that node's running-pod PVCs assigned to it. I guess we'd then split the remaining PVCs evenly among the stage pods: with 5 nodes and 3 remaining PVCs, each stage pod gets 0 or 1 more; with 3 nodes and 22 remaining PVCs, each stage pod gets 7 or 8 extra PVCs. On the destination cluster we may not have the same number of nodes running, but that shouldn't matter: we'll still have fewer stage pods to restore than we currently do.
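A rough Go sketch of that grouping, with all names (groupPVCs, podNode, etc.) being illustrative assumptions rather than anything in the controller: build the PVC-to-node map from running pods, dedupe shared PVCs, then deal the remaining PVCs out round-robin:

```go
package main

import "fmt"

// groupPVCs (hypothetical): assign each PVC mounted by a running pod to that
// pod's node, deduping so a shared PVC lands in exactly one stage pod, then
// deal the remaining (disconnected) PVCs out round-robin across the nodes.
func groupPVCs(podNode map[string]string, podPVCs map[string][]string, remaining []string) map[string][]string {
	byNode := map[string][]string{}
	seen := map[string]bool{}
	for pod, pvcs := range podPVCs {
		node := podNode[pod]
		for _, pvc := range pvcs {
			if seen[pvc] {
				continue // e.g. an ROX/RWX PVC mounted by several pods
			}
			seen[pvc] = true
			byNode[node] = append(byNode[node], pvc)
		}
	}
	nodes := make([]string, 0, len(byNode))
	for n := range byNode {
		nodes = append(nodes, n)
	}
	if len(nodes) == 0 {
		nodes = []string{""} // no running pods: one unpinned bucket
	}
	// With 5 nodes and 3 leftovers each bucket gets 0 or 1 more;
	// with 3 nodes and 22 leftovers each gets 7 or 8.
	for i, pvc := range remaining {
		node := nodes[i%len(nodes)]
		byNode[node] = append(byNode[node], pvc)
	}
	return byNode
}

func main() {
	podNode := map[string]string{"pod1": "node1", "pod2": "node2"}
	podPVCs := map[string][]string{"pod1": {"pvcA", "pvcB"}, "pod2": {"pvcA", "pvcC"}}
	fmt.Println(groupPVCs(podNode, podPVCs, []string{"pvcD", "pvcE"}))
}
```

(Go map iteration order is unspecified, so which node a shared PVC lands on can vary between runs; the dedupe guarantee still holds either way.)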
The immediate driver of this is the need to handle ROX volumes by initially creating them as RWO volumes on dest. This won't work if more than one stage pod mounts the PVC, which the current code allows. We also want to prevent multiple stage pods from mounting a given PVC when the volume is mounted RWX and there are multiple application pods mounting it.
Currently we have 3 stages of stage pod creation:
1) stage pods from current running pods, using the current pod as a template
2) stage pods from scaled-down Deployments, etc., using the Deployment PodTemplateSpec as a template
3) stage pods for disconnected PVCs, from a generic template
While the current stage pod creation code prevents a subsequent stage pod from being created when an existing stage pod already contains all of its mounted PVCs, we will still duplicate a PVC across stage pods in certain cases. For example, if pod1 mounts (pvcA,pvcB) and pod2 mounts (pvcA,pvcC), then stage pods are created for both, and both mount pvcA. If this is an ROX volume, creating it as RWO on dest may fail. If it's an RWX volume, it won't cause errors, but we'll waste effort backing up and restoring the volume twice.
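To make the gap concrete, here is a sketch of that subset check as I've described it above (covered is a hypothetical helper, not the real code): a stage pod is only skipped when an existing one already mounts all of its PVCs, so overlapping-but-not-subset PVC sets both get created:

```go
package main

import "fmt"

// covered reports whether some existing stage pod already mounts every PVC
// in want; only then would the new stage pod be skipped.
func covered(existing [][]string, want []string) bool {
	for _, pvcs := range existing {
		have := map[string]bool{}
		for _, p := range pvcs {
			have[p] = true
		}
		all := true
		for _, p := range want {
			if !have[p] {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

func main() {
	var stagePods [][]string
	for _, pod := range [][]string{{"pvcA", "pvcB"}, {"pvcA", "pvcC"}} {
		if !covered(stagePods, pod) {
			stagePods = append(stagePods, pod)
		}
	}
	fmt.Println(stagePods) // [[pvcA pvcB] [pvcA pvcC]] — pvcA mounted twice
}
```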
I'm proposing replacing this with a two-stage stage pod creation workflow (see the sketch after this list):
1) Stage pods from current running pods. Instead of creating one stage pod per pod, we create one per PVC, refactoring the current disconnected-PVC stage pod code to optionally take a nodeName so that RWO volumes for currently running applications are mounted on the correct node. As with the disconnected-PVC code, only create a new stage pod if one doesn't already exist for this PVC.
2) Stage pods for disconnected PVCs: unchanged from current functionality, other than refactoring out the common code for reuse above.
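A sketch of what the refactored per-PVC builder might look like. The function name, labels, and the sleep image are illustrative assumptions, not the proposed implementation: one stage pod per PVC, pinned via nodeName when the PVC belongs to a running pod, unpinned for disconnected PVCs:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildStagePod (hypothetical): build one sleep pod per PVC. A non-empty
// nodeName pins the pod so RWO volumes stay on the running pod's node;
// disconnected PVCs pass "" and let the scheduler place the pod.
func buildStagePod(ns, pvc, nodeName string) *corev1.Pod {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Namespace:    ns,
			GenerateName: "stage-" + pvc + "-",
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:    "sleep",
				Image:   "registry.access.redhat.com/ubi8/ubi-minimal", // illustrative image
				Command: []string{"sleep", "infinity"},
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "stage-vol",
					MountPath: "/var/lib/stage",
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "stage-vol",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
						ClaimName: pvc,
					},
				},
			}},
		},
	}
	if nodeName != "" {
		pod.Spec.NodeName = nodeName // pin to the running pod's node
	}
	return pod
}

func main() {
	p := buildStagePod("myapp", "pvcA", "node1")
	fmt.Println(p.GenerateName, p.Spec.NodeName)
}
```

With a builder shaped like this, stage 1 would call it with the node from the PVC-to-node map and stage 2 would call it with an empty nodeName, which is the common-code reuse described above.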