upbound / provider-terraform

A @crossplane provider for Terraform
Apache License 2.0

Run as k8s jobs rather than as a single running pod. #189

Open JacobWeyer opened 1 year ago

JacobWeyer commented 1 year ago

What problem are you facing?

I'd like this operator to spin up workspace runs in parallel for every request, up to a user-configurable maximum parallelism.

The intention is to use this for developer environments, load testing, integration testing, and more in a very dynamic manner. The current operator appears to run workspaces sequentially by nature, and ends up being slower as a result.

How could the Official Terraform Provider help solve your problem?

By running our Terraform through this provider we can take advantage of Crossplane's flexibility combined with Helm and spin up a significant number of microservices and environments very quickly, provided they can run in parallel rather than being forced to wait on sequential execution. Sequential execution is a real bummer when we have something like RDS or DMS that can take up to 15 minutes to start up properly.

bobh66 commented 1 year ago

@JacobWeyer the default for the provider is to run one reconciliation at a time, but this is configurable via the --max-reconcile-rate argument in a ControllerConfig, and you can set it as high as you want. See https://github.com/upbound/provider-terraform/blob/main/examples/install.yaml for an example ControllerConfig.
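For reference, a minimal sketch of that setup, modeled loosely on the referenced install.yaml; the names, the rate of 10, and the package version tag are illustrative assumptions:

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: terraform-config
spec:
  args:
    # Allow up to 10 Workspace reconciliations to run concurrently.
    - --max-reconcile-rate=10
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-terraform
spec:
  package: xpkg.upbound.io/upbound/provider-terraform:v0.9.0  # illustrative version tag
  controllerConfigRef:
    name: terraform-config
```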

However, since the underlying code runs the terraform CLI, the pod will attempt to use as many CPUs as it has threads configured, so to get "true" parallel execution you would need to make sure the pod has as many CPUs available as the reconcile rate you set.
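Continuing the sketch above, one way to express that sizing is to set CPU requests in the same ControllerConfig so they match the reconcile rate (the exact values are illustrative, not a recommendation):

```yaml
spec:
  args:
    - --max-reconcile-rate=10
  resources:
    requests:
      cpu: "10"       # roughly one CPU per concurrent terraform run
      memory: 2Gi     # illustrative
    limits:
      cpu: "10"
      memory: 2Gi
```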

JacobWeyer commented 1 year ago

Will that require us to keep a massive reservation at all times rather than allowing this to be somewhat dynamic and to autoscale?

bobh66 commented 1 year ago

I'm not sure what you mean by autoscale - the pod will try to use whatever CPUs it needs, if they are available. There is no way to add more pods to the deployment, since Kubernetes controllers can only run a single active instance at a time. So your worker node needs to have the CPUs available for the pod to use, but when they aren't in use they remain available to other pods on the worker. You might be able to use something like Karpenter to scale out your nodegroup when a worker runs out of CPUs, and then scale back in when the load drops.
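As a rough illustration of that pattern, a Karpenter provisioner (v1alpha5 API shown here; the name and limits are assumptions, not anything from this provider) could cap and reclaim the burst capacity:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: terraform-burst
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  limits:
    resources:
      cpu: "64"             # cap on total CPU Karpenter may provision
  ttlSecondsAfterEmpty: 60  # scale nodes back in once they sit idle
```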

JacobWeyer commented 1 year ago

I guess I'm confused about why this was designed to run as a single instance instead of having the operator trigger each run as its own job, similar to how something like GitHub Actions works.

bobh66 commented 1 year ago

Crossplane providers are designed to be reconciling Kubernetes controllers, responsible for maintaining the state specified in the spec field of the resource manifest. That is a different paradigm from a job dispatcher.

If each CLI command were dispatched as an individual job, the jobs could take advantage of idle CPU resources on other workers, but each would still require 1 CPU to run to completion, and it would add complexity to track remote job completion so that subsequent reconciliations don't run while a process is already in flight.
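Purely to illustrate the dispatcher alternative being discussed (the provider does not do this), each reconcile could in principle launch a one-shot Job like the hypothetical sketch below; every name, image tag, and argument here is an assumption:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-apply-my-workspace   # hypothetical: one Job per Workspace run
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: terraform
          image: hashicorp/terraform:1.5   # illustrative image tag
          args: ["apply", "-auto-approve", "-input=false"]
          resources:
            requests:
              cpu: "1"   # each run still needs ~1 CPU to completion
```

The controller would then have to watch each Job for completion before allowing the next reconcile of that Workspace, which is exactly the extra bookkeeping described above.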

balu-ce commented 1 year ago

@bobh66 / @JacobWeyer could we do this in a master/worker arrangement, where the master holds the configuration and the workers are scalable replicas? Would that work?

JacobWeyer commented 1 year ago

Yeah, that makes sense @bobh66. I'm still curious whether there's a more distributed batching methodology that would be beneficial, especially at scale, beyond just running more jobs in parallel on a single operator.

negz commented 7 months ago

I'm a little wary of this idea, mostly because I'm wary of provider-terraform diverging from how all the other Crossplane providers work.