hashicorp / terraform-cloud-operator

Kubernetes Operator allows managing HCP Terraform resources via Kubernetes Custom Resources.
https://developer.hashicorp.com/terraform/cloud-docs
Mozilla Public License 2.0

🚀 AgentPool has separate levers for `cooldownPeriodSeconds` and `scalingPeriodSeconds` #341

Open starlightromero opened 5 months ago

starlightromero commented 5 months ago

Description

Currently cooldownPeriodSeconds affects the time to wait between scaling events. It would be useful if the time to wait between scaling events could be detached from the time the agents stick around after a run.

I propose that scalingPeriodSeconds be the time to wait between scaling events, and that cooldownPeriodSeconds be the time to wait after a run finishes before the scalingPeriodSeconds timer starts.

Potential YAML Configuration

apiVersion: app.terraform.io/v1alpha2
kind: AgentPool
metadata:
  name: this
  namespace: default
spec:
  organization: kubernetes-operator
  token:
    secretKeyRef:
      name: tfc-operator
      key: token
  name: agent-pool-demo
  agentTokens:
    - name: white
    - name: blue
    - name: red
  agentDeployment:
    replicas: 3
    spec:
      containers:
        - name: tfc-agent
          image: "hashicorp/tfc-agent:latest"
  autoscaling:
    minReplicas: 2
    maxReplicas: 4
    cooldownPeriodSeconds: 300
    scalingPeriodSeconds: 30
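
One possible reading of the two fields above, sketched in Python. This is an illustration only, not the operator's implementation: `scalingPeriodSeconds` is the proposed (not yet existing) field, and the `Autoscaling` class, the `next_replicas` function, and the rule that the desired replica count equals the number of pending runs are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Autoscaling:
    # Field names mirror the YAML above; scaling_period_seconds is the proposed field.
    min_replicas: int
    max_replicas: int
    cooldown_period_seconds: int
    scaling_period_seconds: int


def next_replicas(cfg, current, pending_runs, now, last_scale_at, last_run_done_at):
    """Return the replica count to apply at time `now` (all times in seconds)."""
    # Throttle: at most one scaling event per scaling period.
    if now - last_scale_at < cfg.scaling_period_seconds:
        return current
    # Assumption: one agent per pending run, clamped to the configured bounds.
    desired = max(cfg.min_replicas, min(cfg.max_replicas, pending_runs))
    if desired > current:
        return desired  # scale out as soon as the scaling period allows
    if desired < current and now - last_run_done_at >= cfg.cooldown_period_seconds:
        return desired  # scale in only after agents have idled for the cooldown
    return current


cfg = Autoscaling(min_replicas=2, max_replicas=4,
                  cooldown_period_seconds=300, scaling_period_seconds=30)
```

Under this reading, scale-out is gated only by the short scaling period, while scale-in additionally waits for the longer cooldown measured from the end of the last run.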

References

N/A

sheneska commented 5 months ago

Hi @starlightromero, could you please provide more context on what exactly you are asking re cooldownPeriodSeconds?

briantist commented 5 months ago

@sheneska another way to put it might be to have separate cooldowns for scale-out vs scale-in.

It's of particular concern when scaling to zero, because no agents will be launched until the cooldown period expires, no matter how big the queue is.

Right now, we work around it by having a very short cooldown period (like 1 minute), so that if there are no agents it takes at most a minute to launch one.

The downside of this is that agents disappear very quickly after a run, and having to relaunch one takes a bit of time, so it adds delay to the next run.

Ideally, after being launched, an agent sticks around for a bit, maybe 30 minutes or whatever, so that subsequent runs have an available agent to use. But if we set cooldown to 30 minutes, and it scales to zero, and then 2 minutes later we have another run, that run will wait for 28 minutes before another agent is launched.

So the ability to have asymmetrical cooldown times would be especially helpful: we want to be able to quickly scale-out in response to load, and scale-in more slowly to reduce latency for runs that start closely in time.
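
The 28-minute wait described above can be worked through with a small Python sketch. The function name `scale_out_wait` and the idea of a separate scale-out cooldown are hypothetical, used only to illustrate the arithmetic; they are not existing operator settings.

```python
def scale_out_wait(cooldown_seconds, seconds_since_last_scale):
    """Seconds a queued run waits before a scale-out is permitted."""
    return max(0, cooldown_seconds - seconds_since_last_scale)


# One shared cooldown of 30 minutes; a run arrives 2 minutes after scaling to zero.
shared = scale_out_wait(30 * 60, 2 * 60)

# Hypothetical asymmetric setup: 1-minute scale-out cooldown,
# with the 30-minute value applied only to scale-in.
asymmetric = scale_out_wait(1 * 60, 2 * 60)

print(shared // 60, asymmetric // 60)  # prints "28 0"
```

With a single shared cooldown the run waits 28 minutes for an agent; with a short scale-out cooldown it can be served immediately, while the long scale-in cooldown still keeps agents warm between runs.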

marianopeterson commented 4 months ago

> the ability to have asymmetrical cooldown times would be especially helpful: we want to be able to quickly scale-out in response to load, and scale-in more slowly to reduce latency for runs that start closely in time.

It's important to me to be able to manage time to scale up independently from time to scale down, so that I can better manage the tradeoff between cost control and user experience.

alexsomesan commented 2 months ago

Thanks for the additional context. It's really helping us understand the impact of this potential change. We've included it as a candidate for our next round of planning.