hashicorp / hcp-terraform-operator

Kubernetes Operator allows managing HCP Terraform resources via Kubernetes Custom Resources.
https://developer.hashicorp.com/terraform/cloud-docs
Mozilla Public License 2.0

πŸ› fix: autoscaling doesn't respect queued items #436

Closed akozhuharov closed 3 months ago

akozhuharov commented 4 months ago

Description

We encountered a bug where the agent pool autoscaling controller doesn't take generally queued runs in TFC into account, so the number of replicas stays at the minimum set in the agent pool. Steps to reproduce:

  1. Create an agent pool with autoscaling set to minimum 1
  2. Trigger runs on several workspaces

Usage Example

I have attached a script that reuses parts of the controller code and highlights the difference (line 32 can be added or removed): main.go.zip
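The attached script is not reproduced here. As a rough illustration of the behavior being described, the following Go sketch counts runs that need an agent, including runs that are still queued. The `Run` type, the `countPendingRuns` helper, and the choice of statuses in `pendingStatuses` are assumptions made for this example, not the operator's actual code; the status strings themselves are HCP Terraform run states.

```go
package main

import "fmt"

// Run is a hypothetical, trimmed-down view of an HCP Terraform run.
type Run struct {
	ID     string
	Status string
}

// pendingStatuses lists the run statuses treated as "needs an agent".
// Which statuses the controller really counts is an assumption here;
// dropping the queued entries mimics the behavior reported above.
var pendingStatuses = map[string]bool{
	"pending":      true,
	"plan_queued":  true,
	"apply_queued": true,
	"planning":     true,
	"applying":     true,
}

// countPendingRuns returns how many runs are in one of the pending statuses.
func countPendingRuns(runs []Run) int {
	n := 0
	for _, r := range runs {
		if pendingStatuses[r.Status] {
			n++
		}
	}
	return n
}

func main() {
	runs := []Run{
		{ID: "run-1", Status: "planning"},
		{ID: "run-2", Status: "plan_queued"},
		{ID: "run-3", Status: "pending"},
		{ID: "run-4", Status: "applied"},
	}
	// Three of the four runs still need an agent.
	fmt.Println(countPendingRuns(runs))
}
```

If the queued statuses were left out of the set, only the `planning` run would be counted, and the pool would have no reason to scale up.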

hashicorp-cla-app[bot] commented 4 months ago

CLA assistant check
All committers have signed the CLA.

arybolovlev commented 4 months ago

Hi @akozhuharov,

I was not able to reproduce this behavior. With no registered agents, runs transition to the plan_queued state.

Are there any other conditions for a run stuck in the queuing state?

Thanks.

akozhuharov commented 4 months ago

We had autoscaling set to 2-14 runners:

    agentTokens:
    - name: agent-pool-infra-token
    autoscaling:
      cooldownPeriodSeconds: 30
      maxReplicas: 14
      minReplicas: 2
    name: agent-pool-infra

We had 50 plans queued and the agent pool wasn't scaling beyond 2 replicas. Edit: We just upgraded to 2.5.0 and we will see how the scaling works with the syncPeriod on the agent pool.
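For reference, this is the scaling we expected from the settings above, sketched in Go. The `desiredReplicas` function is a hypothetical simplification (one agent per pending run, clamped to the configured bounds) and is not taken from the operator's source.

```go
package main

import "fmt"

// desiredReplicas is a hypothetical helper: it requests one agent per
// pending run and clamps the result to the [minReplicas, maxReplicas] range.
func desiredReplicas(pendingRuns, minReplicas, maxReplicas int32) int32 {
	if pendingRuns < minReplicas {
		return minReplicas
	}
	if pendingRuns > maxReplicas {
		return maxReplicas
	}
	return pendingRuns
}

func main() {
	// 50 queued plans with minReplicas: 2 and maxReplicas: 14.
	fmt.Println(desiredReplicas(50, 2, 14))
}
```

Under this assumption, 50 queued plans would drive the pool to 14 replicas rather than leaving it at 2.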

arybolovlev commented 4 months ago

Hi @akozhuharov,

It looks like, in your case, you have a large number of workspaces attached to the agent pool. Due to this, effective reconciliation occurred every 15-20 minutes instead of the default 30 seconds. We made some changes in 2.5.0 that should address this issue.

We are looking forward to hearing your feedback on whether version 2.5.0 addressed the issue you faced.

Thanks!

arybolovlev commented 3 months ago

Hi @akozhuharov,

I will go ahead and close this PR. Please feel free to open an issue if you encounter this problem after upgrading to 2.5.0.

Thanks!

akozhuharov commented 3 months ago

We haven't seen the same problem since then, thanks @arybolovlev