🐛 fix: autoscaling doesn't respect queued items

akozhuharov commented 4 months ago

Description

We encountered a bug where the agent pool autoscaling controller does not take into account items that are queued in TFC, so the replica count stays at the minimum number set in the agent pool. Steps to reproduce:

Create an agent pool with autoscaling set to minimum 1
Trigger runs on several workspaces

Usage Example

I have attached a script that uses parts of the controller code to highlight the difference (line 32 can be added or removed): main.go.zip

References

Depends on https://github.com/hashicorp/terraform-cloud-operator/pull/419

Community Note

Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment.

hashicorp-cla-app[bot] commented 4 months ago

All committers have signed the CLA.

arybolovlev commented 4 months ago

Hi @akozhuharov,

I was not able to reproduce this behavior. With no registered agents, runs transition to the plan_queued state.

Are there any other conditions under which a run gets stuck in the queued state?

Thanks.

akozhuharov commented 4 months ago

We had autoscaling set to 2-14 runners:

    agentTokens:
    - name: agent-pool-infra-token
    autoscaling:
      cooldownPeriodSeconds: 30
      maxReplicas: 14
      minReplicas: 2
    name: agent-pool-infra

We had 50 plans queued and the agent pool wasn't scaling beyond 2. Edit: We just upgraded to 2.5.0 and we will see how the scaling works with the syncPeriod on the agent pool.
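
To make the expected behavior concrete, here is a minimal Go sketch of the scaling we expected from the configuration above. It is not the controller's actual code: the `desiredReplicas` helper and the `countedStatuses` set are hypothetical, and which run statuses should count toward demand is an assumption made for illustration.

```go
package main

import "fmt"

// countedStatuses is an ASSUMED set of run statuses that should demand an
// agent. It is illustrative only and not taken from the operator's source.
var countedStatuses = map[string]bool{
	"pending":      true,
	"plan_queued":  true,
	"apply_queued": true,
	"planning":     true,
	"applying":     true,
}

// desiredReplicas (hypothetical helper) counts the runs whose status is in
// countedStatuses and clamps the result to [minReplicas, maxReplicas].
func desiredReplicas(statuses []string, minReplicas, maxReplicas int) int {
	n := 0
	for _, s := range statuses {
		if countedStatuses[s] {
			n++
		}
	}
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	// 50 queued plans with minReplicas: 2 and maxReplicas: 14, as above.
	runs := make([]string, 50)
	for i := range runs {
		runs[i] = "plan_queued"
	}
	fmt.Println(desiredReplicas(runs, 2, 14))
}
```

With 50 queued plans, a calculation like this would ask for the maximum of 14 replicas. If queued runs are left out of the count, the result falls back to the minimum of 2, which matches what we observed.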

arybolovlev commented 4 months ago

Hi @akozhuharov,

It looks like, in your case, you have a large number of workspaces attached to the agent pool. Due to this, effective reconciliation occurred every 15-20 minutes instead of the default 30 seconds. We made some changes in 2.5.0 that should address this issue.

We are looking forward to hearing your feedback on whether version 2.5.0 addressed the issue you faced.

Thanks!

arybolovlev commented 3 months ago

Hi @akozhuharov,

I will go ahead and close this PR. Please feel free to open an issue if you encounter this problem after upgrading to 2.5.0.

Thanks!

akozhuharov commented 3 months ago

We haven't seen the same problem since then, thanks @arybolovlev
