terraform-redhat / terraform-provider-rhcs

Terraform provider for Red Hat Cloud Services
Apache License 2.0

Fixed DefaultWaitTimeoutForHCPControlPlaneInMinutes and timeouts while still installing #723

Open chrisahl opened 3 months ago

chrisahl commented 3 months ago

We are attempting to bulk-create 20 ROSA clusters at a time in the same AWS account and region. There appears to be a throttle of 13 concurrent ROSA creates, so no additional ROSA deployments start until one of those 13 completes. This is causing timeouts for us.

Is there a dynamic way to change the hard-coded value at
https://github.com/terraform-redhat/terraform-provider-rhcs/blob/58b45a1ef6d18c6ccf74d201c1801e7a9ebb5073/provider/clusterrosa/common/consts.go#L18 ?

Is there a reason for 20 minutes rather than something larger? Any other suggestions for achieving higher success rates?

Thanks.
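One way to picture the request is a default that callers can override. The following Go sketch is an assumption, not the provider's code: the constant name mirrors the one in the issue title, the 20-minute value is the one cited above, and the `resolveTimeout` helper and its override parameter are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Mirrors the constant named in the issue title; 20 is the value cited in this thread.
const DefaultWaitTimeoutForHCPControlPlaneInMinutes = 20

// resolveTimeout is a hypothetical helper: it returns the user-supplied
// timeout when one is set and positive, and the hard-coded default otherwise.
func resolveTimeout(overrideMinutes *int64) time.Duration {
	if overrideMinutes != nil && *overrideMinutes > 0 {
		return time.Duration(*overrideMinutes) * time.Minute
	}
	return DefaultWaitTimeoutForHCPControlPlaneInMinutes * time.Minute
}

func main() {
	custom := int64(60)
	fmt.Println(resolveTimeout(nil))     // falls back to the default
	fmt.Println(resolveTimeout(&custom)) // uses the override
}
```

With something along these lines, a bulk deployment that queues behind other creates could pass a larger value without the default changing for everyone else.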

willgarcia commented 1 month ago

Hi @chrisahl

I am a user of this provider and suspect I am hitting the same issue when deploying clusters in bulk.

My clusters show as ready, but Terraform fails with the following error: "Waiting for cluster creation finished with the error".

Is that the error you see?

Judging by the places in the code that emit this message, the underlying error should be appended to the end of the message, but that is not happening in my case. I would like to confirm that it is timeout related.

On a TF re-run, the clusters get deleted as well because the TF state is in error, so it takes a long time to get a lucky run.

chrisahl commented 1 month ago

@willgarcia In my case the error message reports "installing" because the wait is timing out. I think it would be good if DefaultWaitTimeoutForHCPControlPlaneInMinutes were parameterized, similar to how DefaultWaitTimeoutInMinutes can be overridden with:

    resource "rhcs_cluster_wait" "rosa_cluster" {
      cluster = rhcs_cluster_rosa_classic.rosa_sts_cluster.id
      # timeout in minutes
      timeout = 60
    }

This matters because some AWS regions take longer than others to provision, depending on your geographic location and the time of day/load.
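The timeout behaviour discussed in this thread can be sketched as a polling loop with a caller-supplied deadline. This is an illustrative sketch only, not the provider's implementation: `waitForReady`, the `poll` callback, and the state strings are hypothetical names. It shows why a wait that expires while the cluster is still provisioning would surface the last observed state, such as "installing".

```go
package main

import (
	"fmt"
	"time"
)

// waitForReady polls until the cluster reports "ready" or the timeout expires.
// Hypothetical helper; poll stands in for whatever call returns the cluster state.
func waitForReady(timeout, interval time.Duration, poll func() (string, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		state, err := poll()
		if err != nil {
			return fmt.Errorf("polling cluster state: %w", err)
		}
		if state == "ready" {
			return nil
		}
		// Give up if the next poll would land after the deadline, and
		// report the last observed state so the cause is visible.
		if time.Now().Add(interval).After(deadline) {
			return fmt.Errorf("timed out after %s, last state %q", timeout, state)
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	// Fake poller: reports "installing" twice, then "ready".
	poll := func() (string, error) {
		calls++
		if calls < 3 {
			return "installing", nil
		}
		return "ready", nil
	}
	err := waitForReady(time.Second, 10*time.Millisecond, poll)
	fmt.Println(err)
}
```

Because the timeout is an argument rather than a constant, a caller in a slower region or behind a creation queue could pass a longer value without touching the loop.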

chrisahl commented 1 week ago

https://issues.redhat.com/browse/OCM-12006 was recently opened and may help get this addressed.