Terraform provider should wait for ACTIVE state of K8s nodepool when destroying

salyh commented 2 weeks ago

Description

I am trying to provision K8s node pools. Currently I have issues with that (Ticket 207171709): my node pools were not provisioned successfully. They are either in a failed state or remain in the deploying state for hours. So I need to cancel the terraform apply run and rerun it. But then it fails because it cannot delete the errored K8s node pools. I think in that case Terraform should either wait for the ACTIVE state or do something like a forceful delete of the node pool.

│ Error: error while deleting k8s node pool xxx-xxxx-xxx-xxx-xxxx: 422 Unprocessable Entity {
│   "httpStatus" : 422,
│   "messages" : [ {
│     "errorCode" : "200",
│     "message" : "[VDC-14-1832] Operation cannot be executed until Nodepool state is ACTIVE."
│   } ]
│ }

Expected behavior

Terraform should either wait for the ACTIVE state or do something like a forceful delete of the node pool on destroy when node pools are not in the ACTIVE state. The current workaround is to run terraform apply in a loop.

Environment

Terraform version:

OpenTofu v1.7.2

Provider version:

v6.4.17

OS:

n/a

References

Internal support ticket Ticket 207171709

adeatcu-ionos commented 1 week ago

Hello! The wait mechanism is the same for all resources:

when you create a resource, Terraform waits for that resource to become AVAILABLE and then notifies the user about the successful creation;
when you delete a resource, Terraform waits for that resource to be deleted and then notifies the user about the successful deletion.

We have long default timeouts for resource creation. If a resource does not become AVAILABLE within that time, there is usually a problem, typically with the API, and that resource will most likely never become AVAILABLE. For that reason, it doesn't make sense to wait for the resource to become AVAILABLE during the destroy process.

salyh commented 1 week ago

Thank you, but that's an explanation, not a solution to my problem. Is running Terraform in a loop until it succeeds the recommended approach, or are there other possible solutions?

adeatcu-ionos commented 1 week ago

@salyh sorry for not providing some general guidelines in the first place. A fixed solution doesn't really exist; it depends on the situation.

The timeout period is long, and very few resources take a long time to provision; even for those, the provisioning time should be shorter than the timeout period. If the timeout expires, there is most likely a problem with the API, and depending on the problem, you may need to get in touch with the API teams and raise the issue there.

If an operation takes longer than expected (as you experienced with the deletion of the K8s node pool), you can check the status of that resource using the API, the DCD, or any other tool that shows the resource's status. The K8s node pool situation is a good example: you noticed that the resource was in FAILED_STATE, tried to delete it again, and it worked. Sometimes, however, that may not work.

Usually, this should not happen, and there is not much you can do besides retrying the create/delete/update request. If there is a problem with the API, another request won't do the trick, and you will need to get in touch with the specific teams responsible.

I hope this helps a little bit.

ionos-cloud / terraform-provider-ionoscloud