hashicorp / terraform-provider-kubernetes-alpha

A Terraform provider for Kubernetes that uses dynamic resource types and server-side apply. Supports all Kubernetes resources.
https://registry.terraform.io/providers/hashicorp/kubernetes-alpha/latest
Mozilla Public License 2.0

Slowness/Timeout when installing aareet/cert-manager module #182

Open superherointj opened 3 years ago

superherointj commented 3 years ago

Attempting to install the aareet/cert-manager module using Kubernetes-Alpha is slower than usual and times out after 30 minutes.

Versions

Demo Sources

https://github.com/superherointj/tfprovider-kubernetes-alpha-crd-cert-manager-timeout

Steps to Reproduce

https://github.com/superherointj/tfprovider-kubernetes-alpha-crd-cert-manager-timeout/blob/master/README.md

After some minutes...

It gets stuck here:

Plan: 39 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  ~ pool = [
      ~ {
          ~ nodes = [
              ~ {
                  ~ status      = "not_ready" -> "ready"
                    # (2 unchanged elements hidden)
                },
            ]
            # (3 unchanged elements hidden)
        },
    ]

It stays stuck there until Terraform Cloud's 30-minute timeout kicks in, which errors with:

------------ Terraform Cloud System Message ------------

WARNING: This plan has timed out and will now terminate!

Terraform Cloud enforces a 30m0s maximum run time for this operation. Please
review the logs above to determine why the run has exceeded its timeout. You
can re-run this operation by queueing a new plan in Terraform Cloud.

--------------------------------------------------------
Failed to generate JSON: wait: remote command exited without exit status or exit signal

Expectation

Notes

-- Thanks!

aareet commented 3 years ago

Thanks for opening this issue. Do you mind sharing your debug log as a gist?

superherointj commented 3 years ago

@aareet Thanks for the heads up on the logs. Sorry for the delay in answering and for failing to include the necessary data.

Docker (Arch Linux) and Nix logs available at: https://github.com/superherointj/tfprovider-kubernetes-alpha-crd-cert-manager-timeout/blob/master/logs/

It still errors with the same timeout.

I have updated the demo: https://github.com/superherointj/tfprovider-kubernetes-alpha-crd-cert-manager-timeout

Let me know if anything else is necessary.

alexsomesan commented 3 years ago

Hi, I was able to reproduce this. The root cause of the long runtime and timeout in the provider is an oversight on my side when implementing this fix: https://github.com/hashicorp/terraform-provider-kubernetes-alpha/pull/177. I should have done the same on the refresh side.

The other interesting thing I noticed is that the refresh of the module.cert-manager resources is actually triggered by the linode cluster resource being updated on the second apply. I'm not sure if that was intended in order to demonstrate the issue, but in real life that would be a problem. The cluster resource should not see any updates on the second apply in the script. Is something similar maybe happening in your real-world use case?

Regardless, I'll follow up with a PR similar to #177 for the refresh side of things. That would throw an error instead of taking a long time to time out. It would still not allow the cluster resource to be updated in the same apply as Kubernetes resources.
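
For illustration, here is a minimal sketch of the dependency chain described above. It is modelled on the linked demo rather than copied from it; the linode_lke_cluster attributes and the kubernetes-alpha provider argument used below are assumptions.

# Hypothetical sketch: the provider configuration is derived (indirectly)
# from the cluster resource, so any change planned for linode_lke_cluster
# also forces the Kubernetes resources managed through that provider to be
# refreshed/planned against a cluster that may change in the same run.
resource "linode_lke_cluster" "demo" {
  label       = "demo"
  k8s_version = "1.20"
  region      = "eu-west"

  pool {
    type  = "g6-standard-2"
    count = 3
  }
}

# Writing the kubeconfig to disk is just one possible wiring; the attribute
# names here are assumptions for illustration.
resource "local_file" "kubeconfig" {
  content  = base64decode(linode_lke_cluster.demo.kubeconfig)
  filename = "${path.module}/kubeconfig.yaml"
}

provider "kubernetes-alpha" {
  config_path = local_file.kubeconfig.filename
}

Because the provider configuration ultimately depends on resource attributes, a pending change on the cluster resource cascades into the plan/refresh of every Kubernetes resource in the same configuration.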

superherointj commented 3 years ago

It was unintentional on my part.

I incorrectly assumed that, having created the cluster first with the -target parameter, it wouldn't be a problem to keep cluster creation and Kubernetes provisioning in the same configuration. I'll separate cluster creation from Kubernetes provisioning into a different directory / project (a rough sketch of that split follows below). I'm just getting started with Terraform and am still learning best practices.

Thanks.
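
A rough sketch of that split, assuming a Terraform Cloud remote-state backend; the organization name, workspace name, and provider/attribute names are hypothetical.

# cluster/ root module: applied first, on its own.
# (linode_lke_cluster.main is defined here with the same arguments as in
# the demo; only the output is shown.)
output "kubeconfig" {
  value     = linode_lke_cluster.main.kubeconfig
  sensitive = true
}

# k8s/ root module: applied separately, after the cluster exists.
data "terraform_remote_state" "cluster" {
  backend = "remote"
  config = {
    organization = "example-org"          # hypothetical
    workspaces   = { name = "cluster" }   # hypothetical
  }
}

resource "local_file" "kubeconfig" {
  content  = base64decode(data.terraform_remote_state.cluster.outputs.kubeconfig)
  filename = "${path.module}/kubeconfig.yaml"
}

provider "kubernetes-alpha" {
  config_path = local_file.kubeconfig.filename
}

With this split, the Kubernetes configuration only reads the cluster's outputs through a data source, so plans for Kubernetes resources no longer get entangled with changes to the cluster resource itself, and -target is no longer needed to sequence the two.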

alexsomesan commented 3 years ago

The issue isn't the use of -target to isolate the creation of the cluster; that works as expected. There is a problem with the configuration of the linode_lke_cluster resource itself, which I haven't yet dug into. But even if you create the cluster in a separate apply with this configuration, running a plan after the cluster is created will still show differences because of how the cluster resource is configured.
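
As a general Terraform technique (not a confirmed diagnosis for this resource), one way to narrow this down is to run a plan immediately after an apply and note which argument keeps diffing; if a single argument is responsible, lifecycle ignore_changes can suppress it while the root cause is investigated. The argument named below is only an example.

resource "linode_lke_cluster" "main" {
  # ... existing arguments from the demo unchanged ...

  lifecycle {
    # Hypothetical: "tags" stands in for whichever argument turns out to
    # keep showing a diff on every plan; ignoring it is a stopgap, not a fix.
    ignore_changes = [tags]
  }
}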