terraform-google-modules / terraform-google-kubernetes-engine

Configures opinionated GKE clusters
https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google
Apache License 2.0

ASM Module Fails to Apply #626

Closed PsychoSid closed 3 years ago

PsychoSid commented 4 years ago

Every night I tear down my deployment and bring it back up the following day (the names remain the same). I mention this as the failure might be due to previous credentials.

Every day since the update to the v0.11 modules, the ASM module has failed to complete correctly.

The initial run fails with:-

module.asm.module.gke_hub_registration.null_resource.run_command[0] (local-exec): kubeconfig entry generated for anthos-gke.
module.asm.module.gke_hub_registration.null_resource.run_command[0] (local-exec): Waiting for membership to be created...
module.asm.module.gke_hub_registration.null_resource.run_command[0] (local-exec): .....done.
module.asm.module.gke_hub_registration.null_resource.run_command[0] (local-exec): Created a new membership [projects/<myproject>/locations/global/memberships/gke-asm-membership] for the cluster [gke-asm-membership]
module.asm.module.gke_hub_registration.null_resource.run_command[0]: Still creating... [10s elapsed]
module.asm.module.gke_hub_registration.null_resource.run_command[0]: Still creating... [20s elapsed]
module.asm.module.gke_hub_registration.null_resource.run_command[0] (local-exec): Error in installing the Connect Agent: Failed to apply Membership CR to cluster: error: error when retrieving current configuration of:
module.asm.module.gke_hub_registration.null_resource.run_command[0] (local-exec): Resource: "hub.gke.io/v1, Resource=memberships", GroupVersionKind: "hub.gke.io/v1, Kind=Membership"
module.asm.module.gke_hub_registration.null_resource.run_command[0] (local-exec): Name: "membership", Namespace: ""
module.asm.module.gke_hub_registration.null_resource.run_command[0] (local-exec): from server for: "STDIN": Get "https://34.89.229.66/apis/hub.gke.io/v1/memberships/membership?timeout=20s": dial tcp 34.89.229.66:443: connect: connection refused

An immediate attempt to re-apply also fails:-

Error: Error running command 'PATH=/google-cloud-sdk/bin:$PATH
.terraform/modules/asm/terraform-google-kubernetes-engine-11.0.0/modules/asm/scripts/gke_hub_registration.sh gke-asm-membership europe-west3-b anthos-gke <base64-encoded service account key redacted>
': exit status 1. Output: kubeconfig entry generated for anthos-gke.
ERROR: (gcloud.container.hub.memberships.register) Failed to check if the user is a cluster-admin: The connection to the server 34.89.229.66 was refused - did you specify the right host or port?

If I then run gcloud..get-credentials and re-apply, everything is good.

I'm pretty sure I followed the update doc. Any ideas, please? Thanks.

morgante commented 4 years ago

Interesting, looks like we might need to add a timeout between cluster creation and attempting to add to the hub. /cc @bharathkkb

bharathkkb commented 4 years ago

Hi @PsychoSid, what gcloud version are you on?

PsychoSid commented 4 years ago

v305, which I believe is the latest.

PsychoSid commented 4 years ago

I looked at this again this morning when bringing up my cluster. It does seem to need a wait, as the cluster is in "RECONCILING" when the apply fails. If I wait until it's in "RUNNING" before rerunning the apply, it goes through just fine.
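
For anyone who wants to script that wait outside Terraform, here is a minimal sketch (not what the module does). It assumes `gcloud` is on the PATH with a default project set; the helper names `gcloud_cluster_status` and `wait_until_running` are made up for illustration.

```python
import subprocess
import time
from typing import Callable


def gcloud_cluster_status(name: str, zone: str) -> str:
    """Return the cluster status (e.g. RUNNING, RECONCILING) as reported by gcloud."""
    result = subprocess.run(
        ["gcloud", "container", "clusters", "describe", name,
         "--zone", zone, "--format=value(status)"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_until_running(get_status: Callable[[], str],
                       timeout_s: float = 600.0,
                       poll_s: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_status() until it returns RUNNING or timeout_s elapses.

    Returns True if the cluster reached RUNNING, False on timeout.
    The timeout and poll interval are arbitrary defaults, not tuned values.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "RUNNING":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

With the names from the log above, this would be called as `wait_until_running(lambda: gcloud_cluster_status("anthos-gke", "europe-west3-b"))` before rerunning the apply.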

Thanks.

bharathkkb commented 4 years ago

@PsychoSid I think it makes sense to wait for the cluster to reconcile before we proceed. We can probably target this once we have https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/611. I have also noticed that with smaller cluster sizes, ASM installs tend to force a master reconciliation, which might be why the cluster enters RECONCILING before the hub membership is created.

I tried an apply, destroy, apply cycle with this example, which seemed to work, but I'm happy to debug further if you can provide your config.

PsychoSid commented 4 years ago

Thanks, it's 100% reproducible for me with my config/setup (it didn't happen with v0.10, although v0.11 fixed my destroy issue!). I haven't included the .tfvars or the backend configuration here. issue626.txt

Thanks

bharathkkb commented 4 years ago

@PsychoSid We encountered something similar with ACM today, where the master was unavailable for around 1m after the CRDs were applied, producing a very similar dial tcp endpoint:443: connect: connection refused error.

I think a precondition check to make sure the endpoint is available, with a retry mechanism and backoff if it is not, might be the best approach. Happy to hear any thoughts or other ideas.
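
As a rough illustration of that idea (a sketch only, not the fix that later landed in the module), the check could be a plain TCP connect to the cluster endpoint on port 443, retried with exponential backoff. The function name `wait_for_endpoint` and all default values are assumptions.

```python
import socket
import time
from typing import Callable


def wait_for_endpoint(host: str, port: int = 443,
                      attempts: int = 6,
                      base_delay_s: float = 2.0,
                      max_delay_s: float = 60.0,
                      connect_timeout_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    On a connection error (for example "connection refused"), wait and
    retry, doubling the delay after each failure up to max_delay_s.
    Returns False if every attempt fails.
    """
    delay = base_delay_s
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=connect_timeout_s):
                return True
        except OSError:
            if attempt == attempts - 1:
                return False
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
    return False
```

A successful TCP connect only shows that the master is accepting connections again. It does not prove the API server is fully healthy, so a real precondition would probably also want an authenticated API call.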

bharathkkb commented 3 years ago

Hi @PsychoSid, I wanted to follow up on this. We had a regression, fixed by https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/669, where we were not waiting for the cluster to be ready, so I wanted to confirm whether you are still seeing this with the latest on main.

PsychoSid commented 3 years ago

Hi @bharathkkb, I haven't yet, as I tend to use the module registry paths for sources, but I will do. Many thanks.

bharathkkb commented 3 years ago

Closing this out as it should be fixed by #669. Feel free to reopen if needed.