rancher2_cluster_sync wait_catalogs=true causing 500 errors

armsnyder commented 3 years ago

Hi.

We are trying the new wait_catalogs=true attribute setting on our rancher2_cluster_sync resource in order to resolve this issue: rancher/terraform-provider-rancher2#627. (I believe this is the suggested fix, as simply upgrading to rancher2 Terraform provider v1.14.0 did not resolve that issue.)

With wait_catalogs=true we are getting Terraform apply failures due to a 500 error. After running Terraform, we can verify that the URL mentioned in the error is working. I think the retry count should be increased or made configurable.

resource "rancher2_cluster_sync" "this" {
  cluster_id    = rancher2_cluster.this.id
  wait_catalogs = true
}

module.stellar.rancher2_cluster_sync.this: Still creating... [10s elapsed]
module.stellar.rancher2_cluster_sync.this: Still creating... [20s elapsed]
module.stellar.rancher2_cluster_sync.this: Still creating... [30s elapsed]
module.stellar.rancher2_cluster_sync.this: Still creating... [40s elapsed]
module.stellar.rancher2_cluster_sync.this: Still creating... [50s elapsed]
Error: [ERROR] waiting for cluster ID (c-98b2w) downloading catalogs: [ERROR] getting catalog V2 list at cluster ID (c-98b2w): Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [dial tcp 127.0.0.1:6080: connect: connection refused] from [https://redacted/k8s/clusters/c-98b2w/v1]
  on ../rancher_cluster.tf line 33, in resource "rancher2_cluster_sync" "this":

What Happened

The rancher2_cluster_sync resource fails with a 500 status code when wait_catalogs=true.

What I Expected

The rancher2_cluster_sync resource should be more tolerant of errors, or the provider should make retry counts configurable.

armsnyder commented 3 years ago

As an update, we tried running terraform apply using the same config, in a fresh environment, and got a similar but different error. I can open a new issue if needed, but for now I assume this is related.

This time, instead of a connection error, it is an Unknown schema type [catalog.cattle.io.clusterrepo] error.

module.stellar.rancher2_cluster_sync.this: Still creating... [1m0s elapsed]
Error: [ERROR] waiting for cluster ID (c-wkltx) downloading catalogs: [ERROR] getting catalog V2 list at cluster ID (c-wkltx): Unknown schema type [catalog.cattle.io.clusterrepo]
  on ../rancher_cluster.tf line 33, in resource "rancher2_cluster_sync" "this":
  33: resource "rancher2_cluster_sync" "this" {

Whimby commented 3 years ago

Seeing this exact same thing:

Error: [ERROR] waiting for cluster ID (c-hg92l) downloading catalogs: [ERROR] getting catalog V2 list at cluster ID (c-hg92l): Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [dial tcp 127.0.0.1:6080: connect: connection refused] from [https://mcresearchlabs.rancher.cloud/k8s/clusters/c-hg92l/v1]
  on ../modules/rancher/eks/cluster.tf line 36, in resource "rancher2_cluster_sync" "cluster":
  36: resource "rancher2_cluster_sync" "cluster" {

rawmind0 commented 3 years ago

Hi @armsnyder, the retry logic seems to be working fine, but I agree that it should be configurable. As you mentioned, the default retries (3 retries with 5s ticks) are not enough, hence the 500 errors.

I've submitted PR https://github.com/rancher/terraform-provider-rancher2/pull/663, deprecating the retries argument in favour of a new timeout argument. The main difference is that timeout can be configured in a more intuitive way (golang duration format), and the same timeout is applied both to Rancher connection issues and to 500 and Unknown schema type errors. Please take a look.
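
As a rough illustration of the provider-level timeout argument described above, the following is a sketch only. The variable names are placeholders, and the 5m value is arbitrary rather than a recommendation:

```hcl
# Placeholder inputs; substitute however the Rancher URL and token are supplied.
variable "rancher_api_url" {
  type = string
}

variable "rancher_token_key" {
  type      = string
  sensitive = true
}

provider "rancher2" {
  api_url   = var.rancher_api_url
  token_key = var.rancher_token_key

  # Go duration string, for example "90s", "5m" or "1h30m".
  # Intended to replace the deprecated retries argument.
  timeout = "5m"
}
```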

mouellet commented 3 years ago

Could this be linked to the rancher2_cluster_sync.state_confirm attribute not being used anymore?

lperrin-obs commented 3 years ago

Hi @rawmind0, I tried setting a timeout of 10 minutes to test:

provider "rancher2" {
  api_url   = data.terraform_remote_state.mgmt_zone.outputs.rancher_api_url
  insecure  = true
  token_key = data.terraform_remote_state.mgmt_zone.outputs.rancher_admin_token_key
  timeout   = "10m"
}

But I still get the same error as @armsnyder. It doesn't seem to retry the connection after getting a 500 error:

module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m0s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m10s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m20s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m30s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m40s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m50s elapsed]
2021/05/21 15:37:17 [DEBUG] POST https://######/api/v4/projects/59026/terraform/state/dev-zone?ID=c756936d-6916-ff95-09c7-ac874f036bd3
╷
│ Error: [ERROR] waiting for cluster ID (c-p725n) downloading catalogs: [ERROR] getting catalog V2 list at cluster ID (c-p725n): Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [dial tcp 127.0.0.1:6080: connect: connection refused] from [https://######/k8s/clusters/c-p725n/v1]
│ 
│   with module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync,
│   on ../../terraform-modules/downstream-cluster/main.tf line 93, in resource "rancher2_cluster_sync" "k8s_cluster_sync":
│   93: resource "rancher2_cluster_sync" "k8s_cluster_sync" {
│ 
╵
2021/05/21 15:37:18 [DEBUG] DELETE https://#####/api/v4/projects/59026/terraform/state/dev-zone/lock
Releasing state lock. This may take a few moments...

rawmind0 commented 3 years ago

Hi @lperrin-obs, thanks for reporting this.

The catalog v2 client was not configured with the timeout when a new client was created. I've created PR #668 to fix it. Could you please test with it? https://github.com/rancher/terraform-provider-rancher2/pull/668#issuecomment-846026525

rawmind0 commented 3 years ago

Released Terraform provider v1.15.1, which includes PR #668 to fix the issue.

lperrin-obs commented 3 years ago

Hi @rawmind0, I updated the Terraform provider to v1.15.1 but I still get this error, even though the catalog shows as ready in Cluster Explorer:

module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [15m30s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [15m40s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [15m50s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [16m0s elapsed]
2021/05/26 15:58:48 [DEBUG] POST https://#####/api/v4/projects/59026/terraform/state/dev-zone?ID=eee940c2-6580-0d6d-82ea-b9a15749416d
╷
│ Error: [ERROR] waiting for cluster ID (c-sjk8k) downloading catalogs: [ERROR] getting catalog V2 list at cluster ID (c-sjk8k): Unknown schema type [catalog.cattle.io.clusterrepo]
│ 
│   with module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync,
│   on ../../terraform-modules/downstream-cluster/main.tf line 93, in resource "rancher2_cluster_sync" "k8s_cluster_sync":
│   93: resource "rancher2_cluster_sync" "k8s_cluster_sync" {
│ 
╵
2021/05/26 15:58:48 [DEBUG] DELETE https://#####/api/v4/projects/59026/terraform/state/dev-zone/lock

rawmind0 commented 3 years ago

@lperrin-obs, I was unable to reproduce the reported issue. Maybe it is a "real" timeout? The GetCatalogV2List function also handles the case where the returned error is IsUnknownSchemaType: https://github.com/rancher/terraform-provider-rancher2/blob/master/rancher2/config.go#L884

Specific timeout error messages were not added to the provider; I'm adding them in PR #678.

lperrin-obs commented 3 years ago

@rawmind0 I did several retries today and was unable to reproduce the issue, so it was probably a temporary network problem.

rawmind0 commented 3 years ago

@lperrin-obs glad to hear that. I think it was a "real" timeout, but the error message did not distinguish it. That will be added in the mentioned PR.

rawmind0 commented 3 years ago

PR https://github.com/rancher/terraform-provider-rancher2/pull/678 merged. The fix will be included in the next Terraform provider release.

Please reopen the issue if needed.

dbinkhuysen commented 2 years ago

Still seeing this issue under Terraform 1.1.7 / provider version 1.22.2. I'm actually adding catalogs after creating a cluster and use cluster_sync to ensure the cluster is up. I need to set an insane state_confirm value (currently at 100, going to try to decrease it) to make it wait long enough; otherwise I get the following error:

edit: it does not work with any value below 20

Error: Creating Catalog V2: Unknown schema type [catalog.cattle.io.clusterrepo]

   with rancher2_catalog_v2.helm_catalogs[2],
   on main.tf line 18, in resource "rancher2_catalog_v2" "helm_catalogs":
   18: resource "rancher2_catalog_v2" "helm_catalogs" {
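
For anyone trying the same workaround, the setup described in this comment might look roughly like the sketch below. It assumes a rancher2_cluster.this resource like the one in the original report. The catalog name and URL are placeholders, and 20 is simply the lowest state_confirm value reported to work above, not a verified threshold:

```hcl
# Sketch only: resource names, catalog name, and URL are hypothetical.
resource "rancher2_cluster_sync" "this" {
  cluster_id    = rancher2_cluster.this.id
  wait_catalogs = true

  # Require the cluster to be confirmed active this many times before
  # the sync is considered complete.
  state_confirm = 20
}

resource "rancher2_catalog_v2" "example" {
  # Reading cluster_id from the sync resource (rather than from
  # rancher2_cluster directly) makes Terraform wait for the sync
  # to finish before creating the catalog.
  cluster_id = rancher2_cluster_sync.this.cluster_id
  name       = "example"
  url        = "https://charts.example.com"
}
```

Whether this avoids the Unknown schema type error presumably depends on how long the downstream cluster takes to register the catalog.cattle.io types, so the value may need tuning per environment.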

son-la commented 2 years ago

I'm still seeing this issue with TF 1.3.2 and rancher provider 1.24.1.

chengksah commented 1 year ago

I am also seeing this issue. Any updates on how to fix this?

pneigel-ca commented 1 year ago

I opened a new issue @chengksah @son-la @dbinkhuysen, please post your details there if they are different and still affecting you.

rancher / terraform-provider-rancher2

rancher2_cluster_sync wait_catalogs=true causing 500 errors #662
