Closed armsnyder closed 3 years ago
As an update, we tried running terraform apply using the same config, in a fresh environment, and got a similar but different error. I can open a new issue if needed, but for now I assume this is related.
This time, instead of a connection error, it is an Unknown schema type [catalog.cattle.io.clusterrepo] error.
module.stellar.rancher2_cluster_sync.this: Still creating... [1m0s elapsed]
Error: [ERROR] waiting for cluster ID (c-wkltx) downloading catalogs: [ERROR] getting catalog V2 list at cluster ID (c-wkltx): Unknown schema type [catalog.cattle.io.clusterrepo]
on ../rancher_cluster.tf line 33, in resource "rancher2_cluster_sync" "this":
33: resource "rancher2_cluster_sync" "this" {
Seeing this exact same thing
│ Error: [ERROR] waiting for cluster ID (c-hg92l) downloading catalogs: [ERROR] getting catalog V2 list at cluster ID (c-hg92l): Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [dial tcp 127.0.0.1:6080: connect: connection refused] from [https://mcresearchlabs.rancher.cloud/k8s/clusters/c-hg92l/v1]
│
│ on ../modules/rancher/eks/cluster.tf line 36, in resource "rancher2_cluster_sync" "cluster":
│ 36: resource "rancher2_cluster_sync" "cluster" {
Hi @armsnyder, the retry logic seems to be working fine, but I agree that it should be configurable. As you mentioned, the default retries (3 retries with 5s ticks) are not enough, hence the 500 errors.
I've submitted PR https://github.com/rancher/terraform-provider-rancher2/pull/663, deprecating the retries argument in favour of a new timeout argument. The main difference is that timeout can be configured in a more intuitive way (Go duration format), and the same timeout is applied both when there are Rancher connection issues and when getting 500 and Unknown schema type errors. Please take a look.
Could this be linked to the rancher2_cluster_sync.state_confirm attribute no longer being used?
Hi @rawmind0, I tried setting the timeout to 10 minutes to test:
provider "rancher2" {
api_url = data.terraform_remote_state.mgmt_zone.outputs.rancher_api_url
insecure = true
token_key = data.terraform_remote_state.mgmt_zone.outputs.rancher_admin_token_key
timeout = "10m"
}
But I still get the same error as @armsnyder. It doesn't seem to retry the connection after getting a 500 error:
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m0s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m10s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m20s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m30s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m40s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [3m50s elapsed]
2021/05/21 15:37:17 [DEBUG] POST https://######/api/v4/projects/59026/terraform/state/dev-zone?ID=c756936d-6916-ff95-09c7-ac874f036bd3
╷
│ Error: [ERROR] waiting for cluster ID (c-p725n) downloading catalogs: [ERROR] getting catalog V2 list at cluster ID (c-p725n): Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [dial tcp 127.0.0.1:6080: connect: connection refused] from [https://######/k8s/clusters/c-p725n/v1]
│
│ with module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync,
│ on ../../terraform-modules/downstream-cluster/main.tf line 93, in resource "rancher2_cluster_sync" "k8s_cluster_sync":
│ 93: resource "rancher2_cluster_sync" "k8s_cluster_sync" {
│
╵
2021/05/21 15:37:18 [DEBUG] DELETE https://#####/api/v4/projects/59026/terraform/state/dev-zone/lock
Releasing state lock. This may take a few moments...
Hi @lperrin-obs, thanks for reporting this.
The catalog v2 client was not configured with the timeout when a new client was created. Created PR #668 to fix it. Could you please test with it? https://github.com/rancher/terraform-provider-rancher2/pull/668#issuecomment-846026525
Released tfp v1.15.1, which includes PR #668 to fix the issue.
Hi @rawmind0, I updated the Terraform provider to v1.15.1 but I still get this error, even though the catalog is ready in Cluster Explorer:
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [15m30s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [15m40s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [15m50s elapsed]
module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync: Still creating... [16m0s elapsed]
2021/05/26 15:58:48 [DEBUG] POST https://#####/api/v4/projects/59026/terraform/state/dev-zone?ID=eee940c2-6580-0d6d-82ea-b9a15749416d
╷
│ Error: [ERROR] waiting for cluster ID (c-sjk8k) downloading catalogs: [ERROR] getting catalog V2 list at cluster ID (c-sjk8k): Unknown schema type [catalog.cattle.io.clusterrepo]
│
│ with module.dev_cluster.rancher2_cluster_sync.k8s_cluster_sync,
│ on ../../terraform-modules/downstream-cluster/main.tf line 93, in resource "rancher2_cluster_sync" "k8s_cluster_sync":
│ 93: resource "rancher2_cluster_sync" "k8s_cluster_sync" {
│
╵
2021/05/26 15:58:48 [DEBUG] DELETE https://#####/api/v4/projects/59026/terraform/state/dev-zone/lock
@lperrin-obs, I was unable to reproduce the reported issue. Maybe it is a "real" timeout? The GetCatalogV2List function also handles the case where the returned error IsUnknownSchemaType: https://github.com/rancher/terraform-provider-rancher2/blob/master/rancher2/config.go#L884
Specific timeout error messages were not added to the provider; adding them in PR #678.
@rawmind0 I did several retries today and was unable to reproduce the issue, so it was probably a temporary network problem.
@lperrin-obs glad to hear that. I think it was a "real" timeout, but the error message did not make that distinction. It will be added in the mentioned PR.
PR https://github.com/rancher/terraform-provider-rancher2/pull/678 merged. The fix will be included in the next tf provider release.
Please reopen the issue if needed.
Still seeing this issue under tf1.1.7 / provider version 1.22.2. I'm actually adding catalogs after creating a cluster and using cluster_sync to ensure the cluster is up. I need to set an insane state_confirm value (currently at 100, going to try to decrease it) to make it wait long enough; otherwise I get the following error:
edit: it doesn't work at any value below 20
Error: Creating Catalog V2: Unknown schema type [catalog.cattle.io.clusterrepo]
with rancher2_catalog_v2.helm_catalogs[2],
on main.tf line 18, in resource "rancher2_catalog_v2" "helm_catalogs":
18: resource "rancher2_catalog_v2" "helm_catalogs" {
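For readers trying to picture the setup described in this comment, a rough HCL sketch might look like the following. The resource names, the `var.cluster_id` variable, and the catalog name and URL are placeholders invented for illustration; only `state_confirm` and the value 20 come from the comment, and the right value will likely differ per environment.

```hcl
# Wait until the cluster has been confirmed active several times in a row.
resource "rancher2_cluster_sync" "this" {
  cluster_id    = var.cluster_id # placeholder: ID of the downstream cluster
  state_confirm = 20             # lowest value reported to work in the comment above
}

# Create the catalog only after the sync resource has finished waiting.
resource "rancher2_catalog_v2" "example" {
  cluster_id = rancher2_cluster_sync.this.cluster_id
  name       = "example"                    # placeholder
  url        = "https://charts.example.com" # placeholder
}
```

Referencing `rancher2_cluster_sync.this.cluster_id` rather than the raw cluster ID is what makes Terraform order the catalog after the sync.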
I'm still seeing this issue with TF 1.3.2 and rancher provider 1.24.1.
I am also seeing this issue. Any updates on how to fix this?
I opened a new issue @chengksah @son-la @dbinkhuysen, please post your details there if they are different and still affecting you.
Hi.
We are trying the new wait_for_catalogs=true attribute setting on our rancher2_cluster_sync resource, in order to resolve this issue: rancher/terraform-provider-rancher2#627 (I believe this is the suggested fix, as simply taking the rancher2 terraform provider v1.14.0 did not resolve that issue.)
With wait_for_catalogs=true we are getting Terraform apply failures due to a 500 error. After running Terraform, we can verify that the URL that the error mentions is working. I think the retry count should be increased or made configurable.
What Happened
The rancher2_cluster_sync resource fails with a 500 status code when wait_for_catalogs=true
What I Expected
The rancher2_cluster_sync resource should be more tolerant to errors, or make retry counts configurable in the provider.