[BUG] multiple resources reporting "Unknown schema type [catalog.cattle.io.clusterrepo]" on fresh cluster

pneigel-ca commented 1 year ago

Rancher Server Setup

Rancher version: 2.6.9
Installation option: Helm on EKS

Information about the Cluster

Kubernetes version: v1.24.8-rancher1-1
Cluster Type (Local/Downstream): Downstream vSphere Provisioned

User Information

What is the role of the user logged in? Global admin

Provider Information

What is the version of the Rancher v2 Terraform Provider in use? 1.24.1
What is the version of Terraform in use? 1.1.7

Describe the bug

When provisioning a new downstream cluster with Terraform and automation, the cluster is created, but creating resources in the downstream cluster fails with an error. Re-applying the same Terraform configuration after a short wait works without issue.

To Reproduce

Create a fresh, moderately sized cluster (~20 min to create) and deploy resources to the cluster's namespaces.

Actual Result

Creating resources fails with an odd error, but can be reapplied without issue or modification.

Expected Result

Cluster sync should understand when the cluster is ready for resource provisioning.

Other information

A similar issue was "resolved" here: https://github.com/rancher/terraform-provider-rancher2/issues/662. Newer reports on that issue are almost identical to this one, so I opened a new issue.

We use rancher2_cluster_sync resources to ensure the cluster is up and available. Re-applying the same code with no changes works without issue.

Fail:

Rerun:

We are experiencing the problem with both rancher2_catalog_v2 and rancher2_app_v2 resources.

resource "rancher2_cluster_sync" "sync" {
  cluster_id    = rancher2_cluster.cluster.id
  node_pool_ids = [rancher2_node_pool.controlplane.id, rancher2_node_pool.etcd.id, rancher2_node_pool.workers.id]
  timeouts {
    create = "60m"
    update = "60m"
    delete = "60m"
  }
}

resource "rancher2_catalog_v2" "github_helm" {
  cluster_id       = rancher2_cluster_sync.sync.id
  name             = "github-helm-repo"
  git_repo         = "https://myrepo.domain/my-org/HELM-CHARTS.git"
  git_branch       = "gh-pages"
  secret_name      = "helm-repo"
  secret_namespace = data.rancher2_namespace.default.id
}

resource "rancher2_app_v2" "logging" {
  cluster_id    = rancher2_cluster_sync.sync.id
  name          = "rancher-logging"
  namespace     = "cattle-logging-system"
  repo_name     = "rancher-charts"
  chart_name    = "rancher-logging"
  chart_version = "100.1.2+up3.17.4"
}

pneigel-ca commented 1 year ago

The same code, when deployed to another environment, produces a similar but different error:

Error: Creating Catalog V2: Timeout getting Catalog V2 Client at cluster ID c-lc6wd: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [lost connection to cluster: failed to find Session for client stv-cluster-c-lc6wd] from [https://<server_url>/k8s/clusters/c-lc6wd/v1]

jrwhetse commented 1 year ago

I'm experiencing the same problem with all _v2 resources, including storage, which randomly fails along with catalog and apps. I'm unable to get a clean full cluster creation unless I plan / apply multiple times, progressing a little further each time until it succeeds.

gionn commented 6 months ago

cut and paste of terraform apply logs:

Tue, 16 Apr 2024 03:42:27 GMT rancher2_cluster_sync.wait_cluster_ready: Still creating... [2m30s elapsed]
Tue, 16 Apr 2024 03:42:37 GMT rancher2_cluster_sync.wait_cluster_ready: Still creating... [2m40s elapsed]
Error: Error: [ERROR] waiting for cluster ID (c-t6kzx) downloading catalogs: [ERROR] getting catalog V2 list at cluster ID (c-t6kzx): Timeout getting catalog V2 list at cluster ID c-t6kzx: Unknown schema type [catalog.cattle.io.clusterrepo]
Tue, 16 Apr 2024 03:54:21 GMT │ 
Tue, 16 Apr 2024 03:54:21 GMT │   with rancher2_cluster_sync.wait_cluster_ready,
Tue, 16 Apr 2024 03:54:21 GMT │   on main.tf line 403, in resource "rancher2_cluster_sync" "wait_cluster_ready":
Tue, 16 Apr 2024 03:54:21 GMT │  403: resource "rancher2_cluster_sync" "wait_cluster_ready" {
Tue, 16 Apr 2024 03:54:21 GMT │ 
Tue, 16 Apr 2024 03:54:21 GMT ╵

this is the resource definition:

resource "rancher2_cluster_sync" "wait_cluster_ready" {
  cluster_id    = module.rancher2_import.cluster_id
  wait_catalogs = true

}

🤷🏻

dbsanfte commented 6 months ago

I still experience this issue in 2024 with the latest provider and a fresh cluster on Rancher 2.8.3.

I'm experimenting with state_confirm values in the rancher2_cluster_sync resource to see if it's just a matter of waiting longer. It's a rather hacky solution, though. I'm also trying a bigger timeout in the rancher2 provider.
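
For readers who want to try the same thing, here is a minimal sketch of those two settings. It reuses the rancher2_cluster.cluster reference from the original post; the variable names and the specific values (300s, 5) are assumptions chosen for illustration, not values anyone in this thread reported as working.

```hcl
# Hypothetical input variables; substitute however you supply credentials.
variable "rancher_api_url" {
  type = string
}

variable "rancher_token_key" {
  type      = string
  sensitive = true
}

provider "rancher2" {
  api_url   = var.rancher_api_url
  token_key = var.rancher_token_key

  # Provider-level timeout, as a Go duration string.
  # "300s" is an illustrative value only.
  timeout = "300s"
}

resource "rancher2_cluster_sync" "sync" {
  cluster_id = rancher2_cluster.cluster.id

  # Require the active state to be confirmed several times before the
  # sync is treated as complete. 5 is an illustrative value only.
  state_confirm = 5

  timeouts {
    create = "60m"
  }
}
```

Whether longer waits actually avoid the "Unknown schema type" error is not established here; this only shows where the knobs live.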

Turb0Fly commented 5 months ago

Same here on Rancher 2.8.3: I always have to re-apply my plan a second time, and then it goes through.

Error: Creating secret V2: Timeout getting Catalog V2 Client at cluster ID c-w9szc: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [lost connection to cluster: failed to find Session for client stv-cluster-c-w9szc] from

I do use the rancher2_cluster_sync resource as well, with state_confirm. The reason I do this is to allow cluster repositories to sync before we deploy apps. Sometimes when a new Helm chart version is available, it's not immediately added when the repo is (for example, Longhorn); you have to manually click refresh or wait until the cattle-cluster-agent triggers a refresh for the catalog/repo.

Before that, I would get an error saying there was no such version for the Helm chart I was installing.
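
A different way to give the repositories time to sync is a fixed delay between the cluster sync and the app. The sketch below uses the time_sleep resource from the hashicorp/time provider, which is not something anyone in this thread reported using. The 120s duration is a guess, and the resource names and app values are borrowed from the original post.

```hcl
terraform {
  required_providers {
    rancher2 = {
      source = "rancher/rancher2"
    }
    time = {
      source = "hashicorp/time"
    }
  }
}

# Pause after the cluster sync completes so the cluster agent has time to
# register and the chart repositories can finish their first sync.
# The duration is an assumption; adjust it for your environment.
resource "time_sleep" "wait_for_repos" {
  depends_on      = [rancher2_cluster_sync.sync]
  create_duration = "120s"
}

resource "rancher2_app_v2" "logging" {
  cluster_id    = rancher2_cluster_sync.sync.id
  name          = "rancher-logging"
  namespace     = "cattle-logging-system"
  repo_name     = "rancher-charts"
  chart_name    = "rancher-logging"
  chart_version = "100.1.2+up3.17.4"

  # Do not start the install until the delay has elapsed.
  depends_on = [time_sleep.wait_for_repos]
}
```

A fixed sleep only papers over the race and can still fail on slow clusters, so treat it as a stopgap rather than a fix for the underlying provider behavior.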
