rancher / terraform-provider-rancher2

Terraform Rancher2 provider
https://www.terraform.io/docs/providers/rancher2/
Mozilla Public License 2.0

V2 Catalog not waiting for API and V2 app not waiting on catalog creation #740

Closed: bennysp closed this issue 3 years ago

bennysp commented 3 years ago

I am running Rancher 2.5.9, with version 1.17.2 of the Terraform Rancher2 provider and Terraform 1.0.6.

Some of my catalogs install and some don't, and on top of that, some apps fail to install because their catalog is not fully ready at that moment. When I rerun Terraform, everything works fine. So I suspect the failing catalog resource is not checking that the API is available (returning a 200 response), and the app resources are not checking the catalog's status to see whether it is ready.

You can see below that I am using the cluster ID from rancher2_cluster_sync to make sure each resource waits on the cluster.

Code:

resource "rancher2_catalog_v2" "k8shome" {
  cluster_id = rancher2_cluster_sync.rancher-cluster.id
  name = "k8shome"
  url = "https://k8s-at-home.com/charts"
}

resource "rancher2_catalog_v2" "haproxytech" {
  cluster_id = rancher2_cluster_sync.rancher-cluster.id
  name = "haproxytech"
  git_repo = "https://github.com/haproxytech/helm-charts"
  git_branch = "main"
  timeouts {
    create = "20m"
    delete = "20m"
  }
}

resource "rancher2_catalog_v2" "bitnami" {
  cluster_id = rancher2_cluster_sync.rancher-cluster.id
  name = "bitnami"
  url = "https://charts.bitnami.com/bitnami"
  timeouts {
    create = "20m"
    delete = "20m"
  }
}

resource "rancher2_catalog_v2" "adwerx" {
  cluster_id = rancher2_cluster_sync.rancher-cluster.id
  name = "adwerx"
  url = "https://adwerx.github.io/charts/"
  timeouts {
    create = "20m"
    delete = "20m"
  }
}

resource "rancher2_app_v2" "longhorn" {  
  cluster_id = rancher2_cluster_sync.rancher-cluster.id
  project_id = data.rancher2_project.project_system.id
  name = "longhorn"
  namespace = "longhorn-system"
  repo_name = "rancher-charts"
  chart_name = "longhorn"
  wait = true
  timeouts {
    create = "20m"
    delete = "20m"
  }
}

resource "rancher2_app_v2" "rancher-monitoring" {
  cluster_id = rancher2_cluster_sync.rancher-cluster.id
  project_id = data.rancher2_project.project_system.id
  name = "rancher-monitoring"
  namespace = "cattle-monitoring-system"
  repo_name = "rancher-charts"
  chart_name = "rancher-monitoring"
  wait = true
  timeouts {
    create = "20m"
    delete = "20m"
  }
}

resource "rancher2_app_v2" "rancher-cis-benchmark" {  
  cluster_id = rancher2_cluster_sync.rancher-cluster.id
  project_id = data.rancher2_project.project_system.id
  name = "rancher-cis-benchmark"
  namespace = "cis-operator-system"
  repo_name = "rancher-charts"
  chart_name = "rancher-cis-benchmark"
  wait = true
  timeouts {
    create = "20m"
    delete = "20m"
  }
}

resource "rancher2_app_v2" "metallb" {  
  cluster_id = rancher2_cluster_sync.rancher-cluster.id
  project_id = data.rancher2_project.project_system.id
  name = "metallb"
  namespace = "metallb"
  repo_name = "bitnami"
  chart_name = "metallb"
  wait = true
  values = <<EOT
configInline:
  address-pools:
  - addresses:
    - 192.168.20.10-192.168.20.254
    name: bgp-service
    protocol: bgp
  peers:
  - my-asn: 64512
    peer-address: 192.168.10.1
    peer-asn: 64512
EOT
  timeouts {
    create = "20m"
    delete = "20m"
  }
}
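
One thing worth noting: the metallb app points at the bitnami catalog only through the literal string repo_name = "bitnami", so Terraform has no dependency edge between the two resources and can start creating the app before the catalog exists. Referencing the catalog resource and adding an explicit depends_on at least forces the ordering, though it does not make the provider poll the catalog's download status. A sketch against the resources above (values and timeouts omitted for brevity):

resource "rancher2_app_v2" "metallb" {
  cluster_id = rancher2_cluster_sync.rancher-cluster.id
  project_id = data.rancher2_project.project_system.id
  name       = "metallb"
  namespace  = "metallb"
  # Reference the catalog resource instead of a bare string so Terraform
  # creates the catalog first; this orders the graph but does not wait for
  # the catalog index to finish downloading server-side.
  repo_name  = rancher2_catalog_v2.bitnami.name
  chart_name = "metallb"
  wait       = true
  depends_on = [rancher2_catalog_v2.bitnami]
}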

Errors:

│ Error: [ERROR] installing App V2: Bad response statusCode [404]. Status [404 Not Found]. Body: [] from [https://rancher.local/k8s/clusters/c-g49sj/v1/catalog.cattle.io.operations/cattle-monitoring-system/helm-operation-fk8vb]
│ 
│   with rancher2_app_v2.rancher-monitoring,
│   on rancher-app-v2.tf line 15, in resource "rancher2_app_v2" "rancher-monitoring":
│   15: resource "rancher2_app_v2" "rancher-monitoring" {
│ 
╵
╷
│ Error: Getting catalog V2 ID (bitnami): Bad response statusCode [404]. Status [404 Not Found]. Body: [] from [https://rancher.local/k8s/clusters/c-g49sj/v1/catalog.cattle.io.clusterrepos/bitnami]
│ 
│   with rancher2_app_v2.metallb,
│   on rancher-app-v2.tf line 43, in resource "rancher2_app_v2" "metallb":
│   43: resource "rancher2_app_v2" "metallb" {  
│ 
╵
╷
│ Error: Creating Catalog V2: Timeout getting Catalog V2 Client at cluster ID c-g49sj: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [dial tcp 127.0.0.1:6080: connect: connection refused] from [https://rancher.local/k8s/clusters/c-g49sj/v1]
│ 
│   with rancher2_catalog_v2.haproxytech,
│   on rancher_catalogs.tf line 18, in resource "rancher2_catalog_v2" "haproxytech":
│   18: resource "rancher2_catalog_v2" "haproxytech" {

rawmind0 commented 3 years ago

The provider is waiting, but it seems you are hitting a timeout while waiting for the catalog client. Have you tried configuring the provider with a higher timeout value? See https://registry.terraform.io/providers/rancher/rancher2/latest/docs#timeout.

bennysp commented 3 years ago

I have played with higher timeouts, but admittedly I haven't tried the provider-level one. I will check.

I have switched to Rancher v2.6.0, so I will report whether I see the same issue there and, if I do, try that timeout.

rawmind0 commented 3 years ago

Please reopen the issue if required.

bennysp commented 3 years ago

Hi @rawmind0,

I have set the timeout using version 1.20.0 of the Terraform Rancher2 provider and I am still seeing this issue. I am on Rancher 2.6.0.

provider "rancher2" {
  api_url     = var.rancher_url
  access_key  = local.rancher_creds.rancher-api-key
  secret_key  = local.rancher_creds.rancher-api-secret
  insecure    = true
  timeout     = "20m"
}

Any ideas?

rawmind0 commented 3 years ago

Hi @bennysp, is your downstream cluster taking more than 20 minutes to become active? As mentioned, it seems you are hitting a timeout while waiting for the catalog client.

Error: Creating Catalog V2: Timeout getting Catalog V2 Client at cluster ID c-g49sj: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [dial tcp 127.0.0.1:6080: connect: connection refused] from [https://rancher.local/k8s/clusters/c-g49sj/v1]

bennysp commented 3 years ago

@rawmind0 - It takes about 6 minutes for the cluster itself to go into an active state. But I think that is where the problem lies: the cluster reports active to Terraform before it really is ready.

My assumption was that tying the Catalog V2 resources to the cluster_sync cluster ID would make them wait until the cluster is fully ready.
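
Since the sync resource is what gates everything, one knob that might help if the cluster briefly reports active before it is truly ready is state_confirm, which makes rancher2_cluster_sync require several consecutive active checks. A sketch (the rancher2_cluster.rancher-cluster reference is assumed from the rest of my config, which is not shown here):

resource "rancher2_cluster_sync" "rancher-cluster" {
  cluster_id    = rancher2_cluster.rancher-cluster.id
  # Require three consecutive "active" checks before the sync is considered done.
  state_confirm = 3
}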

I wrote the Ansible tasks below when I was on Rancher 2.5.x, because I hit a similar problem with V1 catalogs, but I am trying to avoid this kind of workaround for V2.

  # Due to delay in coming online, wait until it comes online before doing anything else
  - name: Rancher Cluster - Wait until cattle-cluster-agent comes online
    uri: 
      method: GET 
      status_code: 200
      url: "{{ uri_workloads }}/deployment:cattle-system:cattle-cluster-agent"
      return_content: yes
      validate_certs: no
      headers:
        Authorization: "Bearer {{ rancher_api_key }}:{{ rancher_api_secret }}"
        Content-Type: application/json
    register: cluster_agent_json
    until: cluster_agent_json.json.deploymentStatus.readyReplicas >= 1
    retries: 180
    delay: 30

  - name: Rancher Cluster - Add V1 catalogs
    uri: 
      method: POST
      status_code:
        - 201
        - 409 # Already exists is okay
      url: "{{ rancher_url_api }}/catalogs"
      body_format: json
      body:
        branch: "{{ item.branch | default('master') }}"
        url: "{{ item.v1_url }}"
        name: "{{ item.name }}"
        kind: "helm:git"
        helmVersion: "helm_v3"
      return_content: yes
      validate_certs: no
      headers:
        Authorization: "Bearer {{ rancher_api_key }}:{{ rancher_api_secret }}"
        Content-Type: application/json
    register: catalogs_json
    loop: "{{ catalogs }}"
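
A Terraform-native equivalent of that Ansible wait would be a fixed delay between the cluster sync and the catalogs. This is only a sketch, assuming the hashicorp/time provider is available; the 120s delay is an arbitrary guess, not a tested value:

resource "time_sleep" "wait_for_cluster" {
  depends_on      = [rancher2_cluster_sync.rancher-cluster]
  # Give the downstream cluster's API some settle time after it reports active.
  create_duration = "120s"
}

resource "rancher2_catalog_v2" "bitnami" {
  cluster_id = rancher2_cluster_sync.rancher-cluster.id
  name       = "bitnami"
  url        = "https://charts.bitnami.com/bitnami"
  # Do not create the catalog until the settle timer has elapsed.
  depends_on = [time_sleep.wait_for_cluster]
}
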
bennysp commented 3 years ago

@rawmind0 I saw a few errors on the nodes themselves, so I increased memory/CPU for the masters, but I still see random errors. There may be some other underlying issue; I will dig further into my setup and see whether the problem is there.

Thanks

jrwhetse commented 1 year ago

I'm seeing this same problem with Rancher 2.6.11. I'm trying to install the cluster-autoscaler app from the kubernetes repo.

I have 3 clusters that all install the same Helm chart. All clusters exist in AWS. The first time I ran this, it worked on 2 clusters but failed on the third. I destroyed and recreated, and it fails on 2 out of 3. If I re-run a plan/apply, it works. My guess is Rancher isn't waiting for the rancher2_catalog_v2 to be fully initialized before trying to install the chart.

I've had similar issues getting the rancher2_catalog_v2 resources created for the cluster. I ended up using rancher2_cluster_sync to ensure the entire cluster was in an Active state before adding the catalogs. Now the same thing is happening with rancher2_app_v2; a possible mitigation is sketched below.
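
For what it's worth, rancher2_cluster_sync also exposes a wait_catalogs argument that is documented to block until catalogs are downloaded and active; I have not verified whether that covers V2 cluster repos or only the legacy catalogs. A sketch, with the rancher2_cluster.this reference assumed:

resource "rancher2_cluster_sync" "this" {
  cluster_id    = rancher2_cluster.this.id
  # Documented to wait until catalogs are downloaded and active; whether this
  # applies to catalog V2 cluster repos is unverified.
  wait_catalogs = true
}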

rancher2_app_v2.c_prod_system_cluster_autoscaler: Creation complete after 28s [id=c-2nmmf.kube-system/cluster-autoscaler]
╷
│ Error: failed to install app v2: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=ServerError, message=failed to find chartName cluster-autoscaler version 9.27.0: NotFound 404] from [https://rancher-ha-jw-mgmt.bylightsdc.bylight.com/k8s/clusters/c-fq9zp/v1/catalog.cattle.io.clusterrepos/kubernetes-autoscaler?action=install]
│ 
│   with rancher2_app_v2.c_mgmt_system_cluster_autoscaler,
│   on project_apps_v2.tf line 43, in resource "rancher2_app_v2" "c_mgmt_system_cluster_autoscaler":
│   43: resource "rancher2_app_v2" "c_mgmt_system_cluster_autoscaler" {
│ 
╵
╷
│ Error: failed to install app v2: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=ServerError, message=failed to find chartName cluster-autoscaler version 9.27.0: NotFound 404] from [https://rancher-ha-jw-mgmt.bylightsdc.bylight.com/k8s/clusters/c-ml77c/v1/catalog.cattle.io.clusterrepos/kubernetes-autoscaler?action=install]
│ 
│   with rancher2_app_v2.c_int_system_cluster_autoscaler,
│   on project_apps_v2.tf line 110, in resource "rancher2_app_v2" "c_int_system_cluster_autoscaler":
│  110: resource "rancher2_app_v2" "c_int_system_cluster_autoscaler" {

At this point, I just want a Rancher Ansible collection.