The provider is waiting, but it seems you are getting a timeout waiting for the catalog client. Have you tried configuring the provider with a higher timeout value? See https://registry.terraform.io/providers/rancher/rancher2/latest/docs#timeout
I have tried playing with higher timeouts, but admittedly I haven't tried the provider one. I will check.
I switched to Rancher v2.6.0, so I will report back if I have the same issue and, if I do, try the same timeout.
Please reopen the issue if required.
Hi @rawmind0,
I have set the timeout using version 1.20.0 of the Terraform Rancher provider and I am still seeing this issue. I am on Rancher 2.6.0.
provider "rancher2" {
api_url = var.rancher_url
access_key = local.rancher_creds.rancher-api-key
secret_key = local.rancher_creds.rancher-api-secret
insecure = true
timeout = "20m"
}
Any ideas?
Hi @bennysp, is your downstream cluster taking more than 20 minutes to become active? As mentioned, it seems you are getting a timeout waiting for the catalog client.
Error: Creating Catalog V2: Timeout getting Catalog V2 Client at cluster ID c-g49sj: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [dial tcp 127.0.0.1:6080: connect: connection refused] from [https://rancher.local/k8s/clusters/c-g49sj/v1]
@rawmind0 - It takes about 6 minutes for the cluster itself to go into an active state. But I think it is reporting active to Terraform before it is truly active.
I am assuming that if I tie the Catalog V2 to the cluster_sync cluster ID, it should wait until the cluster is fully ready.
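For reference, a minimal sketch of that pattern, assuming a rancher2_cluster resource named "downstream" (the resource names and catalog URL here are illustrative, not taken from this thread):

resource "rancher2_cluster_sync" "downstream" {
  cluster_id    = rancher2_cluster.downstream.id
  # Also wait for catalogs to be downloaded and active
  wait_catalogs = true
}

resource "rancher2_catalog_v2" "charts" {
  # Referencing the sync resource's ID defers catalog creation until
  # the cluster has been reported active
  cluster_id = rancher2_cluster_sync.downstream.id
  name       = "charts"
  url        = "https://example.github.io/charts"
}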
I made this Ansible workaround when I was on the 2.5.x version, because of a similar problem with the V1 catalog, but I am trying to avoid this workaround for V2.
# Due to delay in coming online, wait until it comes online before doing anything else
- name: Rancher Cluster - Wait until cattle-cluster-agent comes online
  uri:
    method: GET
    status_code: 200
    url: "{{ uri_workloads }}/deployment:cattle-system:cattle-cluster-agent"
    return_content: yes
    validate_certs: no
    headers:
      Authorization: "Bearer {{ rancher_api_key }}:{{ rancher_api_secret }}"
      Content-Type: application/json
  register: cluster_agent_json
  until: cluster_agent_json.json.deploymentStatus.readyReplicas >= 1
  retries: 180
  delay: 30

- name: Rancher Cluster - Add V1 catalogs
  uri:
    method: POST
    status_code:
      - 201
      - 409  # Already exists is okay
    url: "{{ rancher_url_api }}/catalogs"
    body_format: json
    body:
      branch: "{{ item.branch | default('master') }}"
      url: "{{ item.v1_url }}"
      name: "{{ item.name }}"
      kind: "helm:git"
      helmVersion: "helm_v3"
    return_content: yes
    validate_certs: no
    headers:
      Authorization: "Bearer {{ rancher_api_key }}:{{ rancher_api_secret }}"
      Content-Type: application/json
  register: catalogs_json
  loop: "{{ catalogs }}"
@rawmind0 I saw a few errors on the nodes themselves, so I increased memory/CPU for the masters, and I still see random errors. There may be some other underlying issue. I will look into my setup more and see if there is an issue there.
Thanks
I'm seeing this same problem with Rancher 2.6.11. I'm trying to install the cluster-autoscaler app from the kubernetes repo.
I have 3 clusters that all install the same Helm chart. All clusters exist in AWS. The first time I ran this, it worked on 2 clusters but failed on the third. I destroyed and recreated, and it fails on 2 out of 3. If I re-run a plan/apply, it works. My guess is Rancher isn't waiting for the rancher2_catalog_v2 to be fully initialized before trying to install the chart.
I've had similar issues with getting the rancher2_catalog_v2 apps created for the cluster. I ended up using rancher2_cluster_sync to ensure my entire cluster was in an Active state before adding the Catalogs. Now this is happening with rancher2_app_v2.
rancher2_app_v2.c_prod_system_cluster_autoscaler: Creation complete after 28s [id=c-2nmmf.kube-system/cluster-autoscaler]
╷
│ Error: failed to install app v2: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=ServerError, message=failed to find chartName cluster-autoscaler version 9.27.0: NotFound 404] from [https://rancher-ha-jw-mgmt.bylightsdc.bylight.com/k8s/clusters/c-fq9zp/v1/catalog.cattle.io.clusterrepos/kubernetes-autoscaler?action=install]
│
│ with rancher2_app_v2.c_mgmt_system_cluster_autoscaler,
│ on project_apps_v2.tf line 43, in resource "rancher2_app_v2" "c_mgmt_system_cluster_autoscaler":
│ 43: resource "rancher2_app_v2" "c_mgmt_system_cluster_autoscaler" {
│
╵
╷
│ Error: failed to install app v2: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=ServerError, message=failed to find chartName cluster-autoscaler version 9.27.0: NotFound 404] from [https://rancher-ha-jw-mgmt.bylightsdc.bylight.com/k8s/clusters/c-ml77c/v1/catalog.cattle.io.clusterrepos/kubernetes-autoscaler?action=install]
│
│ with rancher2_app_v2.c_int_system_cluster_autoscaler,
│ on project_apps_v2.tf line 110, in resource "rancher2_app_v2" "c_int_system_cluster_autoscaler":
│ 110: resource "rancher2_app_v2" "c_int_system_cluster_autoscaler" {
│
╵
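For what it's worth, a hedged sketch of wiring the app to the repo so Terraform at least orders their creation (resource names are hypothetical; the repo URL and chart version come from the errors above):

resource "rancher2_catalog_v2" "kubernetes_autoscaler" {
  cluster_id = rancher2_cluster_sync.mgmt.id
  name       = "kubernetes-autoscaler"
  url        = "https://kubernetes.github.io/autoscaler"
}

resource "rancher2_app_v2" "cluster_autoscaler" {
  cluster_id    = rancher2_cluster_sync.mgmt.id
  name          = "cluster-autoscaler"
  namespace     = "kube-system"
  # Referencing the catalog resource creates an implicit dependency, so the
  # app is only created after the repo resource exists; this still does not
  # guarantee Rancher has finished downloading the repo index, which may be
  # why a second apply succeeds
  repo_name     = rancher2_catalog_v2.kubernetes_autoscaler.name
  chart_name    = "cluster-autoscaler"
  chart_version = "9.27.0"
}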
At this point, I just want a Rancher Ansible collection.
I am running Rancher 2.5.9 and I am using the Terraform Rancher provider version 1.17.2 with Terraform version 1.0.6.
What is happening is that I have several catalogs that install and some that don't. On top of that, some apps won't install because the catalog is not fully ready at that moment. When I rerun Terraform, it works fine. So I suspect that the failing catalog is not checking whether the API is available with a 200 response, and the apps are not checking the catalog's status to see if it is ready.
You can see below that I am using the cluster_sync cluster id to make sure it is waiting on the cluster.
Code:
Errors: