kubernetes-sigs / cluster-api-provider-ibmcloud

Cluster API Provider for IBM Cloud
https://cluster-api-ibmcloud.sigs.k8s.io
Apache License 2.0

Cluster incorrectly shown as ready when both LoadBalancers are not ready #2026

Closed hamzy closed 3 weeks ago

hamzy commented 4 weeks ago

/kind bug /area provider/ibmcloud

What steps did you take and what happened: Deploy a 4.18 cluster on a PowerVS zone where LoadBalancers are slow to create. We are called with InfraReady and then create DNS records for the LBs. However, only the public LB exists at that point, so the cluster fails to deploy.

What did you expect to happen: We should wait for all specified LBs to become ready.

Karthik-K-N commented 4 weeks ago

I think we currently check only the public load balancer and set the hostname accordingly. Maybe we need to check the status of all configured load balancers before setting the infrastructure as ready. https://github.com/kubernetes-sigs/cluster-api-provider-ibmcloud/blob/main/controllers/ibmpowervscluster_controller.go#L334-L346
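To illustrate the idea of checking every configured load balancer, here is a minimal sketch. The `lbStatus` type, the function name, and the load balancer names are hypothetical and are not the provider's actual types or API. The only detail taken from this thread is that a load balancer is treated as ready when its state is `active`.

```go
package main

import (
	"fmt"
	"sort"
)

// lbStatus is a hypothetical, simplified stand-in for the per-load-balancer
// status recorded on the cluster resource. It is not the provider's real type.
type lbStatus struct {
	State    string
	Hostname string
}

// notReadyLoadBalancers returns the names of configured load balancers that
// are either missing from the status map or not yet in the "active" state.
// An empty result means every configured load balancer is ready.
func notReadyLoadBalancers(configured []string, status map[string]lbStatus) []string {
	var pending []string
	for _, name := range configured {
		st, ok := status[name]
		if !ok || st.State != "active" {
			pending = append(pending, name)
		}
	}
	sort.Strings(pending)
	return pending
}

func main() {
	// Illustrative names only: two load balancers are configured,
	// but only the public one has a status entry so far.
	configured := []string{"lb-public", "lb-internal"}
	status := map[string]lbStatus{
		"lb-public": {State: "active", Hostname: "public.example.com"},
	}

	pending := notReadyLoadBalancers(configured, status)
	fmt.Println("infra ready:", len(pending) == 0, "pending:", pending)
}
```

Treating a load balancer with no status entry as not ready covers the case where the second load balancer has not been created yet.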

mkumatag commented 4 weeks ago

@Karthik-K-N can someone fix this behaviour asap?

Karthik-K-N commented 4 weeks ago

sure.

Karthik-K-N commented 4 weeks ago

> I think we currently check only the public load balancer and set the hostname accordingly. Maybe we need to check the status of all configured load balancers before setting the infrastructure as ready. https://github.com/kubernetes-sigs/cluster-api-provider-ibmcloud/blob/main/controllers/ibmpowervscluster_controller.go#L334-L346

@dharaneeshvrd Just looking at the code, we already check the status of all load balancers in ReconcileLoadbalancer: https://github.com/Karthik-K-N/cluster-api-provider-ibmcloud/blob/8a563a0620f34ecb772a73bccb9ed7f1384c822f/cloud/scope/powervs_cluster.go#L1977-L1979

When do you think this issue would occur?

Karthik-K-N commented 4 weeks ago

We investigated the code, and it is working as expected:

  1. For every LB configured for the cluster, we requeue if the LB status is not active. For reference: code.
  2. We set infra.Ready only after all LBs are ready. For reference: code.
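The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the provider's code: the `cluster` and `result` types, the function name, and the one-minute requeue interval are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// result is a hypothetical stand-in for a controller reconcile result.
// A non-zero RequeueAfter means "try again later".
type result struct {
	RequeueAfter time.Duration
}

// cluster is a hypothetical, simplified view of the cluster resource.
type cluster struct {
	LoadBalancerStates map[string]string // load balancer name -> state
	Ready              bool
}

// reconcileLoadBalancers requeues as soon as any configured load balancer is
// not "active" (a missing entry counts as not active), and marks the cluster
// ready only when all of them are active.
func reconcileLoadBalancers(c *cluster, configured []string) result {
	for _, name := range configured {
		if c.LoadBalancerStates[name] != "active" {
			// Assumed interval; the real controller may use a different value.
			return result{RequeueAfter: time.Minute}
		}
	}
	c.Ready = true
	return result{}
}

func main() {
	configured := []string{"lb-public", "lb-internal"}
	c := &cluster{LoadBalancerStates: map[string]string{
		"lb-public":   "active",
		"lb-internal": "create_pending",
	}}

	// First pass: the internal load balancer is still pending, so we requeue.
	res := reconcileLoadBalancers(c, configured)
	fmt.Println("ready:", c.Ready, "requeue after:", res.RequeueAfter)

	// Second pass: both load balancers are active, so the cluster becomes ready.
	c.LoadBalancerStates["lb-internal"] = "active"
	res = reconcileLoadBalancers(c, configured)
	fmt.Println("ready:", c.Ready, "requeue after:", res.RequeueAfter)
}
```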

@hamzy are you running with the latest code from this repo? If you still face issues with the latest code, please share the controller logs so we can debug further. Thanks.

Karthik-K-N commented 4 weeks ago

/triage needs-information

hamzy commented 4 weeks ago

We are running version sigs.k8s.io/cluster-api-provider-ibmcloud v0.9.0-alpha.0.0.20240913094112-c6bcd313bce0

https://github.com/openshift/installer/blob/master/cluster-api/providers/ibmcloud/go.mod#L7

hamzy commented 4 weeks ago

Version sigs.k8s.io/cluster-api-provider-ibmcloud v0.9.0-beta.0.0.20241017140904-8a563a0620f3 in https://github.com/openshift/installer/pull/9118 also fails.

Karthik-K-N commented 4 weeks ago

Could you please share the error and the code reference?

hamzy commented 4 weeks ago

This is a failing run of #9118 https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_installer/9118/pull-ci-openshift-installer-master-e2e-powervs-capi-ovn/1849180644286926848/artifacts/e2e-powervs-capi-ovn/ipi-install-powervs-install/artifacts/clusterapi_output/

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_installer/9118/pull-ci-openshift-installer-master-e2e-powervs-capi-ovn/1849180644286926848/artifacts/e2e-powervs-capi-ovn/ipi-install-powervs-install/artifacts/.openshift_install.log

Search for "InfraReady: hostname =" to see there is only one LB active at the time although the other one does become active eventually.

mkumatag commented 4 weeks ago

@Karthik-K-N @hamzy let us setup a call on Monday and sort out this issue.

Karthik-K-N commented 4 weeks ago

> InfraReady: hostname

Thanks for the reference. I will check and post an update here. Looking at the IBMPowerVSCluster resource from here, something seems wrong: the LB is in the create_pending state, but the load balancer is marked ready in the conditions.

hamzy commented 4 weeks ago

FYI I pass in two LBs here: https://github.com/openshift/installer/blob/master/pkg/asset/manifests/powervs/cluster.go#L135-L162

So why does the status list only one?

  loadbalancers:
    p-mad02-2-capi-master-qwb48-loadbalancer:
      id: r050-70cc8d92-60fa-4ed1-9c09-11f0dd4d3d6a
      state: create_pending
      hostname: 70cc8d92-eu-es.lb.appdomain.cloud
      controllercreated: true

hamzy commented 4 weeks ago

> @Karthik-K-N @hamzy let us setup a call on Monday and sort out this issue.

Sure, we could, but the discussion here seems to be sufficient?

hamzy commented 3 weeks ago

I wrote a test PR: https://github.com/openshift/installer/pull/9145

First run logs at: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_installer/9145/pull-ci-openshift-installer-master-e2e-powervs-capi-ovn/1850937938343366656/artifacts/e2e-powervs-capi-ovn/ipi-install-powervs-install/artifacts/clusterapi_output/

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_installer/9145/pull-ci-openshift-installer-master-e2e-powervs-capi-ovn/1850937938343366656/artifacts/e2e-powervs-capi-ovn/ipi-install-powervs-install/artifacts/clusterapi_output/IBMPowerVSCluster-openshift-cluster-api-guests-p-mad02-1-capi-master-pw5lj.yaml

shows

  loadbalancers:
    p-mad02-1-capi-master-pw5lj-loadbalancer:
      id: r050-50fb5ff1-a7e0-432d-bc1c-e4848ecd461b
      state: active
      hostname: 50fb5ff1-eu-es.lb.appdomain.cloud
      controllercreated: true
    p-mad02-1-capi-master-pw5lj-loadbalancer-int:
      id: r050-92f04537-ce19-4bce-9e8c-ab84c4632c4b
      state: active
      hostname: 92f04537-eu-es.lb.appdomain.cloud
      controllercreated: true

and

  - type: LoadBalancerReady
    status: "True"
    severity: ""
    lasttransitiontime: "2024-10-28T17:48:55Z"
    reason: ""
    message: ""

Karthik-K-N commented 3 weeks ago

Awesome, thank you for verifying this. I think we should be good to merge this.