kubernetes-sigs / cluster-api-provider-ibmcloud

Cluster API Provider for IBM Cloud
https://cluster-api-ibmcloud.sigs.k8s.io
Apache License 2.0

If transit gateway fails creation in PowerVS then fail CAPI deploy #1653

Closed hamzy closed 1 month ago

hamzy commented 6 months ago

/kind bug
/area provider/ibmcloud

What steps did you take and what happened:

During an IPI CAPI cluster creation, the transit gateway was not created. The cluster is unusable without it.

What did you expect to happen: Immediate failure.

Anything else you would like to add:

{"errors":[{"code":"precondition_failed","message":"cannot add more than 5 gateways to the selected region","more_info":"https://cloud.ibm.com/apidocs/transit-gateway#error-handling"}],"trace":"5261aa71-e822-4340-baef-8c35e6186852"}
E0308 06:25:17.662235 4128998 ibmpowervscluster_controller.go:183]  "msg"="failed to reconcile transit gateway" "error"="error creating transit gateway: cannot add more than 5 gateways to the selected region" "IBMPowerVSCluster"={"name":"rdr-hamzy-test-dal10-58hkl","namespace":"openshift-cluster-api-guests"} "cluster"={"name":"rdr-hamzy-test-dal10-58hkl","namespace":"openshift-cluster-api-guests"} "controller"="ibmpowervscluster" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="IBMPowerVSCluster" "name"="rdr-hamzy-test-dal10-58hkl" "namespace"="openshift-cluster-api-guests" "reconcileID"="3665bd34-7bdb-4785-aae1-a0ed76a199fc"

Environment:

mkumatag commented 6 months ago

@hamzy thanks for reporting the issue. Can you please share more information, such as a complete dump of the IBMPowerVSCluster resource?

@Karthik-K-N are we setting the right state for the cluster when an error happens? We need to discuss how to fail fast when things go wrong. At the least, we need some condition or design for how many times we really want to retry when a resource fails to create.
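
One way to frame the retry question is to separate errors that a retry can never fix from errors that deserve a bounded number of attempts. The following is only a sketch of that idea, not the provider's actual logic: `apiError`, `terminalCodes`, `isTerminal`, and `shouldRetry` are hypothetical names, and treating `precondition_failed` as terminal is an assumption based solely on the quota error in the report above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiError mirrors the shape of the error payload pasted in the report.
type apiError struct {
	Errors []struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"errors"`
	Trace string `json:"trace"`
}

// terminalCodes is a hypothetical list of error codes that retrying cannot fix.
// Only "precondition_failed" appears in the report; the choice is an assumption.
var terminalCodes = map[string]bool{"precondition_failed": true}

// isTerminal reports whether the payload contains at least one terminal code.
func isTerminal(body []byte) (bool, error) {
	var e apiError
	if err := json.Unmarshal(body, &e); err != nil {
		return false, fmt.Errorf("decoding API error: %w", err)
	}
	for _, item := range e.Errors {
		if terminalCodes[item.Code] {
			return true, nil
		}
	}
	return false, nil
}

// shouldRetry allows another attempt only when the error is not known to be
// terminal and the attempt budget is not exhausted. Payloads that cannot be
// decoded are treated as unknown and stay retryable within the budget.
func shouldRetry(body []byte, attempt, maxAttempts int) bool {
	if terminal, err := isTerminal(body); err == nil && terminal {
		return false
	}
	return attempt < maxAttempts
}

func main() {
	body := []byte(`{"errors":[{"code":"precondition_failed","message":"cannot add more than 5 gateways to the selected region"}],"trace":"5261aa71-e822-4340-baef-8c35e6186852"}`)
	fmt.Println(shouldRetry(body, 1, 5)) // prints "false"
}
```

With a split like this, a quota error would stop retries on the first attempt, while other failures would still get up to `maxAttempts` tries.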

hamzy commented 6 months ago
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-installer]$ oc get ibmpowervscluster -n openshift-cluster-api-guests -o yaml
apiVersion: v1
items:
- apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
  kind: IBMPowerVSCluster
  metadata:
    annotations:
      powervs.cluster.x-k8s.io/create-infra: "true"
    creationTimestamp: "2024-03-08T12:24:43Z"
    finalizers:
    - ibmpowervscluster.infrastructure.cluster.x-k8s.io
    generation: 1
    labels:
      cluster.x-k8s.io/cluster-name: rdr-hamzy-test-dal10-58hkl
    name: rdr-hamzy-test-dal10-58hkl
    namespace: openshift-cluster-api-guests
    ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: Cluster
      name: rdr-hamzy-test-dal10-58hkl
      uid: fd6c490c-c444-48e9-93b9-f573c82b1fb4
    resourceVersion: "436"
    uid: d72f51e2-3b1d-4db6-b89f-17d90525c623
  spec:
    controlPlaneEndpoint:
      host: ""
      port: 0
    cosInstance:
      bucketName: rhcos-powervs-images-us-south
      bucketRegion: us-south
      name: rdr-hamzy-test-dal10-58hkl-cos
    network:
      name: rdr-hamzy-test-dal10-58hkl-network
    resourceGroup:
      name: powervs-ipi-resource-group
    serviceInstance:
      id: 701beea6-d79d-4e8a-8e8a-8d122f3754b6
    serviceInstanceID: ""
    transitGateway:
      name: rdr-hamzy-test-dal10-58hkl-tg
    vpc:
      name: rdr-hamzy-test-dal10-58hkl-vpc
      region: us-south
    zone: dal10
  status:
    conditions:
    - lastTransitionTime: "2024-03-08T12:36:39Z"
      status: "True"
      type: NetworkReady
    - lastTransitionTime: "2024-03-08T12:24:45Z"
      status: "True"
      type: ServiceInstanceReady
    - lastTransitionTime: "2024-03-08T12:25:17Z"
      message: 'error creating transit gateway: cannot add more than 5 gateways to
        the selected region'
      reason: TransitGatewayReconciliationFailed
      severity: Error
      status: "False"
      type: TransitGatewayReady
    - lastTransitionTime: "2024-03-08T12:25:07Z"
      status: "True"
      type: VPCReady
    - lastTransitionTime: "2024-03-08T12:25:12Z"
      status: "True"
      type: VPCSubnetReady
    dhcpServer:
      controllerCreated: true
      id: 48a13744-959e-4c58-b3a1-0e3f5941a475
    network:
      controllerCreated: true
      id: 44e09ab9-b84c-4d70-8ac6-da0612f7e8d0
    ready: false
    resourceGroupID:
      controllerCreated: false
      id: c1cb9b2679344ee9951ab8b4bc22eca0
    vpc:
      controllerCreated: true
      id: r006-c5c1eb58-6685-48d3-a324-1885eafbcae9
    vpcSubnet:
      rdr-hamzy-test-dal10-58hkl-vpcsubnet-us-south-1:
        controllerCreated: true
        id: 0717-f8b6ae0b-d076-44c7-aa59-c60e20a7358b
      rdr-hamzy-test-dal10-58hkl-vpcsubnet-us-south-2:
        controllerCreated: true
        id: 0727-128430a8-69a6-4032-b95d-94ebf4603630
      rdr-hamzy-test-dal10-58hkl-vpcsubnet-us-south-3:
        controllerCreated: true
        id: 0737-ed2ea4cf-0958-4c72-82ee-f4994fb7526c
kind: List
metadata:
  resourceVersion: ""

mkumatag commented 6 months ago

@hamzy as we can see, the TransitGatewayReady condition in the status is already set to False with severity Error, which shows that something is wrong with the infra, and the cluster never becomes ready.

Considering the way the controllers are designed, they always try to make the resource available on the next retry, even after a failure. It is the user's conscious decision when to terminate the cluster based on the conditions, or to go and fix the environment in the backend so that the installation flow can proceed (e.g., the user asking an admin to bump the transit gateway limit in this case).

Maybe having a timeout in the installer, with some level of error checking of these conditions, would be a better way to deal with such situations.

mkumatag commented 1 month ago

Closing this issue as per the above comment.