kubernetes-sigs / ibm-powervs-block-csi-driver

CSI Driver for IBM® Power Systems™ Virtual Server
Apache License 2.0

too many cloud api calls in node-update-controller #442

Open yussufsh opened 1 year ago

yussufsh commented 1 year ago

/kind bug
/kind enhancement

What happened? There are a lot of cloud API calls from the node-update-controller: it repeatedly creates the PowerVS cloud object, and some of those calls fail.

Within a single minute there are ~13 calls that create a cloud object and then GET the PVM instance to check and set the storage affinity policy.

# oc logs ibm-powervs-block-csi-driver-controller-86f4c6459-gxn8f -c node-update-controller --previous | grep 'I0821 05:21' | wc -l
27

See the examples below: a few of the errors occur while fetching the PVM instance, and the last one occurs while creating the PowerVS client object, which is fatal and causes a container restart (see #441).

Examples:

# oc logs ibm-powervs-block-csi-driver-controller-86f4c6459-gxn8f -c node-update-controller --previous | grep -v 'StoragePoolAffinity' | grep -v 'PROVIDER-ID'
2023-08-19T02:42:36Z    INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8081"}
2023-08-19T02:42:36Z    INFO    setup   starting manager
2023-08-19T02:42:36Z    INFO    Starting server {"kind": "health probe", "addr": "[::]:8082"}
2023-08-19T02:42:36Z    INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8081"}
2023-08-19T02:42:36Z    INFO    Starting EventSource    {"controller": "node", "controllerGroup": "", "controllerKind": "Node", "source": "kind source: *v1.Node"}
2023-08-19T02:42:36Z    INFO    Starting Controller     {"controller": "node", "controllerGroup": "", "controllerKind": "Node"}
2023-08-19T02:42:36Z    INFO    Starting workers        {"controller": "node", "controllerGroup": "", "controllerKind": "Node", "worker count": 1}
I0819 05:54:42.543016       1 nodeupdate_controller.go:81] Unable to fetch Instance Details failed to Get PVM Instance 36776ce2-ef10-400b-be7d-c9511d00f01b :[GET /pcloud/v1/cloud-instances/{cloud_instance_id}/pvm-instances/{pvm_instance_id}][500] pcloudPvminstancesGetInternalServerError  &{Code:0 Description:pvm-instance 36776ce2-ef10-400b-be7d-c9511d00f01b in cloud-instance f4d71e5f9bea49f9a6fdae6f38c4b2cb error: failed to get server and update cache: timed out of retrieving resource for pvmInstanceServer:lon06:f4d71e5f9bea49f9a6fdae6f38c4b2cb:36776ce2-ef10-400b-be7d-c9511d00f01b Error:internal server error Message:}
I0820 06:54:24.914454       1 nodeupdate_controller.go:81] Unable to fetch Instance Details failed to Get PVM Instance 36776ce2-ef10-400b-be7d-c9511d00f01b :[GET /pcloud/v1/cloud-instances/{cloud_instance_id}/pvm-instances/{pvm_instance_id}][500] pcloudPvminstancesGetInternalServerError  &{Code:0 Description:pvm-instance 36776ce2-ef10-400b-be7d-c9511d00f01b in cloud-instance f4d71e5f9bea49f9a6fdae6f38c4b2cb error: failed to get server and update cache: timed out of retrieving resource for pvmInstanceServer:lon06:f4d71e5f9bea49f9a6fdae6f38c4b2cb:36776ce2-ef10-400b-be7d-c9511d00f01b Error:internal server error Message:}
I0820 17:30:31.360402       1 nodeupdate_controller.go:81] Unable to fetch Instance Details failed to Get PVM Instance 36776ce2-ef10-400b-be7d-c9511d00f01b :[GET /pcloud/v1/cloud-instances/{cloud_instance_id}/pvm-instances/{pvm_instance_id}][403] pcloudPvminstancesGetForbidden  &{Code:403 Description: Error: Message:user iam-ServiceId-c27c3ef5-8405-4dc1-9590-4440adaad19f does not have correct permissions to access crn:v1:bluemix:public:power-iaas:lon06:a/bf9f1f230466481b95a99f18739fede9:dbc67d5e-9579-49da-b1d9-fc2ec7ddc680:: with {role:user-unauthorized permissions (read:false write:false manage:false)}}
F0821 05:22:32.216618       1 powervs_node.go:69] Failed to get powervs cloud: errored while getting the Power VS service instance with ID: dbc67d5e-9579-49da-b1d9-fc2ec7ddc680, err: Get "https://resource-controller.cloud.ibm.com/v2/resource_instances/dbc67d5e-9579-49da-b1d9-fc2ec7ddc680": read tcp 192.168.81.10:46226->104.102.54.251:443: read: connection reset by peer
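To make the call pattern concrete, here is a minimal sketch of what each reconcile appears to do based on the logs above; the `PowerVSCloud` interface, `createPowerVSCloud`, and the method names are placeholders for illustration, not the driver's actual API:

```go
package main

import "fmt"

// Placeholder types standing in for the driver's real PowerVS cloud client.
type PVMInstance struct {
	StoragePoolAffinity bool
}

type PowerVSCloud interface {
	GetPVMInstance(id string) (*PVMInstance, error)
	UpdateStoragePoolAffinity(id string, affinity bool) error
}

// createPowerVSCloud stands in for the cloud factory, which authenticates
// with IAM and looks up the service instance via the resource controller.
func createPowerVSCloud() (PowerVSCloud, error) {
	// ... IAM authentication + resource-controller lookup ...
	return nil, fmt.Errorf("not implemented in this sketch")
}

// reconcileAffinity shows the per-reconcile pattern: a fresh cloud client is
// built and the PVM instance is fetched on every pass, even when the storage
// affinity policy has already been set to false.
func reconcileAffinity(instanceID string) error {
	cloud, err := createPowerVSCloud() // remote calls on every reconcile
	if err != nil {
		return fmt.Errorf("failed to get powervs cloud: %w", err)
	}
	instance, err := cloud.GetPVMInstance(instanceID) // one more cloud API call
	if err != nil {
		return fmt.Errorf("unable to fetch instance details: %w", err)
	}
	if instance.StoragePoolAffinity {
		return cloud.UpdateStoragePoolAffinity(instanceID, false)
	}
	return nil
}
```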

What you expected to happen? The node-update-controller should not make so many cloud API calls.

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

Environment

- Kubernetes version (use `kubectl version`):
- Driver version: latest

k8s-ci-robot commented 1 year ago

@yussufsh: The label(s) kind/enhancement cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubernetes-sigs/ibm-powervs-block-csi-driver/issues/442):

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
yussufsh commented 1 year ago

/assign @yussufsh

One solution could be to add a node label as soon as we set the Storage Affinity Policy to false on the PVM instance. Subsequent reconcile calls should check whether the node already has that label; if it does, there is no need to call the cloud APIs at all. A rough sketch of this is shown below.
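A minimal sketch of that approach, assuming a controller-runtime based reconciler; the label key `powervs.csi.ibm.com/storage-affinity-updated` and the `updateStoragePoolAffinity` helper are hypothetical names, not the driver's existing code:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical label key recording that the affinity policy was already set.
const affinityDoneLabel = "powervs.csi.ibm.com/storage-affinity-updated"

type NodeUpdateReconciler struct {
	client.Client
}

func (r *NodeUpdateReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	node := &corev1.Node{}
	if err := r.Get(ctx, req.NamespacedName, node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// If a previous reconcile already handled this node, the label is present
	// and no PowerVS cloud API calls are made at all.
	if node.Labels[affinityDoneLabel] == "true" {
		return ctrl.Result{}, nil
	}

	// Otherwise fall back to the existing path: create the cloud client, GET
	// the PVM instance, and set StoragePoolAffinity to false.
	if err := r.updateStoragePoolAffinity(ctx, node); err != nil {
		return ctrl.Result{}, err
	}

	// Record success on the node so later reconciles can short-circuit.
	patch := client.MergeFrom(node.DeepCopy())
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	node.Labels[affinityDoneLabel] = "true"
	return ctrl.Result{}, r.Patch(ctx, node, patch)
}

// updateStoragePoolAffinity stands in for the controller's existing logic
// that creates the PowerVS cloud object and updates the PVM instance.
func (r *NodeUpdateReconciler) updateStoragePoolAffinity(ctx context.Context, node *corev1.Node) error {
	// ... existing cloud API calls ...
	return nil
}
```

With the label check in place, an already-processed node costs no cloud-object creation and no GET PVM instance call on later reconciles.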

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

yussufsh commented 9 months ago

/remove-lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

yussufsh commented 6 months ago

/remove-lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

yussufsh commented 2 months ago

/remove-lifecycle stale