oracle / oci-cloud-controller-manager

Kubernetes Cloud Controller Manager implementation for Oracle Cloud Infrastructure
Apache License 2.0

CCM deletes cluster nodes when dynamic groups and policies are not set properly #434

Open adriengentil opened 11 months ago

adriengentil commented 11 months ago

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

CCM Version: 1.25+

Environment:

What happened?

We deployed the CCM (with useInstancePrincipals: true) into our cluster without setting up dynamic groups and policies in our compartment. As a consequence, the cluster nodes were deleted (kubectl get nodes returned no nodes).

This behavior made the investigation difficult: the CCM pods were evicted along with the nodes, so their logs were hard to access.

We suspect this behavior is not limited to OpenShift.

What you expected to happen?

Nodes are left uninitialized, and the CCM logs a meaningful message and retries until the user creates the required policies in OCI.
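To make the expected behavior concrete, here is a minimal Go sketch of a retry loop that leaves the node alone and keeps retrying with capped backoff. It is only an illustration, not the CCM's actual code: errNotAuthorizedOrNotFound, lookupFunc and initializeWithRetry are hypothetical names, and the lookup is simulated rather than calling OCI.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotAuthorizedOrNotFound is a hypothetical sentinel standing in for the
// OCI 404 NotAuthorizedOrNotFound response.
var errNotAuthorizedOrNotFound = errors.New("NotAuthorizedOrNotFound")

// lookupFunc is a hypothetical stand-in for the instance lookup.
type lookupFunc func(node string) error

// initializeWithRetry calls lookup until it succeeds or maxAttempts is reached,
// doubling the delay after each failure up to maxDelay. It never deletes the
// node. It returns the number of attempts made and whether the lookup succeeded.
func initializeWithRetry(node string, lookup lookupFunc, maxAttempts int, base, maxDelay time.Duration) (int, bool) {
	delay := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := lookup(node)
		if err == nil {
			return attempt, true
		}
		if errors.Is(err, errNotAuthorizedOrNotFound) {
			// The response is ambiguous, so say so instead of assuming the instance is gone.
			fmt.Printf("node %s left uninitialized: instance not found or not authorized; "+
				"check dynamic groups and policies (attempt %d)\n", node, attempt)
		} else {
			fmt.Printf("node %s left uninitialized: %v (attempt %d)\n", node, err, attempt)
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
			if delay > maxDelay {
				delay = maxDelay
			}
		}
	}
	return maxAttempts, false
}

func main() {
	calls := 0
	// Simulate the user creating the policies after two failed attempts.
	lookup := func(string) error {
		calls++
		if calls < 3 {
			return errNotAuthorizedOrNotFound
		}
		return nil
	}
	attempts, ok := initializeWithRetry("master-2", lookup, 5, time.Millisecond, 10*time.Millisecond)
	fmt.Printf("attempts=%d initialized=%t\n", attempts, ok)
}
```

In a real controller the retry would more likely go through the work queue's rate limiter than a blocking sleep; the loop above only shows the intended outcome.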

How to reproduce it (as minimally and precisely as possible)?

Provision a cluster and ensure:

then deploy the CCM with the useInstancePrincipals: true config option. At this point, the CCM deletes the nodes.

Anything else we need to know?

Here are the logs of the CCM pod before it deletes a node:

I0717 15:13:56.437876       1 node_controller.go:415] Initializing node test-infra-cluster-4107b8b3-master-2 with cloud provider
E0717 15:13:56.437954       1 node_controller.go:229] error syncing 'test-infra-cluster-4107b8b3-master-2': failed to get instance metadata for node test-infra-cluster-4107b8b3-master-2: error fetching node by provider ID: compartmentID annotation missing in the node. Would retry, and error by node name: error getting CompartmentID from Node Name: compartmentID annotation missing in the node. Would retry, requeuing
2023-07-17T15:13:56.969Z    ERROR   oci/node_info_controller.go:244 Failed to get instance from instance ID {"component": "cloud-controller-manager", "node": "test-infra-cluster-4107b8b3-master-2", "error": "Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found\nOperation Name: GetInstance\nTimestamp: 2023-07-17 15:13:54 +0000 GMT\nClient Version: Oracle-GoSDK/65.2.0\nRequest Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq\nTroubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.\nAlso see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.\nTo get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.\nIf you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error message.", "errorVerbose": "Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found\nOperation Name: GetInstance\nTimestamp: 2023-07-17 15:13:54 +0000 GMT\nClient Version: Oracle-GoSDK/65.2.0\nRequest Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq\nTroubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.\nAlso see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.\nTo get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.\nIf you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error message.\ngithub.com/oracle/oci-cloud-controller-manager/pkg/oci/client.(*client).GetInstance\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/oci/client/compute.go:50\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.getInstanceByNode\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:242\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:168\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processNextItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:139\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).runWorker\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:124\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).Run\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571"}
2023-07-17T15:13:56.969Z    ERROR   oci/node_info_controller.go:142 Error processing node test-infra-cluster-4107b8b3-master-2 (will retry): Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found
Operation Name: GetInstance
Timestamp: 2023-07-17 15:13:54 +0000 GMT
Client Version: Oracle-GoSDK/65.2.0
Request Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq
Troubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.
Also see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.
To get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.
If you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error message.    {"component": "cloud-controller-manager"}
I0717 15:13:58.998504       1 node_controller.go:415] Initializing node test-infra-cluster-4107b8b3-master-2 with cloud provider
E0717 15:13:58.998590       1 node_controller.go:229] error syncing 'test-infra-cluster-4107b8b3-master-2': failed to get instance metadata for node test-infra-cluster-4107b8b3-master-2: error fetching node by provider ID: compartmentID annotation missing in the node. Would retry, and error by node name: error getting CompartmentID from Node Name: compartmentID annotation missing in the node. Would retry, requeuing
I0717 15:13:59.329292       1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: test-infra-cluster-4107b8b3-master-2
I0717 15:13:59.329476       1 event.go:294] "Event occurred" object="test-infra-cluster-4107b8b3-master-2" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node test-infra-cluster-4107b8b3-master-2 because it does not exist in the cloud provider"
2023-07-17T15:14:03.394Z    ERROR   oci/node_info_controller.go:142 Error processing node test-infra-cluster-4107b8b3-master-0 (will retry): node "test-infra-cluster-4107b8b3-master-0" not found  {"component": "cloud-controller-manager"}
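In the logs above, the 404 NotAuthorizedOrNotFound response is followed by "deleting node since it is no longer present in cloud provider". OCI returns that same error code both when a resource does not exist and when the caller lacks permission to see it, so the response alone does not prove the instance is gone. One possible guard, sketched below in Go, is to requeue instead of deleting unless the credentials are known to work. All names here (apiError, decide, credentialsVerified) are hypothetical and are not taken from the CCM source.

```go
package main

import "fmt"

// action is what the controller should do with the node.
type action string

const (
	keepNode    action = "keep"
	requeueNode action = "requeue"
	deleteNode  action = "delete"
)

// apiError is a hypothetical, simplified view of an OCI service error.
type apiError struct {
	HTTPStatus int
	Code       string
}

// decide maps the result of an instance lookup to an action.
// credentialsVerified is an assumed flag meaning "an earlier API call with the
// same principal succeeded", so a 404 is more likely a genuinely missing instance.
func decide(err *apiError, credentialsVerified bool) action {
	switch {
	case err == nil:
		return keepNode
	case err.HTTPStatus == 404 && err.Code == "NotAuthorizedOrNotFound" && credentialsVerified:
		return deleteNode
	default:
		// Ambiguous or transient failure: never delete, try again later.
		return requeueNode
	}
}

func main() {
	notFound := &apiError{HTTPStatus: 404, Code: "NotAuthorizedOrNotFound"}
	fmt.Println(decide(notFound, false)) // requeue: policies may be missing
	fmt.Println(decide(notFound, true))  // delete: credentials are known to work
	fmt.Println(decide(nil, true))       // keep: instance was found
}
```

With a guard like this, the scenario in this report (no dynamic groups or policies) would end in a requeue rather than a node deletion.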