oracle / cluster-api-provider-oci

Kubernetes Cluster API Provider for Oracle Cloud Infrastructure
https://oracle.github.io/cluster-api-provider-oci/
Apache License 2.0

failed calling webhook error #345

Closed mehdibenfeguir closed 7 months ago

mehdibenfeguir commented 8 months ago

What happened: trying to create a new managed cluster from an existing OKE cluster

What you expected to happen: the new managed cluster to be created

How to reproduce it (as minimally and precisely as possible): I ran clusterctl init --infrastructure oci and the CRDs were created fine

running

OCI_COMPARTMENT_ID={my_compartment_id_here} \
OCI_IMAGE_ID={my_ubuntu_image_ocid_here} \
OCI_SSH_KEY=-{path_to_my_private_key_here}  \
CONTROL_PLANE_MACHINE_COUNT=1 \
KUBERNETES_VERSION=v1.27.2 \
NAMESPACE=default \
NODE_MACHINE_COUNT=1 \
clusterctl generate cluster capi-mbf \
--from ~/downloads/cluster-template-managed.yaml | kubectl apply -f -

the file ~/downloads/cluster-template-managed.yaml is fetched from this link: https://github.com/oracle/cluster-api-provider-oci/releases/download/v0.14.0/cluster-template-managed.yaml

results in these errors

New clusterctl version available: v1.6.0 -> v1.6.1
sigs.k8s.io/cluster-api
cluster.cluster.x-k8s.io/capi-mbf configured
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "default.ocimanagedcluster.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capoci-webhook-service.cluster-api-provider-oci-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-ocimanagedcluster?timeout=10s": EOF
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "default.ocimanagedcontrolplane.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capoci-webhook-service.cluster-api-provider-oci-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-ocimanagedcontrolplane?timeout=10s": EOF
Error from server (Forbidden): error when creating "STDIN": admission webhook "validation.machinepool.cluster.x-k8s.io" denied the request: spec: Forbidden: can be set only if the MachinePool feature flag is enabled
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "default.ocimanagedmachinepool.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capoci-webhook-service.cluster-api-provider-oci-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-ocimanagedmachinepool?timeout=10s": EOF

Anything else we need to know?: Could anyone help me identify the exact issue? Environment:

shyamradhakrishnan commented 8 months ago

@mehdibenfeguir The error below shows:

Error from server (Forbidden): error when creating "STDIN": admission webhook "validation.machinepool.cluster.x-k8s.io" denied the request: spec: Forbidden: can be set only if the MachinePool feature flag is enabled

Please enable the MachinePool feature flag before running the clusterctl init --infrastructure oci command.
Please see the doc https://oracle.github.io/cluster-api-provider-oci/managed/managedcluster.html#environment-variables

You will have to run clusterctl delete --all and then reinitialize after exporting the variable.
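
For example, roughly (EXP_MACHINE_POOL is the Cluster API feature gate the doc refers to; adjust to the linked doc if it differs):

clusterctl delete --all                       # remove the previously initialized providers
export EXP_MACHINE_POOL=true                  # enable the MachinePool feature gate
clusterctl init --infrastructure oci          # reinitialize with the flag exported
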
mehdibenfeguir commented 8 months ago

ok, I did that; the MachinePool error is gone, but I'm still getting these errors

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "default.ocimanagedcluster.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capoci-webhook-service.cluster-api-provider-oci-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-ocimanagedcluster?timeout=10s": EOF
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "default.ocimanagedcontrolplane.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capoci-webhook-service.cluster-api-provider-oci-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-ocimanagedcontrolplane?timeout=10s": EOF
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "default.ocimanagedmachinepool.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capoci-webhook-service.cluster-api-provider-oci-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-ocimanagedmachinepool?timeout=10s": EOF
shyamradhakrishnan commented 8 months ago

Are the CAPOCI pods running properly? Can you check the output of kubectl get pods -n cluster-api-provider-oci-system and see if the capoci pods are running fine? If not, can you check the logs using the kubectl logs command to see why they are not running?
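
For example (the deployment name is inferred from the pod name shown later in this thread):

kubectl get pods -n cluster-api-provider-oci-system
kubectl logs -n cluster-api-provider-oci-system deploy/capoci-controller-manager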

mehdibenfeguir commented 8 months ago

k logs capoci-controller-manager-5648c768-mpjwb -n cluster-api-provider-oci-system

I0118 10:36:42.416442       1 main.go:240] "setup: CAPOCI Version" version="v0.14.0"
E0118 10:36:42.416470       1 main.go:249] "setup: unable to get OCI region from AuthConfigProvider" err="region can not be empty or have spaces"

how should I provide the region to the management cluster? I'm doing export OCI_REGION=me-jeddah-1

shyamradhakrishnan commented 8 months ago

Please follow the instructions here https://oracle.github.io/cluster-api-provider-oci/gs/install-cluster-api.html#install-cluster-api-provider-for-oracle-cloud-infrastructure to provide the details. If you are using OKE as the management cluster, "Instance Principals" are recommended for production, although a user principal may be easier to start with.
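
A rough sketch of the user principal setup from the linked doc (variable names as in that doc; OCIDs, fingerprint, and key path are placeholders):

export OCI_TENANCY_ID=<tenancy-ocid>
export OCI_USER_ID=<user-ocid>
export OCI_CREDENTIALS_FINGERPRINT=<api-key-fingerprint>
export OCI_REGION=me-jeddah-1
export OCI_TENANCY_ID_B64="$(echo -n "$OCI_TENANCY_ID" | base64 | tr -d '\n')"
export OCI_USER_ID_B64="$(echo -n "$OCI_USER_ID" | base64 | tr -d '\n')"
export OCI_CREDENTIALS_FINGERPRINT_B64="$(echo -n "$OCI_CREDENTIALS_FINGERPRINT" | base64 | tr -d '\n')"
export OCI_REGION_B64="$(echo -n "$OCI_REGION" | base64 | tr -d '\n')"
export OCI_CREDENTIALS_KEY_B64="$(base64 < <path-to-oci-api-private-key.pem> | tr -d '\n')"

clusterctl init --infrastructure oci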

mehdibenfeguir commented 8 months ago

yes I'm using the exact same config and I'm using export OCI_REGION=me-jeddah-1 and it's still complaining about the region

I0118 10:36:42.416442       1 main.go:240] "setup: CAPOCI Version" version="v0.14.0"
E0118 10:36:42.416470       1 main.go:249] "setup: unable to get OCI region from AuthConfigProvider" err="region can not be empty or have spaces"
shyamradhakrishnan commented 8 months ago

did you execute this step as well? export OCI_REGION_B64="$(echo -n "$OCI_REGION" | base64 | tr -d '\n')"

mehdibenfeguir commented 8 months ago

but it's conditional ("if Passphrase is present"), so I didn't

Let me add it

mehdibenfeguir commented 8 months ago

ok, now it's complaining about the private key even though I provided it like this: export OCI_CREDENTIALS_KEY_B64=$(base64 < ~/.ssh/id_rsa | tr -d '\n'). I also tried echo $OCI_CREDENTIALS_KEY_B64 and it shows the encoded content

k logs capoci-controller-manager-5648c768-kbm4m  -n cluster-api-provider-oci-system
I0118 11:15:42.408639       1 main.go:240] "setup: CAPOCI Version" version="v0.14.0"
E0118 11:15:42.408775       1 clients.go:188]  "msg"="unable to create OCI VCN Client" "error"="can not create client, bad configuration: failed to parse private key"
E0118 11:15:42.408798       1 main.go:261] "setup: authentication provider could not be initialised" err="can not create client, bad configuration: failed to parse private key"
shyamradhakrishnan commented 8 months ago

The private key is not the SSH private key; it should be the OCI API private key. Please go through the doc https://docs.oracle.com/en-us/iaas/Content/API/Concepts/apisigningkey.htm, create a private key in the ~/.oci folder, and provide that path.
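
For example, a sketch of creating an API signing key pair along the lines of that doc (file names are illustrative):

openssl genrsa -out ~/.oci/oci_api_key.pem 2048
chmod 600 ~/.oci/oci_api_key.pem
openssl rsa -pubout -in ~/.oci/oci_api_key.pem -out ~/.oci/oci_api_key_public.pem
# upload oci_api_key_public.pem under your user's API keys in the OCI console, note the fingerprint,
# then re-export the key for CAPOCI:
export OCI_CREDENTIALS_KEY_B64="$(base64 < ~/.oci/oci_api_key.pem | tr -d '\n')"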

mehdibenfeguir commented 8 months ago

ok, thanks, all the issues have been fixed and I was able to apply the cluster resources with no errors

sigs.k8s.io/cluster-api
cluster.cluster.x-k8s.io/capi-mbf configured
ocimanagedcluster.infrastructure.cluster.x-k8s.io/capi-mbf created
ocimanagedcontrolplane.infrastructure.cluster.x-k8s.io/capi-mbf created
machinepool.cluster.x-k8s.io/capi-mbf-mp-0 configured
ocimanagedmachinepool.infrastructure.cluster.x-k8s.io/capi-mbf-mp-0 created

but the cluster is not created

k describe cluster capi-mbf
Name:         capi-mbf
Namespace:    default
Labels:       cluster.x-k8s.io/cluster-name=capi-mbf
Annotations:  <none>
API Version:  cluster.x-k8s.io/v1beta1
Kind:         Cluster
Metadata:
  Creation Timestamp:  2024-01-18T09:12:28Z
  Finalizers:
    cluster.cluster.x-k8s.io
  Generation:        8
  Resource Version:  328221976
  UID:               b59c366f-1aa9-4aa7-8324-652880918aec
Spec:
  Control Plane Endpoint:
    Host:
    Port:  0
  Control Plane Ref:
    API Version:  infrastructure.cluster.x-k8s.io/v1beta1
    Kind:         OCIManagedControlPlane
    Name:         capi-mbf
    Namespace:    default
  Infrastructure Ref:
    API Version:  infrastructure.cluster.x-k8s.io/v1beta1
    Kind:         OCIManagedCluster
    Name:         capi-mbf
    Namespace:    default
Status:
  Conditions:
    Last Transition Time:  2024-01-18T11:36:49Z
    Reason:                WaitingForControlPlane
    Severity:              Info
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-01-18T11:36:49Z
    Message:               Waiting for control plane provider to indicate the control plane has been initialized
    Reason:                WaitingForControlPlaneProviderInitialized
    Severity:              Info
    Status:                False
    Type:                  ControlPlaneInitialized
    Last Transition Time:  2024-01-18T11:36:49Z
    Reason:                WaitingForControlPlane
    Severity:              Info
    Status:                False
    Type:                  ControlPlaneReady
    Last Transition Time:  2024-01-18T11:36:49Z
    Reason:                WaitingForInfrastructure
    Severity:              Info
    Status:                False
    Type:                  InfrastructureReady
  Observed Generation:     8
  Phase:                   Provisioning
Events:                    <none>
shyamradhakrishnan commented 8 months ago

The control plane has not been created. You can either describe the ocimanagedcontrolplane object or look at the CAPOCI logs (preferable).
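
For example (names taken from this thread):

kubectl describe ocimanagedcontrolplane capi-mbf -n default
kubectl logs -n cluster-api-provider-oci-system deploy/capoci-controller-manager -f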

mehdibenfeguir commented 8 months ago

the logs show this, but I provided the PEM key and I am authenticated; when I run this command I get the list: oci iam region list --config-file /Users/mehdibenfeguir/.oci/config --profile mehdi --auth security_token

E0118 11:44:08.554708       1 controller.go:329] "Reconciler error" err=<
    Error returned by ContainerEngine Service. Http Status Code: 401. Error Code: NotAuthenticated. Opc request id: 662ee8794ff90c3fd7213fb040eb6cc7/E18DCD24600BBA005B143903C85449B4/72D3DA271E8E93B70DFE2A58AA0BD9D8. Message: Failed to verify the HTTP(S) Signature
shyamradhakrishnan commented 8 months ago

Definitely a problem with your PEM key. Did you create a PEM key and upload it to the OCI console as explained in the doc? In the command you ran, you are using a security token, not the private key.

mehdibenfeguir commented 8 months ago

ok so now it's a different error

E0118 12:07:03.792027       1 vcn_reconciler.go:101] "failed to list vcn by name" err=<
    Error returned by VirtualNetwork Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: 7f6846f93cdf9778d3c1ac8bad1b7649/B7CE5A411106A9BA93DE717DB3FF8BAB/5B493160620374EB44402EE936D865DF. Message: Authorization failed or requested resource not found.

is it supposed to create a new VCN or check for an existing one?

shyamradhakrishnan commented 8 months ago

It will create a new VCN. Have you added the necessary policies to the user? Please add the policies mentioned here: https://oracle.github.io/cluster-api-provider-oci/gs/iam/iam-oke.html

You can also verify with the OCI CLI whether you are able to list VCNs, etc.
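
As a sketch, the policy statements look roughly like the ones below (group and compartment names are placeholders; the linked doc is the authoritative list):

Allow group <capi-group> to manage virtual-network-family in compartment <compartment>
Allow group <capi-group> to manage cluster-family in compartment <compartment>
Allow group <capi-group> to manage instance-family in compartment <compartment>

Then verify access with the OCI CLI, e.g.:

oci network vcn list --compartment-id <compartment-ocid>
oci ce cluster list --compartment-id <compartment-ocid>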

mehdibenfeguir commented 8 months ago

ok, policies added and the cluster was created with 0 node pools. Checking the logs, I'm getting this:

failed to create OCIManagedMachinePool: Error returned by ContainerEngine Service. Http Status Code: 400. Error Code: InvalidParameter. Opc request id: 338c92f544911fd20e94c9a39d6c5550/91FED560B24118FF7A883374A1A55790/1D5EF9F8F8A2E126712DA1DD568A3BF8. Message: Invalid sshPublicKey: Provided key is not a valid OpenSSH public key. Operation Name: CreateNodePool Timestamp: 2024-01-18 12:55:49 +0000 GMT Client Version: Oracle-GoSDK/65.45.0 Request Endpoint: POST https://containerengine.me-jeddah-1.oci.oraclecloud.com/20180222/nodePools Troubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_400__400_invalidparameter for more information about resolving this error. Also see https://docs.oracle.com/iaas/api/#/en/containerengine/20180222/NodePool/CreateNodePool for details on this operation's requirements. To get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details. If you are unable to resolve this ContainerEngine issue, please contact Oracle support and provide them this full error message.

I added the public key provided by the OCI console, then passed it as an environment variable when creating the cluster. Am I doing anything wrong?

OCI_COMPARTMENT_ID={oci_compartment_id} \
OCI_IMAGE_ID={ocid_image_id}\
OCI_SSH_KEY=/Users/mehdibenfeguir/downloads/public.pem  \  #This file was downloaded from OCI console
CONTROL_PLANE_MACHINE_COUNT=1 \
KUBERNETES_VERSION=v1.27.2 \
NAMESPACE=default \
NODE_MACHINE_COUNT=1 \
clusterctl generate cluster capi-mbf \
--from  /Users/mehdibenfeguir/downloads/cluster-template-managed.yaml | kubectl apply -f -
shyamradhakrishnan commented 8 months ago

When you create a managed node pool, the key that has to be provided in the managed node pool params is an SSH public key, not the OCI API key. This SSH key will be used to SSH into the machines.
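
A minimal sketch, assuming you do not already have an SSH key pair (the key path is illustrative; OCI_SSH_KEY should contain the public key contents, as in the command later in this thread):

ssh-keygen -t ed25519 -f ~/.ssh/capi_oke -N ""        # generate a key pair for SSH access to the nodes
export OCI_SSH_KEY="$(cat ~/.ssh/capi_oke.pub)"       # pass the public key itself, not a file path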

mehdibenfeguir commented 8 months ago

oh, so I need to provide my personal public SSH key. Great, let me try

mehdibenfeguir commented 8 months ago

it's working!! Thank you very much @shyamradhakrishnan for the precious help. I suggest enhancing the docs, especially around the SSH keys; it's a little bit confusing.

Sorry, but one last question: when I want to clean up and run the command kubectl delete cluster {cluster_name}, the cluster gets deleted but not the VCN. Is there any possible solution to automate the cleanup?

shyamradhakrishnan commented 8 months ago

The VCN should be deleted if you deleted the cluster using kubectl delete cluster. You can verify the logs; maybe it is an older VCN? Or maybe there was an error during deletion of the VCN?

mehdibenfeguir commented 8 months ago

these are the logs:

failed to delete subnet: Error returned by VirtualNetwork Service. Http Status Code: 409. Error Code: Conflict. Opc request id: 843868a7e1f565e0b1de0d6e66339f8a/EE2B2536FECCA3EF5F12F5208F13456F/43E7DE93F48B26EFF2F5E0F91FAEECDD. Message: The Subnetxxxx references the VNIC xxx. You must remove the reference to proceed with this operation.

shyamradhakrishnan commented 8 months ago

did you create an LB service, or anything in the cluster? Are all the compute instances deleted?

mehdibenfeguir commented 8 months ago

I just did this, and yes, all the compute instances were deleted fine

OCI_COMPARTMENT_ID=xxx \
OCI_IMAGE_ID=xxx \
OCI_SSH_KEY="$(cat /Users/mehdibenfeguir/.ssh/id_rsa.pub)"  \
CONTROL_PLANE_MACHINE_COUNT=1 \
KUBERNETES_VERSION=v1.28.2 \
NAMESPACE=default \
NODE_MACHINE_COUNT=1 \
clusterctl generate cluster capi-mbf \
--from  /Users/mehdibenfeguir/downloads/cluster-template-managed.yaml | kubectl apply -f -
shyamradhakrishnan commented 8 months ago

hmm, the error clearly shows there is a VNIC resource attached to the subnet, which is why the subnet could not be deleted. So you did not create any pod or any other resource in the cluster? And you deleted the cluster using the kubectl delete cluster command? What is the name of the subnet which could not be deleted?

mehdibenfeguir commented 7 months ago

So you did not create any pod, or any resource in the cluster? yes

And you deleted the cluster using kubectl delete cluster command? yes

What is the name of the subnet which could not be deleted? the subnet that includes the name of capi-mbf

shyamradhakrishnan commented 7 months ago

It would be great if you could provide the full name of the subnet. Can you execute the command oci network vnic get and see what the VNIC is attached to? Ideally, if the delete reaches the subnet, it should already have deleted all other resources, so unless you see any other errors in the logs, we will have to debug this more.
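
For example (the VNIC OCID is a placeholder):

oci network vnic get --vnic-id <vnic-ocid>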

mehdibenfeguir commented 7 months ago

I already removed it manually so I can't check the exact name now

shyamradhakrishnan commented 7 months ago

Thanks. Can we close this ticket? You can create a new one if you notice it again.

mehdibenfeguir commented 7 months ago

ok thanks for helping