nutanix / docker-machine

Rancher Node Driver for Nutanix AHV
https://www.nutanix.com/products/acropolis/virtualization
Mozilla Public License 2.0

Driver does not connect #18

Closed eyanez111 closed 2 years ago

eyanez111 commented 2 years ago

I used:

It was registered on the Rancher server. I am trying to build an RKE1 cluster, and I created my node template:

{
  "annotations": {
    "ownerBindingsCreated": "true"
  },
  "baseType": "nodeTemplate",
  "cloudCredentialId": null,
  "created": "2021-12-22T00:07:34Z",
  "createdTS": 1640131654000,
  "creatorId": "user-xtj9l",
  "driver": "nutanix",
  "engineEnv": {},
  "engineInstallURL": "https://releases.rancher.com/install-docker/18.09.sh",
  "engineLabel": {},
  "engineOpt": {},
  "engineRegistryMirror": [],
  "id": "cattle-global-nt:nt-tphlk",
  "labels": {
    "cattle.io/creator": "norman"
  },
  "links": {
    "nodePools": "…/v3/nodePools?nodeTemplateId=cattle-global-nt%3Ant-tphlk",
    "nodes": "…/v3/nodes?nodeTemplateId=cattle-global-nt%3Ant-tphlk",
    "remove": "…/v3/nodeTemplates/cattle-global-nt:nt-tphlk",
    "self": "…/v3/nodeTemplates/cattle-global-nt:nt-tphlk",
    "update": "…/v3/nodeTemplates/cattle-global-nt:nt-tphlk"
  },
  "name": "RK1-test",
  "nutanixConfig": {
    "cloudInit": "#cloud-config\nusers:\n- name: tony\n sudo: ['ALL=(ALL) NOPASSWD:ALL']\n ssh-authorized-keys:\n - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDDNhhR0Wf4GSz1K5cLdIYPcrKG27irKGgbkzyb3JS/x1irCysGPi9SIj5gChBGNGv99p9gZGPGFgL+CYdXdCORgyT........
    "cluster": "NTX-DEV",
    "diskSize": "0",
    "endpoint": "ntx-dev.URL.com",
    "insecure": false,
    "password": "XXXX",
    "port": "9440",
    "storageContainer": "VM",
    "username": "nutanix_support",
    "vmCategories": [],
    "vmCores": "1",
    "vmCpuPassthrough": false,
    "vmCpus": "2",
    "vmImage": "CentOS-7-x86_64-GenericCloud-1907",
    "vmImageSize": "300",
    "vmMem": "4096",
    "vmNetwork": [
      "Software Development Apps (VLAN 125)"
    ]
  },
  "principalId": "local://user-xtj9l",
  "state": "active",
  "transitioning": "no",
  "transitioningMessage": "",
  "type": "nodeTemplate",
  "useInternalIpAddress": true,
  "uuid": "4cf59fe2-bb41-4ded-99d7-fb11e527e0f2"
}

I am getting this error: Error creating machine: Error in driver during machine creation: error: {:Timeout waiting for ssh key

Is this a problem with the driver? There is no way for me to add an SSH key when I create a template.

tuxtof commented 2 years ago

Hello @eyanez111, adding an SSH key inside the template is no problem; it works perfectly.

The error message doesn't come from the driver itself. It may be a communication problem between Rancher and the driver.

Can you give me the following information:

Can you activate debug logging as described in this doc => https://rancher.com/docs/rancher/v2.6/en/troubleshooting/logging/

and then share here the entire Rancher log during a cluster creation?

eyanez111 commented 2 years ago

Hello @tuxtof

How do I add keys from a Nutanix user? I do not think there is a way to get keys. You just assign passwords, don't you? If you think this is a communication problem, I do not get why it is expecting an SSH key. Where in a Rancher node template can you put a key? As far as I can see, there is no field for that.

tuxtof commented 2 years ago

The SSH key is for communication between Rancher and the VM; the keys are generated automatically by the driver. You have nothing to do on this subject. The only SSH key you can add is inside the cloud-init, but that is for your own use if you want to connect to the VM without using the Rancher SSH key.
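For the optional cloud-init key mentioned above, here is a minimal Python sketch of building the cloudInit string with consistent indentation and embedding it in JSON. The user name `tony` and the sudo rule come from the template earlier in this thread. The helper name `build_cloud_init` and the placeholder key are hypothetical, and `ssh_authorized_keys` is the cloud-init key spelling assumed here.

```python
import json


def build_cloud_init(user, public_key):
    """Return a #cloud-config document that adds one user with one SSH key.

    Hypothetical helper for illustration; it only builds text and does not
    talk to Rancher or Nutanix.
    """
    lines = [
        "#cloud-config",
        "users:",
        f"- name: {user}",
        # Keys of the user entry are indented two spaces under the list item.
        "  sudo: ['ALL=(ALL) NOPASSWD:ALL']",
        "  ssh_authorized_keys:",
        # Each key is a list item nested under ssh_authorized_keys.
        f"    - {public_key}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder key: replace with your own public key.
cloud_init = build_cloud_init("tony", "ssh-rsa <your-public-key>")

# json.dumps escapes the newlines, so the indentation survives inside JSON.
fragment = json.dumps({"cloudInit": cloud_init})
print(fragment)
```

Generating the string this way avoids hand-editing `\n` sequences and leading spaces, which is where indentation mistakes tend to creep in.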

You also need to verify your template file. I am not sure everything in it is OK, but I don't know your environment.

Once your cluster creation was launched, did you see the VMs in PC?

Regards

eyanez111 commented 2 years ago

Adding the logs:

2021/12/23 21:10:21 [INFO] Generating and uploading node config worker3
2021/12/23 21:10:21 [INFO] Generating and uploading node config control-plane1
2021/12/23 21:10:21 [INFO] Generating and uploading node config worker1
2021/12/23 21:10:21 [INFO] Generating and uploading node config worker2
2021/12/23 21:10:36 [ERROR] error syncing 'c-m4t6l/m-48mc6': handler node-controller: Error creating machine: Error in driver during machine creation: error: {, requeuing
2021/12/23 21:10:36 [ERROR] error syncing 'c-m4t6l/m-bv8k7': handler node-controller: Error creating machine: Error in driver during machine creation: error: {, requeuing
2021/12/23 21:10:36 [ERROR] error syncing 'c-m4t6l/m-xqh6g': handler node-controller: Error creating machine: Error in driver during machine creation: error: {, requeuing
2021/12/23 21:10:36 [ERROR] error syncing 'c-m4t6l/m-q7hz8': handler node-controller: Error creating machine: Error in driver during machine creation: error: {, requeuing
2021/12/23 21:10:36 [ERROR] error syncing 'c-m4t6l/m-92jp2': handler node-controller: Error creating machine: Error in driver during machine creation: error: {, requeuing
2021/12/23 21:10:36 [ERROR] error syncing 'c-m4t6l/m-8svvp': handler node-controller: Error creating machine: Error in driver during machine creation: error: {, requeuing

Also, I am getting this on the Rancher server: Error creating machine: Error in driver during machine creation: error: {:Timeout waiting for ssh key

adding pic:

Screenshot 2021-12-23 131323

eyanez111 commented 2 years ago

Let me answer these questions here:

"it seems there is an indentation issue in your cloudInit, you can test with an empty cloudInit to verify"
I did, and I still get the same error and the same logs. Do you want them?

"storageContainer needs to be a UUID if you ask for a second disk, but I see the size of the second disk is 0?"
So you recommend adding a second disk for the cluster infra? I can add one if that would make a difference.

"for endpoint, is ntx-dev.URL.com your PC instance?"
Yes, that is the Prism Central domain we use.

"for cluster name, NTX-DEV is uppercase, is that expected?"
Yes, we have it like that in Nutanix.

Screenshot 2021-12-23 132602

"you have created a nutanix_support admin user in PC?"
Yes, that is a user I created in PC with admin rights for Nutanix support to use when they want to tunnel in. It is also in PE. Nutanix support uses it all the time and has no problem tunneling in.

"insecure is set to false, did you set a correct certificate chain for your PC?"
I think it is set as secure. I have tried both ways and get the same result, but how can I verify what is set in PC?

"Once your cluster creation launched, did you see VM in PC?"
No, I checked and nothing was created.

Thanks in advance. I think I am pretty close

tuxtof commented 2 years ago

The logs do not seem to be in debug mode. Did you change the log level?

eyanez111 commented 2 years ago

I followed the guide you passed me and I did:

$ KUBECONFIG=./kube_config_cluster.yml
$ kubectl -n cattle-system get pods -l app=rancher --no-headers -o custom-columns=name:.metadata.name | while read rancherpod; do kubectl -n cattle-system exec $rancherpod -c rancher -- loglevel --set debug; done
OK
OK
OK
$ kubectl -n cattle-system logs -l app=rancher -c rancher

Am I missing anything in the command?

eyanez111 commented 2 years ago

OK, I found a way to get the logs: logs.txt

Thanks, I think I am almost there... this has been helpful.

tuxtof commented 2 years ago

OK, it looks better now: you have debug entries in the log, but I don't see the creation step in these logs. Is it the correct time range? Is it the combined log of the three containers?

The beginning of the creation in the log should start with something like:

2021/12/24 06:04:45 [INFO] [node-controller-rancher-machine] Docker Machine Version:  v0.15.0-rancher70, build e51aa220
2021/12/24 06:04:45 [INFO] [node-controller-rancher-machine] Found binary path at /var/lib/rancher/management-state/bin/docker-machine-driver-nutanix
2021/12/24 06:04:45 [INFO] [node-controller-rancher-machine] Launching plugin server for driver nutanix
2021/12/24 06:04:45 [INFO] [node-controller-rancher-machine] Plugin server listening at address 127.0.0.1:46631

In any case, can you switch the log level to trace so we have the entire communication? I have been trying to reproduce your error since yesterday without success.

You can filter the log on the node-controller-rancher-machine pattern and give me only the corresponding lines.
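If the log has already been saved to a file, the same filtering can be sketched in Python. This is an illustration only: the regular expression is inferred from the sample lines in this thread (timestamp, level, the node-controller-rancher-machine tag, an optional machine name in parentheses), and the function name `machine_lines` is hypothetical.

```python
import re

# Matches lines such as:
# 2021/12/24 06:04:45 [INFO] [node-controller-rancher-machine] Launching plugin server for driver nutanix
# The "(machine)" part is optional because some lines do not carry one.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] "
    r"\[node-controller-rancher-machine\] "
    r"(?:\((?P<machine>[^)]*)\) )?"
    r"(?P<msg>.*)$"
)


def machine_lines(log_text):
    """Return (timestamp, level, machine, message) for each driver-related line."""
    rows = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append(
                (match["ts"], match["level"], match["machine"] or "", match["msg"])
            )
    return rows


sample = "\n".join([
    "2021/12/23 21:10:21 [INFO] Generating and uploading node config worker3",
    "2021/12/24 06:04:45 [INFO] [node-controller-rancher-machine] Launching plugin server for driver nutanix",
])
for row in machine_lines(sample):
    print(row)
```

Lines without the tag, such as the node config messages, are dropped, so only the exchange between Rancher and the driver remains.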

tuxtof commented 2 years ago

I just validated the command on a Rancher Helm install and I get the correct log:

kubectl -n cattle-system get pods -l app=rancher --no-headers -o custom-columns=name:.metadata.name | while read rancherpod; do kubectl -n cattle-system exec $rancherpod -c rancher -- loglevel --set trace ; done
kubectl -n cattle-system logs -f -l app=rancher -c rancher 2>&1 | grep node-controller-rancher-machine

and I have all the expected log lines:

2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Docker Machine Version:  v0.15.0-rancher73, build 7766c706
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Found binary path at /var/lib/rancher/management-state/bin/docker-machine-driver-nutanix
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Docker Machine Version:  v0.15.0-rancher73, build 7766c706
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Found binary path at /var/lib/rancher/management-state/bin/docker-machine-driver-nutanix
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Launching plugin server for driver nutanix
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Launching plugin server for driver nutanix
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Plugin server listening at address 127.0.0.1:35779
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Plugin server listening at address 127.0.0.1:37349
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .GetVersion
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Using API Version  1
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .GetVersion
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .SetConfigRaw
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Using API Version  1
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .GetMachineName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .SetConfigRaw
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (flag-lookup) Calling .GetMachineName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .GetMachineName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (flag-lookup) Calling .DriverName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (flag-lookup) Calling .GetMachineName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (flag-lookup) Calling .GetCreateFlags
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (flag-lookup) Calling .DriverName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Found binary path at /var/lib/rancher/management-state/bin/docker-machine-driver-nutanix
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (flag-lookup) Calling .GetCreateFlags
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Launching plugin server for driver nutanix
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Found binary path at /var/lib/rancher/management-state/bin/docker-machine-driver-nutanix
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Launching plugin server for driver nutanix
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Plugin server listening at address 127.0.0.1:35485
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Plugin server listening at address 127.0.0.1:34037
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .GetVersion
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .GetVersion
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Using API Version  1
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Using API Version  1
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .SetConfigRaw
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .SetConfigRaw
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .GetMachineName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] () Calling .GetMachineName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-w1) Calling .GetMachineName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-w1) Calling .DriverName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-m1) Calling .GetMachineName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-w1) Calling .GetCreateFlags
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-m1) Calling .DriverName
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-w1) Calling .GetCreateFlags
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-m1) Calling .GetCreateFlags
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-w1) Calling .SetConfigFromFlags
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-m1) Calling .GetCreateFlags
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Creating CA: /management-state/node/nodes/ze3-w1/certs/ca.pem
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-m1) Calling .SetConfigFromFlags
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Creating client certificate: /management-state/node/nodes/ze3-w1/certs/cert.pem
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Creating CA: /management-state/node/nodes/ze3-m1/certs/ca.pem
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Creating client certificate: /management-state/node/nodes/ze3-m1/certs/cert.pem
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Running pre-create checks...
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-w1) Calling .PreCreateCheck
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-w1) Calling .GetConfigRaw
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] Creating machine...
2021/12/24 07:43:10 [INFO] [node-controller-rancher-machine] (ze3-w1) Calling .Create
2021/12/24 07:43:10 [TRACE] [node-controller-rancher-machine] (ze3-w1) DBG | time="2021-12-24T07:43:10Z" level=info msg="Connecting on: pc.nutanix.com:9440"

eyanez111 commented 2 years ago

I got the logs. I am putting them here: log2.txt

thanks

eyanez111 commented 2 years ago

I left it running for longer in case you needed more info: log3.txt

thanks

tuxtof commented 2 years ago

Hi, I took a quick look between oyster and salmon 😎. The issue seems to come from the subnet name: its complexity breaks the search filter.

As a temporary fix, can you check with a simple subnet name?

eyanez111 commented 2 years ago

Sorry, I am not familiar with subnets. Do I need to check with the networking team to provide a simple name for the subnet we are in?

thanks

tuxtof commented 2 years ago

Don't worry, I will reproduce it and try to bring a fix soon.

Happy Christmas 🎄

tuxtof commented 2 years ago

Hello @eyanez111

Santa Claus 🎅 has just passed by and put a new release (v3.0.1) under the Christmas tree 🎄. It should fix your issue.

I'll let you test it and get back to me.

🎄🎄🎄 !!Merry Christmas !!! 🎄🎄🎄

eyanez111 commented 2 years ago

Thank you so much!! So I just have to delete the driver and add:

Download URL: https://github.com/nutanix/docker-machine/releases/download/v3.0.1/docker-machine-driver-nutanix_v3.0.0_linux
Custom UI URL: https://nutanix.github.io/rancher-ui-driver/v3.0.1/component.js
Whitelist Domains: nutanix.github.io

or am I missing anything?

Merry Christmas!

tuxtof commented 2 years ago

No need to delete; just update the driver and change the download URL. Be careful: 3.0.1 appears twice in the URL.

https://github.com/nutanix/docker-machine/releases/download/v3.0.1/docker-machine-driver-nutanix_v3.0.1_linux

The UI doesn't change and stays at 3.0.0.
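Because the version appears in two places in the download URL, a small check can catch a mismatch before the driver is updated. This is a sketch in Python that assumes the URL pattern seen in this thread holds for other releases too; the function names are hypothetical.

```python
import re

BASE = "https://github.com/nutanix/docker-machine/releases/download"


def driver_url(version):
    """Build the download URL with the same version in both places."""
    return f"{BASE}/v{version}/docker-machine-driver-nutanix_v{version}_linux"


def url_is_consistent(url):
    """Return True when the URL contains exactly two identical x.y.z versions."""
    found = re.findall(r"v(\d+\.\d+\.\d+)", url)
    return len(found) == 2 and found[0] == found[1]


print(driver_url("3.0.1"))
# The URL with v3.0.1 in the path but v3.0.0 in the file name is rejected.
print(url_is_consistent(
    f"{BASE}/v3.0.1/docker-machine-driver-nutanix_v3.0.0_linux"
))
```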

Cheers 🥂

eyanez111 commented 2 years ago

Hello @tuxtof, thanks for the gift and hope you had nice holidays! I tried our dev cluster and it worked! Now I just tried on our prod cluster and got a different error: Notifying bugsnag: [Error creating machine: Error in driver during machine creation: error: {

Screenshot 2022-01-04 121011

I used the same template, just pointed at a different cluster, so I only changed the Management Endpoint and the Cluster.

The rest is still the same. I am attaching the logs: logs-nutanix.txt

Thanks for all the help. It worked on DEV!

eyanez111 commented 2 years ago

Ah! I kept playing with it, and it looks like there is a problem with the Additional Disk Size and the Storage Container. What are the limitations if I leave those blank?

Thanks, Francisco

tuxtof commented 2 years ago

Hello @eyanez111, Happy New Year!

The problem comes from how you specify the storage container for the additional disk. You need to give the UUID of the storage container, not its name.

The additional disk is not mandatory and there is no specific limitation; it is just for people who want it.
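The rule can be sketched as a pre-flight check on the `nutanixConfig` part of a node template. The field names `diskSize` and `storageContainer` come from the template posted earlier in this thread. The function names are hypothetical, and treating a `diskSize` of "0" or an empty value as "no additional disk" is an assumption, not documented driver behaviour.

```python
import uuid


def is_canonical_uuid(value):
    """Return True for strings like 4cf59fe2-bb41-4ded-99d7-fb11e527e0f2."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def check_additional_disk(config):
    """Return a list of problems found in the additional-disk fields."""
    problems = []
    # Assumption: "0", "" or a missing value means no additional disk.
    disk_size = int(config.get("diskSize") or 0)
    container = config.get("storageContainer", "")
    if disk_size > 0 and not is_canonical_uuid(container):
        problems.append(
            f"storageContainer {container!r} is not a UUID; "
            "use the storage container UUID, not its name"
        )
    return problems


# A container given by name is only flagged when an additional disk is requested.
print(check_additional_disk({"diskSize": "0", "storageContainer": "VM"}))
print(check_additional_disk({"diskSize": "100", "storageContainer": "VM"}))
```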

Best Regards