Closed kgeipel-retail7 closed 9 months ago
Hello @kgeipel-retail7 thank you for submitting this bug!
Can you provide me with the values you used for the template such as OS, plan, region, userdata, etc?
Hey @happytreees see the content of the attached file: vultrMasterNodeTemplate.json
"vultrConfig":{ "apiKey":"<OUR_API_KEY>", "appId":"0", "cloudInitUserData":"", "ddosProtection":false, "enableVpc":false, "enabledIpv6":false, "firewallGroupId":"", "floatingIpv4Id":"", "imageId":"", "ipxeChainUrl":"", "isoId":"", "osId":"1743", "region":"fra", "sendActivationEmail":false, "snapshotId":"", "startupScriptId":"", "tags":null, "vpcIds":null, "vpsBackups":false, "vpsPlan":"vhp-4c-8gb-amd" }
I've done some testing on it and there doesn't appear to be any issue directly with the driver itself. I was able to reproduce this issue when the newly created RKE cluster's agent could not contact the primary Rancher cluster because of a firewall.
If you are putting any of these resources behind a firewall please ensure that they are all able to speak with each other.
Additionally, it would be helpful to pull the logs from the rancher agent on the new instance. You can generally find the agent container with `docker ps` and then pull the logs with `docker logs <container-id>`.
Additionally, I recommend using the Vultr Rancher UI as it will ensure that all of the default values are correct: https://github.com/vultr/rancher-ui-driver-vultr
Hey @happytreees thanks for the fast analysis. Yes, I suspect it's a firewall issue, but our Rancher installation has no restrictions on outgoing traffic.
As Evan V. pointed out in ticket #RCS-91QFV, there seems to be a general access limitation on the Vultr compute resources: "I suspect this is because we by default enforce firewall and only allow port 22."
But if Vultr offers compute resources AND a docker-machine driver that is documented to work with Rancher, then the available OS images should be prepared to allow access for the Rancher resources as well; otherwise the driver is of little use.
Or are we expected to configure firewall group rules in the Vultr Management Console? Nothing is configured there by default, so I assumed no firewall was active at all, which is how other cloud providers handle it.
But section "Required Ports" in the docker-machine drivers Readme says that the firewall is disabled by default by the cloud-init-script of the driver.
Hey @happytreees I have now added an init script that configures UFW at the OS level to the Vultr cloud console and linked it in my node templates.
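For context, a minimal cloud-init sketch of such a script could look like the following. The port list is an assumption on my side, based on the ports mentioned in this thread (22 for SSH provisioning, 2376 for the docker-machine TLS endpoint) plus the usual Rancher/Kubernetes ports; the driver's "Required Ports" README section is the authoritative list:

```yaml
#cloud-config
runcmd:
  - ufw allow 22/tcp      # SSH (docker-machine provisioning)
  - ufw allow 2376/tcp    # docker-machine TLS endpoint
  - ufw allow 80/tcp      # Rancher agent -> server (HTTP)
  - ufw allow 443/tcp     # Rancher agent -> server (HTTPS)
  - ufw allow 6443/tcp    # Kubernetes API server
  - ufw --force enable
```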
It seems that the default cloud init config is not applied correctly.
So the initial issue doesn't appear anymore, but I now have another one:
Rancher UI:
```
Ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain : exit status 1
```
Rancher Server Log:
```
2023/11/16 09:31:33 [INFO] Generating and uploading node config vultr-fra-dev-rt7-01-master2
2023/11/16 09:31:33 [DEBUG] [GenericEncryptedStore]: set secret called for mc-m-dlwnj
2023/11/16 09:31:33 [DEBUG] [GenericEncryptedStore]: updating secret mc-m-dlwnj
2023/11/16 09:31:33 [DEBUG] getNodeTemplate parsed [cattle-global-nt:nt-8vlpj] to ns: [cattle-global-nt] and n: [nt-8vlpj]
2023/11/16 09:31:33 [DEBUG] Cleaning up [/opt/jail/c-hb7hh/management-state/node/nodes/vultr-fra-dev-rt7-01-master2]
2023/11/16 09:31:33 [ERROR] error syncing 'c-hb7hh/m-dlwnj': handler node-controller: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain: exit status 1, requeuing
2023/11/16 09:31:33 [DEBUG] [nodepool] bad node found: m-dlwnj
```
Is there anything else that must be configured in the OS to be able to join the compute resources to an RKE cluster? An SSH key is created during provisioning: I can see it in the Vultr cloud console under the "Account" - "SSH Keys" menu, and there is a key in `~/.ssh/authorized_keys` on the provisioned node.
Hello @kgeipel-retail7
It does appear that the default script is, for some reason, not being applied. I am unsure why that is, but I will look into it.
For reference, we do have information regarding the ports here: https://github.com/vultr/docker-machine-driver-vultr#required-ports
That error looks like a basic SSH authentication issue. I haven't seen that error myself, but I will try to find out more on my side. Can you share the outcome if you use the default userdata script?
```
I2Nsb3VkLWNvbmZpZwoKcnVuY21kOgogLSB1ZncgZGlzYWJsZQ==
```
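Decoded, that base64 value is just a tiny cloud-config that disables UFW:

```shell
# Decode the driver's default userdata (base64 value taken from this thread)
echo 'I2Nsb3VkLWNvbmZpZwoKcnVuY21kOgogLSB1ZncgZGlzYWJsZQ==' | base64 -d
# prints:
# #cloud-config
#
# runcmd:
#  - ufw disable
```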
Hey @happytreees the default userdata script value also works; I already tried it yesterday, just to be sure there was no issue with my port configuration.
Thanks for digging deeper into it; let me know if you find something or need further information from my side.
Hey @happytreees have you been able to take a look at that SSH issue yet?
This issue will be closed; it is not caused by the driver. The root cause is the Ubuntu 22.04 LTS image: something appears to have changed in the SSH public-key authentication method. We observed the same behavior on another cloud provider, and provisioning with Ubuntu 20.04 LTS works out of the box.
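A plausible explanation, not confirmed in this thread and therefore an assumption: Ubuntu 22.04 ships OpenSSH 8.9, and since OpenSSH 8.8 the server rejects signatures made with the legacy `ssh-rsa` (SHA-1) algorithm by default. A client that can only sign with `ssh-rsa` for the RSA keys it generates would then fail with exactly this "no supported methods remain" error, while Ubuntu 20.04 (OpenSSH 8.2) still accepts it. If that is the cause, re-enabling the algorithm on the node is a possible workaround, at the cost of accepting SHA-1 signatures:

```
# Hypothetical drop-in, e.g. /etc/ssh/sshd_config.d/ssh-rsa-compat.conf
# Accept SHA-1 ssh-rsa public-key signatures again (security trade-off)
PubkeyAcceptedAlgorithms +ssh-rsa
```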
Describe the bug
Node provisioning using Rancher RKE and the Vultr docker-machine driver fails with a certificate validation error.
To Reproduce
Try to spin up a cluster using Vultr as provider (one node acting as etcd, control plane and worker is enough for testing)
--> Compute resource gets created, the OS is successfully installed, and an SSH connection is established to install the Docker runtime, but it fails afterward while validating the previously copied certs
--> Rancher UI shows error:
```
Error checking the host: Error checking and/or regenerating the certs: There was an error validating certificates for host "80.240.20.22:2376": dial tcp 80.240.20.22:2376: i/o timeout
```
--> Rancher Server Log shows error:
```
2023/11/13 08:51:52 [INFO] [node-controller-rancher-machine] The default lines below are for a sh/bash shell, you can specify the shell you're using, with the --shell flag.
2023/11/13 08:51:52 [INFO] [node-controller-rancher-machine]
2023/11/13 08:51:52 [INFO] Generating and uploading node config vultr-fra-dev-rt7-01-wrk1
2023/11/13 08:51:52 [DEBUG] getNodeTemplate parsed [cattle-global-nt:nt-vq94x] to ns: [cattle-global-nt] and n: [nt-vq94x]
2023/11/13 08:51:52 [DEBUG] Cleaning up [/opt/jail/c-rg7nj/management-state/node/nodes/vultr-fra-dev-rt7-01-wrk1]
2023/11/13 08:51:52 [ERROR] error syncing 'c-rg7nj/m-twnxp': handler node-controller: Error creating machine: Error checking the host: Error checking and/or regenerating the certs: There was an error validating certificates for host "209.250.239.253:2376": dial tcp 209.250.239.253:2376: i/o timeout, requeuing
2023/11/13 08:51:52 [DEBUG] [nodepool] bad node found: m-twnxp
```
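The `dial tcp ...:2376: i/o timeout` points at port 2376 (docker-machine's TLS endpoint) being unreachable from the Rancher server. A quick way to probe this from the Rancher host, using bash's `/dev/tcp` (the `check_port` helper name is my own):

```shell
# Return 0 if a TCP connection to $1:$2 succeeds within $3 seconds (default 5)
check_port() {
  timeout "${3:-5}" bash -c ">/dev/tcp/$1/$2" 2>/dev/null
}

# Example using the IP from the log above; prints "closed/filtered"
# if a firewall drops traffic to port 2376
check_port 80.240.20.22 2376 5 && echo open || echo "closed/filtered"
```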
Used Environment:
Additional context
Vultr ticket number: #RCS-91QFV --> requested to open an issue in this project
vultrMasterNodeTemplate.json vultrRKETemplateRancher259.json vultrRKETemplateRancher271.json