rancher / rancher

Complete container management platform
http://rancher.com
Apache License 2.0

Not possible to deploy RKE2 clusters to Harvester #38147

Closed: pandalec closed 2 years ago

pandalec commented 2 years ago

Rancher Server Setup

Information about the Cluster

User Information

Describe the bug

To Reproduce

Result

Expected Result: If an RKE1 deployment succeeds, an RKE2 deployment should work too.

Screenshots

Additional context: Because of this error I switched to a real certificate issued by Let's Encrypt, with the same behavior. The machine and DNS are available inside the network. Machines are not being created on Harvester.

admin@rancher:~$ wget https://rancher.*/assets/docker-machine-driver-harvester
--2022-06-30 15:26:55--  https://rancher.*/assets/docker-machine-driver-harvester
Resolving rancher.* (rancher.*)... 10.X.X.X
Connecting to rancher.* (rancher.*)|10.X.X.X|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38760448 (37M) [application/octet-stream]
Saving to: ‘docker-machine-driver-harvester’

docker-machine-driver-harvester   100%[===================>]  36.96M  --.-KB/s    in 0.04s

2022-06-30 15:26:55 (878 MB/s) - ‘docker-machine-driver-harvester’ saved [38760448/38760448]

admin@rancher:~$

guangbochen commented 2 years ago

Hi @parsifallo, are you using the master-head version? v2.6.5 works fine for me. We only see this issue with the master-head version, because the backend API has changed the cloud-provider part, and it will be resolved with the upcoming UI changes.

pandalec commented 2 years ago

Hi @guangbochen! I am using the Docker image "rancher/rancher:latest", but I see there's a new version. I will give it a try now.

guangbochen commented 2 years ago

The issue still exists in the latest version, but your original issue description mentioned Rancher version v2.6.5. I just want to make sure it wasn't actually Rancher v2.6.5. Thanks.

pandalec commented 2 years ago

Strange, I copied the version number from the lower left of the web GUI. So I explicitly tested the Docker image rancher/rancher:v2.6.6. I can add the Harvester cluster (shown as active), but when I try to create a cluster, Rancher shows this error message during creation: clusters.management.cattle.io "c-m-mpm4rd4l" not found. v2.6.5 behaves the same way. I then started a fresh Rancher installation with v2.6.5 and added Harvester, but I get the same errors:

2022/07/01 07:23:57 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:23:57 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:23:57 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:23:57 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:23:57 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:23:57 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:23:59 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:24:05 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:20 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:20 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:20 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:20 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [INFO] [MachineProvision] Failed to create infrastructure fleet-default/cluster-pool1-303e2776-9dcqh for machine cluster-pool1-6568745bcb-w2txf, deleting and recreating...
2022/07/01 07:26:21 [INFO] [MachineProvision] Failed to create infrastructure fleet-default/cluster-pool1-303e2776-b6s6v for machine cluster-pool1-6568745bcb-4ph9s, deleting and recreating...
2022/07/01 07:26:21 [INFO] [MachineProvision] Failed to create infrastructure fleet-default/cluster-pool1-303e2776-vmxsq for machine cluster-pool1-6568745bcb-8wlb7, deleting and recreating...
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-sqwqwv7m: ClusterUnavailable 503: cluster not found, requeuing
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-sqwqwv7m: ClusterUnavailable 503: cluster not found, requeuing
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-sqwqwv7m: ClusterUnavailable 503: cluster not found, requeuing
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-sqwqwv7m: ClusterUnavailable 503: cluster not found, requeuing
2022/07/01 07:26:21 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-sqwqwv7m: ClusterUnavailable 503: cluster not found, requeuing
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:21 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:22 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-sqwqwv7m: ClusterUnavailable 503: cluster not found, requeuing
2022/07/01 07:26:22 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-sqwqwv7m: ClusterUnavailable 503: cluster not found, requeuing
2022/07/01 07:26:23 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-sqwqwv7m: ClusterUnavailable 503: cluster not found, requeuing
2022/07/01 07:26:26 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-sqwqwv7m: ClusterUnavailable 503: cluster not found, requeuing
2022/07/01 07:26:31 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-sqwqwv7m: ClusterUnavailable 503: cluster not found, requeuing
2022/07/01 07:26:35 [INFO] [planner] rkecluster fleet-default/cluster: waiting: waiting for viable init node
2022/07/01 07:26:41 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-sqwqwv7m: ClusterUnavailable 503: cluster not found, requeuing
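
A log like the one above is easier to read once the repeated lines are collapsed. The following is a minimal sketch, not part of Rancher: `summarize_log` is a hypothetical helper name, and it assumes every line starts with a `YYYY/MM/DD HH:MM:SS ` timestamp as in the excerpt.

```shell
# summarize_log (hypothetical helper): read log lines on stdin, drop the
# leading "YYYY/MM/DD HH:MM:SS " timestamp, and print each distinct
# message with the number of times it occurred, most frequent first.
summarize_log() {
    sed -E 's#^[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} ##' \
        | sort \
        | uniq -c \
        | sort -rn
}
```

It could be fed from the container log, for example `docker logs <container> 2>&1 | summarize_log`, where `<container>` stands for whatever the Rancher container is called locally.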

Gonna try some other versions

pandalec commented 2 years ago

Downgraded to 2.6.4: same error, but this time it is shown in the Rancher GUI:

provisioning bootstrap node(s) cluster1234-pool1-76b5d455d9-n8gtj: failed creating server (HarvesterMachine) in infrastructure provider: CreateError: Downloading driver from https://rancher.*/assets/docker-machine-driver-harvester
Doing /etc/rancher/ssl
ls: cannot access 'docker-machine-driver-*': No such file or directory
downloaded file failed sha256 checksum
download of driver from https://rancher.*/assets/docker-machine-driver-harvester failed, waiting for agent to check in and apply initial plan
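
The "failed sha256 checksum" line can be cross-checked by hand against a copy of the driver that was downloaded manually. This is a minimal sketch under assumptions: `verify_sha256` is a hypothetical helper, and the expected hash has to come from wherever your Rancher installation records it, which the output above does not show. A download that failed or produced an empty file would also fail this check, so a mismatch does not by itself prove the binary is wrong.

```shell
# verify_sha256 FILE EXPECTED_HEX (hypothetical helper)
# Succeeds (exit 0) only if the SHA-256 of FILE equals EXPECTED_HEX.
verify_sha256() {
    file="$1"
    expected="$2"
    # sha256sum prints "<hex>  <filename>"; keep only the hex digest.
    actual="$(sha256sum "$file" | awk '{print $1}')"
    [ "$actual" = "$expected" ]
}
```

Example use: `verify_sha256 docker-machine-driver-harvester "$EXPECTED" && echo match || echo mismatch`, with `EXPECTED` set to the hash you are comparing against.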

Deploying RKE1 still works

Edit: Same error on v2.6.7-rc1. I don't get it. If this were something like a network error, why am I able to deploy RKE1 clusters but not RKE2 clusters? Is it possible to change https://rancher.*/assets/docker-machine-driver-harvester to an internet endpoint that is reachable from Rancher?
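
One way to separate a network failure from a certificate failure is to run curl without `-k` and look at its exit code. The sketch below is only a diagnostic aid: `explain_curl_exit` and `probe_url` are hypothetical helper names, and only a handful of curl's documented exit codes are mapped.

```shell
# explain_curl_exit CODE: print a short label for a few curl exit codes.
explain_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "dns: could not resolve host" ;;
        7)  echo "network: could not connect" ;;
        28) echo "network: timed out" ;;
        35) echo "tls: handshake failed" ;;
        60) echo "tls: certificate not trusted" ;;
        *)  echo "other: curl exit code $1" ;;
    esac
}

# probe_url URL: fetch URL with certificate verification on (no -k),
# discard the body, and report what curl's exit code means.
probe_url() {
    rc=0
    curl -sS -o /dev/null "$1" || rc=$?
    explain_curl_exit "$rc"
}
```

Running `probe_url` against the driver URL from the environment that performs the provisioning, rather than from a workstation, is what makes the result meaningful, since trust stores and DNS can differ between the two.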

Screenshot 2022-07-01 at 11 09 30

From inside the rancher Docker container:

rancher:/var/lib/rancher # curl -k https://rancher.*/assets/docker-machine-driver-harvester --output docker-machine-driver-harvester
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 36.9M  100 36.9M    0     0   770M      0 --:--:-- --:--:-- --:--:--  770M
rancher:/var/lib/rancher #

pandalec commented 2 years ago

I guess I found (part of) the issue: I started Rancher with --network host instead of defining ports. But it still does not work with the current version; I needed to downgrade to 2.6.5 to get it to work.

IDerr commented 2 years ago

I seem to have the same issue, but with the OpenStack driver rather than Harvester, so perhaps it is something more global (also RKE2, Rancher 2.6.6).

2022/07/13 11:18:06 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-trbkdb5g: ClusterUnavailable 503: cluster not found, failed to start user controllers for cluster c-m-tctdhvdb: ClusterUnavailable 503: cluster not found, requeuing
2022/07/13 11:18:17 [INFO] [MachineProvision] Failed to create infrastructure fleet-default/test-rancher-3-pool1-15362df3-29wq4 for machine test-rancher-3-pool1-6c8fdfddf-4wqk4, deleting and recreating...
2022/07/13 11:18:17 [INFO] [MachineProvision] Failed to create infrastructure fleet-default/test-rancher-3-pool1-15362df3-29wq4 for machine test-rancher-3-pool1-6c8fdfddf-4wqk4, deleting and recreating...

github-actions[bot] commented 2 years ago

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

PAzter1101 commented 3 months ago

Hello! I encountered the same problem on Rancher v2.8.5, with Harvester as the driver. Did you manage to solve it?

blackwood821 commented 2 months ago

I'm also experiencing this when trying to create an RKE2 cluster using my custom node driver while running Rancher locally on Docker Desktop. It seems that download_driver.sh is failing because of the SSL cert (https://github.com/rancher/machine/blob/9183b3ff738e16ece4391a2e6bcc8ef88889e8ae/package/download_driver.sh#L15).

PAzter1101 commented 2 months ago

That didn't seem to help

Hello! As far as I know, Harvester ignores the absence of SSL. I was able to solve this problem for myself like this: when installing Rancher, specify the IP address, not the domain name. I don't know exactly why, but when accessing the IP, Harvester manages to log in successfully.

blackwood821 commented 2 months ago

@PAzter1101 I'm actually not using Harvester, just my own custom node driver, but I can't figure out how to make download_driver.sh happy with the default self-signed SSL cert when running Rancher locally on Docker Desktop. I would patch the script and add -k just for local testing, but I don't know where I can do that, since it spins up a new container for the provisioning each time.
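
For local testing only, the idea of adding `-k` could be made opt-in rather than hard-coded. The sketch below does not show how download_driver.sh works and does not answer where such a patch would have to be applied. `curl_tls_args`, `ALLOW_INSECURE_TLS` and `CA_BUNDLE` are hypothetical names, not Rancher settings; only the `--insecure` and `--cacert` curl flags are real.

```shell
# curl_tls_args (hypothetical helper): print extra curl flags for a download.
#   ALLOW_INSECURE_TLS=true -> skip certificate verification (local testing only)
#   CA_BUNDLE=/path/ca.pem  -> trust that CA file instead (path without spaces)
curl_tls_args() {
    if [ "${ALLOW_INSECURE_TLS:-false}" = "true" ]; then
        printf '%s\n' "--insecure"
    elif [ -n "${CA_BUNDLE:-}" ]; then
        printf '%s\n' "--cacert ${CA_BUNDLE}"
    fi
}
```

A caller would splice the flags in unquoted so they split into separate arguments, for example `curl -sSfL $(curl_tls_args) -o driver "$url"`. Pointing `CA_BUNDLE` at the self-signed certificate keeps verification on, which is usually preferable to `--insecure` where it is possible.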