rancher / quickstart

GCP: Stuck provisioning quickstart-gcp-custom #190

Closed wofr closed 2 years ago

wofr commented 2 years ago

I followed the instructions for the cloud quickstart on Google Cloud. The installation with Terraform went fine (resources successfully provisioned on GCP) and I could also open the Rancher API.

Unfortunately, the quickstart-gcp-custom cluster never leaves the provisioning state (see screenshot).

I'm also not able to add an existing cluster, which makes sense if etcd and the control plane are not available.
My assumption was that etcd would be brought up by the scripts, but maybe I got something wrong.
Any help welcome :)

bashofmann commented 2 years ago

I just tested this, and it works correctly for me. How long did you wait? After terraform apply finishes, it takes a couple of minutes until provisioning of the cluster is complete. Are there maybe some restrictions in your GCP account that would restrict connectivity between the VM of the cluster and the Rancher API?

wofr commented 2 years ago

I waited for more than 40 minutes (as can be seen in the screenshot). Frankly speaking, I do not think I have any restrictions in place; at least I have never run into any so far when using helm or kubectl.
Nevertheless, I will give it another try to see if I can get more details.

wofr commented 2 years ago

Today I tried it again. The terraform scripts complete without any error. On GCP I see two nodes running, "quickstart-node" and "quickstart-rancher-server", so everything looks OK.

But in the UI it is still telling me "Waiting for etcd, controlplane and worker nodes to be registered".

bashofmann commented 2 years ago

Can you ssh into the quickstart-node node and check which Docker containers are running there (docker ps)? Normally there should be running containers for etcd, kubernetes-api etc., but there should be at least a rancher-agent container. If this exists, its logs may give more information on why the cluster is not being created.

wofr commented 2 years ago

It seems Docker is not running on the quickstart-node ("Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"). Moreover, I did not find a rancher-agent running on the quickstart-node.

The good news is that the server is running fine. I was able to add an existing GKE cluster to the server, and this works like a charm.

bashofmann commented 2 years ago

Did you try running this with sudo? The Docker Daemon is likely only accessible as root.

On GCP Docker is enabled and the node is registered with a startup script: https://github.com/rancher/quickstart/blob/master/gcp/files/userdata_quickstart_node.template

Can you check the logs of the startup script (sudo journalctl -u google-startup-scripts.service)?

wofr commented 2 years ago

wolfgang_a_friedl@quickstart-rancher-server: sudo journalctl -u google-startup-scripts.service
-- Logs begin at Thu 2021-12-02 11:57:11 UTC, end at Fri 2021-12-03 09:14:03 UTC. --
Dec 02 11:58:04 quickstart-rancher-server systemd[1]: Starting Google Compute Engine Startup Scripts...
Dec 02 11:58:04 quickstart-rancher-server GCEMetadataScripts[2883]: 2021/12/02 11:58:04 GCEMetadataScripts: Starting startup scripts (version 20210414.00).
Dec 02 11:58:04 quickstart-rancher-server GCEMetadataScripts[2883]: 2021/12/02 11:58:04 GCEMetadataScripts: No startup scripts to run.
Dec 02 11:58:04 quickstart-rancher-server systemd[1]: google-startup-scripts.service: Succeeded.
Dec 02 11:58:04 quickstart-rancher-server systemd[1]: Finished Google Compute Engine Startup Scripts.
wolfgang_a_friedl@quickstart-rancher-server:~>

I used "sudo docker ps"; it still says no daemon is running.

bashofmann commented 2 years ago

That's weird. There definitely is a startup script: https://github.com/rancher/quickstart/blob/master/gcp/infra.tf#L129-L137

Just to double check: are you using the latest master? Which Terraform version and provider versions are you using?

For me:

~/dev/src/github.com/rancher/quickstart/gcp ❯ terraform version
Terraform v1.0.11
on darwin_amd64
+ provider registry.terraform.io/banzaicloud/k8s v0.8.2
+ provider registry.terraform.io/hashicorp/google v3.83.0
+ provider registry.terraform.io/hashicorp/helm v2.3.0
+ provider registry.terraform.io/hashicorp/local v2.1.0
+ provider registry.terraform.io/hashicorp/tls v3.1.0
+ provider registry.terraform.io/invidian/sshcommand v0.2.2
+ provider registry.terraform.io/rancher/rancher2 v1.17.2
+ provider registry.terraform.io/rancher/rke v1.2.2

wofr commented 2 years ago

PS C:\Repo\Rancher\gcp> terraform version
Terraform v1.0.11
on windows_amd64
+ provider registry.terraform.io/banzaicloud/k8s v0.8.2
+ provider registry.terraform.io/hashicorp/google v3.83.0
+ provider registry.terraform.io/hashicorp/helm v2.3.0
+ provider registry.terraform.io/hashicorp/local v2.1.0
+ provider registry.terraform.io/hashicorp/tls v3.1.0
+ provider registry.terraform.io/invidian/sshcommand v0.2.2
+ provider registry.terraform.io/rancher/rancher2 v1.17.2
+ provider registry.terraform.io/rancher/rke v1.2.2

I used the latest master, and I can confirm that the "metadata_startup_script" is also part of the infra.tf I used. Could the problem be that I used Windows for the installation? Maybe Terraform skipped something and I didn't see it in the logs.

bashofmann commented 2 years ago

Maybe it's a forward slash/backward slash problem in the file path reference to the script. There was an issue like this here https://github.com/hashicorp/terraform/issues/14986, though that one seems to be fixed.

Could you try changing https://github.com/rancher/quickstart/blob/master/gcp/infra.tf#L130 to a hardcoded path that points to the file instead of using path.module? Alternatively, you could try running terraform in WSL.

wofr commented 2 years ago

I have now found time to try it with a hard-coded path in infra.tf. This time the script ended up on the quickstart-node, but Docker was not started. I deleted the whole VM, expecting it to be recreated by running terraform apply again, but now I run into several issues regarding missing permissions to create the quickstart-node again.

To keep the story short, I think the issue is related to the path to the script file in infra.tf and how Terraform works on Windows.

bashofmann commented 2 years ago

I finally had some time today to reproduce this issue on Windows as well. It turns out it's not because of directory path separators, but because of line endings. By default, a git clone on Windows sets all line endings to CRLF, even though they are LF in the repository. Because of this, the GCP metadata startup script also contains CRLF line endings, which do not work on GCP (https://github.com/hashicorp/terraform/issues/17005).

Enforcing LF line endings through a git attribute also on Windows fixes this issue: https://github.com/rancher/quickstart/commit/8f427b2ec4447ab09f17b1212ef262d2b0f44d72.

When the repository is downloaded as a ZIP archive from GitHub, the line endings are also kept correctly as LF.
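
For anyone who wants to check an existing Windows checkout before running terraform apply, the following is a minimal Python sketch, not part of the quickstart repository. It counts CRLF line endings in the startup script template and rewrites the file with LF endings. The helper names (count_crlf, to_lf, normalize_file) are hypothetical, and the script assumes it is run from the repository root so that the template path from the link above resolves.

```python
from pathlib import Path


def count_crlf(data: bytes) -> int:
    """Return the number of CRLF line endings in data."""
    return data.count(b"\r\n")


def to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving existing LF endings untouched."""
    return data.replace(b"\r\n", b"\n")


def normalize_file(path: Path) -> int:
    """Rewrite path with LF endings; return how many CRLF endings were replaced."""
    data = path.read_bytes()
    crlf = count_crlf(data)
    if crlf:
        path.write_bytes(to_lf(data))
    return crlf


if __name__ == "__main__":
    # Assumption: run from the root of the quickstart checkout.
    script = Path("gcp/files/userdata_quickstart_node.template")
    if script.exists():
        print(f"{script}: replaced {normalize_file(script)} CRLF line ending(s)")
    else:
        print(f"{script} not found; run this from the repository root")
```

Working on bytes rather than text avoids Python's own newline translation on Windows, which would otherwise hide the CRLF endings. This only repairs the working copy; the git attribute in the commit above is what keeps new clones correct.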