ionos-cloud / docker-machine-driver

IONOS Cloud Docker Machine Driver
Apache License 2.0

Connect to Docker Daemon failed #56

Closed mueller-tobias closed 1 year ago

mueller-tobias commented 1 year ago

Description

Creating a Kubernetes cluster in a public LAN with default settings (SSH User = root) and no cloud-init fails. The datacenter and the VM are created successfully without error. The VM has a public IP and the SSH port is open. When Rancher tries to connect to the Docker daemon, the workflow hangs with the errors below.

Expected behavior

A working single-node Kubernetes cluster is created.

Environment

Rancher Version:

2.6.9

Docker Machine Driver Ionos Cloud version:

6.1.0rc1

How to Reproduce

Create a node template with no DataCenter ID and no LAN ID, using the default settings for a datacenter in Frankfurt.

Error and Debug Output

[INFO ] Initiating Kubernetes cluster
--
3:06:06 pm | [INFO ] [dialer] Setup tunnel for host [212.227.151.232]
3:06:06 pm | [ERROR] Failed to set up SSH tunneling for host [212.227.151.232]: Can't retrieve Docker Info: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info": can not build dialer to [c-s74s7:m-q7jz9]
3:06:06 pm | [ERROR] Removing host [212.227.151.232] from node lists
3:06:06 pm | [ERROR] [state] can't fetch legacy cluster state from Kubernetes: Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [212.227.151.232]
3:06:07 pm | [INFO ] Successfully Deployed state file at [management-state/rke/rke-3775529018/cluster.rkestate]
3:06:07 pm | [INFO ] Building Kubernetes cluster
3:06:07 pm | [ERROR] Cluster must have at least one etcd plane host: please specify one or more etcd in cluster config
3:06:42 pm | [INFO ] Initiating Kubernetes cluster
3:06:42 pm | [INFO ] Successfully Deployed state file at [management-state/rke/rke-2037188006/cluster.rkestate]
3:06:42 pm | [INFO ] Building Kubernetes cluster
3:06:42 pm | [INFO ] [dialer] Setup tunnel for host [212.227.151.232]
3:06:42 pm | [ERROR] Failed to set up SSH tunneling for host [212.227.151.232]: Can't retrieve Docker Info: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info": can not build dialer to [c-s74s7:m-q7jz9]
3:06:42 pm | [ERROR] Removing host [212.227.151.232] from node lists
3:06:42 pm | [ERROR] Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [212.227.151.232]
3:07:47 pm | [INFO ] Initiating Kubernetes cluster
3:07:47 pm | [INFO ] Successfully Deployed state file at [management-state/rke/rke-2123286596/cluster.rkestate]
3:07:47 pm | [INFO ] Building Kubernetes cluster
3:07:47 pm | [INFO ] [dialer] Setup tunnel for host [212.227.151.232]
3:07:47 pm | [ERROR] Failed to set up SSH tunneling for host [212.227.151.232]: Can't retrieve Docker Info: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info": can not build dialer to [c-s74s7:m-q7jz9]
3:07:47 pm | [ERROR] Removing host [212.227.151.232] from node lists
3:07:47 pm | [ERROR] Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [212.227.151.232]
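The log above repeats the same few errors on every retry. As a rough aid for reading such output, here is a minimal Python sketch that counts distinct `[ERROR]` messages in a pasted provisioning log. The line format is inferred only from the excerpt above, and the names `LINE_RE` and `count_errors` are made up for illustration; this is not part of Rancher or the driver.

```python
import re
from collections import Counter

# Matches lines such as "3:06:06 pm | [ERROR] Removing host [...]".
# The timestamp prefix is optional because the first line above has none.
LINE_RE = re.compile(
    r"^\s*(?:\d{1,2}:\d{2}:\d{2} [ap]m \| )?\[(?P<level>[A-Z]+)\s*\] (?P<msg>.*)$"
)


def count_errors(log_text: str) -> Counter:
    """Count how often each distinct ERROR message appears in the log text."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("level") == "ERROR":
            counts[match.group("msg")] += 1
    return counts
```

Calling `count_errors(text).most_common()` on the pasted log lists the messages by frequency, which makes it easier to see that the SSH tunnel failure is the first error in each retry and the etcd errors follow from it.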
avirtopeanu-ionos commented 1 year ago

Hi, what is your node template configuration? You can view it as a JSON by clicking on the View in API button - but please make sure not to include your API Token or username and password. Also, are you using any custom Engine Options? (such as a custom Docker install URL or a custom Storage Driver)

Note that the ubuntu:22.04 image currently isn't supported, because id_rsa SSH keys aren't supported on that OS.
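For anyone sharing a node template as requested above, a small Python sketch like the following could mask credentials before pasting the JSON. The field names come from the `ionoscloudConfig` block of the template posted in this thread; treating exactly these four fields as sensitive is an assumption, and `redact` is a hypothetical helper, not something Rancher or the driver provides.

```python
import copy
import json

# Assumption: these ionoscloudConfig fields may hold secrets.
SENSITIVE_KEYS = ("token", "password", "username", "imagePassword")


def redact(template: dict) -> dict:
    """Return a copy of the template with non-empty sensitive values masked."""
    cleaned = copy.deepcopy(template)
    config = cleaned.get("ionoscloudConfig")
    if isinstance(config, dict):
        for key in SENSITIVE_KEYS:
            if config.get(key):
                config[key] = "****"
    return cleaned


def redact_json(text: str) -> str:
    """Parse a node template from JSON text and return it redacted."""
    return json.dumps(redact(json.loads(text)), indent=2)
```

Empty values are left as they are, so it stays visible which credential fields were actually filled in.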

mueller-tobias commented 1 year ago

I used the default ubuntu:20.04.

FYI, that's the template I used:

{
  "amazonec2Config": null,
  "annotations": {
    "ownerBindingsCreated": "true"
  },
  "baseType": "nodeTemplate",
  "cloudCredentialId": null,
  "created": "2022-12-14T13:37:41Z",
  "createdTS": 1671025061000,
  "creatorId": "user-8c6hp",
  "driver": "ionoscloud",
  "engineEnv": {},
  "engineInstallURL": "https://releases.rancher.com/install-docker/20.10.sh",
  "engineLabel": {},
  "engineOpt": {},
  "engineRegistryMirror": [],
  "id": "cattle-global-nt:nt-g6qjp",
  "ionoscloudConfig": {
    "cores": "4",
    "cpuFamily": "INTEL_SKYLAKE",
    "datacenterId": "",
    "diskSize": "50",
    "diskType": "HDD",
    "endpoint": "https://api.ionos.com/cloudapi/v6",
    "image": "ubuntu:20.04",
    "imagePassword": "abcde12345",
    "lanId": "",
    "location": "de/fra",
    "password": "",
    "ram": "2048",
    "serverAvailabilityZone": "AUTO",
    "sshUser": "root",
    "token": "****",
    "userData": "",
    "userDataB64": "",
    "username": "",
    "volumeAvailabilityZone": "AUTO"
  },
  "labels": {
    "cattle.io/creator": "norman"
  },
  "links": {
    "nodePools": "…/v3/nodePools?nodeTemplateId=cattle-global-nt%3Ant-g6qjp",
    "nodes": "…/v3/nodes?nodeTemplateId=cattle-global-nt%3Ant-g6qjp",
    "self": "…/v3/nodeTemplates/cattle-global-nt:nt-g6qjp",
    "update": "…/v3/nodeTemplates/cattle-global-nt:nt-g6qjp"
  },
  "logOpt": {},
  "name": "Test without Datacenter",
  "principalId": "local://user-8c6hp",
  "state": "active",
  "storageOpt": {},
  "transitioning": "no",
  "transitioningMessage": "",
  "type": "nodeTemplate",
  "useInternalIpAddress": true,
  "uuid": "fc50f8f3-85fd-497f-a98e-c2c6fb139239"
}
mrndev commented 1 year ago

Hi Tobias, can you tell me where to get the debug output (i.e. the listing above)? Do I need to list the logs for a specific container in the RKE worker cluster? Or in the master cluster? Do you know what is happening when the failure occurs, i.e. in which part the provisioning fails? Is the master trying to build an SSH tunnel to the worker cluster? Thanks

mrndev commented 1 year ago

Ok - I see that it's in the Provisioning log in the Rancher GUI. At least in my case it seemed to work. If it's any help, here is a snippet from my log:

1:59:28 pm | [INFO ] Initiating Kubernetes cluster
1:59:28 pm | [INFO ] [dialer] Setup tunnel for host [157.97.110.225]
1:59:37 pm | [INFO ] [state] Successfully started [cluster-state-deployer] container on host [157.97.110.225]
1:59:38 pm | [INFO ] Successfully Deployed state file at [management-state/rke/rke-315640628/cluster.rkestate]
1:59:38 pm | [INFO ] Building Kubernetes cluster
1:59:38 pm | [INFO ] [dialer] Setup tunnel for host [157.97.110.225]
1:59:38 pm | [INFO ] [network] Deploying port listener containers
1:59:39 pm | [INFO ] [network] Successfully started [rke-etcd-port-listener] container on host [157.97.110.225]
1:59:39 pm | [INFO ] [network] Successfully started [rke-cp-port-listener] container on host [157.97.110.225]
1:59:40 pm | [INFO ] [network] Successfully started [rke-worker-port-listener] container on host [157.97.110.225]
...

I have the Rancher master running on a K3s cluster in one data center, from where I provisioned the RKE cluster in another data center (public LAN). Could the error come from some proxy or security settings in the network where your Rancher master is running? If you shell into the Rancher master pod and try to SSH from there manually into the worker node, what happens? Just throwing out some ideas in the hope that it helps...
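As a first step before the manual SSH test suggested above, one could check whether the node's SSH port is reachable at all from the network where Rancher runs. The following is a minimal Python sketch under that assumption; `can_connect` is a hypothetical helper, and it only tests that a TCP connection opens, not that SSH authentication works.

```python
import socket


def can_connect(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

For example, `can_connect("212.227.151.232")` uses the node IP from the failing log. A `False` result would point at a firewall, proxy or NAT problem, while `True` means the cause has to be looked for elsewhere, for example in the SSH keys or the Docker daemon.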

mueller-tobias commented 1 year ago

Hi Martin, in my tests the Rancher master was running on a k3s cluster and I tried to deploy the cluster in another datacenter. Rancher is in a private LAN behind a NAT gateway. I'll run some tests to see whether I can connect via SSH from the Rancher VM to the downstream cluster VM.

mueller-tobias commented 1 year ago

For the SSH tests I tried to add my SSH key via the cloud-config. But when I add a cloud-config, the driver fails to create the server. The cloud-config is a working one from another cluster that I used for some tests in our lab. That other cluster also uses an ubuntu-20.04 cloud image.

#cloud-config
ca-certs:
  trusted:
    - |
      -----BEGIN CERTIFICATE-----
      MIIBqTCCAU6gAwIBAgIRAIEufsGXTRyUC4tsIW398SMwCgYIKoZIzj0EAwIwMjET
      MBEGA1UEChMKRGV2T3BzIExhYjEbMBkGA1UEAxMSRGV2T3BzIExhYiBSb290IENB
      MB4XDTIyMDExNTA4MzAwMVoXDTMyMDExMzA4MzAwMVowMjETMBEGA1UEChMKRGV2
      T3BzIExhYjEbMBkGA1UEAxMSRGV2T3BzIExhYiBSb290IENBMFkwEwYHKoZIzj0C
      AQYIKoZIzj0DAQcDQgAE3CcGpgd5/jMDt42nOB98DVoppAdZ1vY0Us2WrtQ7nv5s
      iZenDiImG9TdceR3P7a2wvnhUAmiBiZzT0yx/mlcwqNFMEMwDgYDVR0PAQH/BAQD
      AgEGMBIGA1UdEwEB/wQIMAYBAf8CAQEwHQYDVR0OBBYEFFQ89/6jz4Qi4T59BHYC
      qJljaNTqMAoGCCqGSM49BAMCA0kAMEYCIQDI5Zsng3vQTJQm3TiNtFClS+xcIIYz
      BASuCGiG6LmZ7wIhAJXHwrPpXjEV8B4ML0QX3IwIh3cvA+iLXoHAtvolF5+0
      -----END CERTIFICATE-----
groups:
  - docker
manage_etc_hosts: true
runcmd:
  - - sysctl
    - '-p'
users:
  - groups: 'docker, sudo'
    name: ubuntu
    ssh-authorized-keys:
      - >
        ssh-rsa
        ******
        deployment-key
    sudo:
      - 'ALL=(ALL) NOPASSWD:ALL'
write_files:
  - content: |
      Acquire::ForceIPv4 "true";
    path: /etc/apt/apt.conf.d/99disable-ipv6
  - content: |
      Acquire::ForceIPv4 "true";
    path: /etc/apt/apt.conf.d/99disable-ipv6
  - content: |
      net.ipv6.conf.all.disable_ipv6 = 1
      net.ipv6.conf.default.disable_ipv6 = 1
      net.ipv6.conf.lo.disable_ipv6 = 1
    path: /etc/sysctl.d/99-sysctl.conf
packageUpdate: true
packages:
  - nfs-common

This is the log output from the Rancher container:

2022/12/21 06:47:01 [INFO] [node-controller] Provisioning node ionos-k8s1
2022/12/21 06:47:01 [INFO] [node-controller] Creating CA: /management-state/node/nodes/ionos-k8s1/certs/ca.pem
2022/12/21 06:47:02 [INFO] [node-controller] Creating client certificate: /management-state/node/nodes/ionos-k8s1/certs/cert.pem
2022/12/21 06:47:02 [INFO] [node-controller] Running pre-create checks...
2022/12/21 06:47:02 [INFO] [node-controller] (ionos-k8s1) IONOS Cloud Driver Version: 6.1.0-rc.1
2022/12/21 06:47:02 [INFO] [node-controller] (ionos-k8s1) SDK-GO Version: 6.1.3
2022/12/21 06:47:02 [INFO] [node-controller] Creating machine...
2022/12/21 06:47:03 [INFO] [node-controller] (ionos-k8s1) Creating SSH key...
2022/12/21 06:47:03 [INFO] [node-controller] (ionos-k8s1) Using user data: users:
2022/12/21 06:47:03 [INFO] [node-controller] (ionos-k8s1)   - groups: 'docker, sudo'
2022/12/21 06:47:03 [INFO] [node-controller] (ionos-k8s1)     name: tobias
2022/12/21 06:47:03 [INFO] [node-controller] (ionos-k8s1)     ssh-authorized-keys:
2022/12/21 06:47:03 [INFO] [node-controller] (ionos-k8s1)       - >
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)         ssh-rsa
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)         ******
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)         deployment-key
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)     sudo:
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)       - 'ALL=(ALL) NOPASSWD:ALL'
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1) write_files:
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)   - content: |
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)       Acquire::ForceIPv4 "true";
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)     path: /etc/apt/apt.conf.d/99disable-ipv6
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)   - content: |
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)       Acquire::ForceIPv4 "true";
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)     path: /etc/apt/apt.conf.d/99disable-ipv6
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)   - content: |
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)       net.ipv6.conf.all.disable_ipv6 = 1
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)       net.ipv6.conf.default.disable_ipv6 = 1
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)       net.ipv6.conf.lo.disable_ipv6 = 1
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)     path: /etc/sysctl.d/99-sysctl.conf
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1) packageUpdate: true
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1) packages:
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1)   - nfs-common
2022/12/21 06:47:04 [INFO] [node-controller] (ionos-k8s1) DataCenter Created
2022/12/21 06:47:14 [INFO] [node-controller] (ionos-k8s1) LAN Created
2022/12/21 06:47:22 [INFO] [node-controller] (ionos-k8s2) NIC Deleted
2022/12/21 06:47:24 [INFO] [node-controller] (ionos-k8s1) Server Created
2022/12/21 06:47:35 [INFO] [node-controller] (ionos-k8s1) Image Alias: ubuntu:20.04
2022/12/21 06:47:35 [INFO] [node-controller] (ionos-k8s1) WARNING: Error creating machine. Rolling back...
2022/12/21 06:47:35 [INFO] [node-controller] (ionos-k8s1) NOTICE: Please check IONOS Cloud Console/CLI to ensure there are no leftover resources.
2022/12/21 06:47:35 [INFO] [node-controller] (ionos-k8s1) Starting deleting resources...
2022/12/21 06:47:42 [INFO] [node-controller] (ionos-k8s2) Volume Deleted
2022/12/21 06:47:45 [INFO] [node-controller] (ionos-k8s1) Server Deleted
2022/12/21 06:47:53 [INFO] [node-controller] (ionos-k8s2) Server Deleted
2022/12/21 06:47:56 [INFO] [node-controller] (ionos-k8s1) LAN Deleted
2022/12/21 06:48:03 [INFO] [node-controller] (ionos-k8s2) LAN Deleted
2022/12/21 06:48:06 [INFO] [node-controller] (ionos-k8s1) DataCenter Deleted
mueller-tobias commented 1 year ago

I used the SSH keys created by the driver to connect to the new VM. The connection with the SSH keys is not the problem, neither from the Rancher pod nor from my laptop. Docker, for example, is installed via SSH once cloud-init is done. Here's the output of docker info from the node created by the driver:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
  compose: Docker Compose (Docker Inc., v2.14.1)
  scan: Docker Scan (Docker Inc., v0.23.0)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9ba4b250366a5ddde94bb7c9d1def331423aa323
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-125-generic
 Operating System: Ubuntu 20.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 3.81GiB
 Name: ionos-k8s2
 ID: SLXH:VGFI:Z2OJ:4GVA:6UN4:NHDX:XG22:AMMH:MO74:EDMQ:CVED:IVCH
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
mrndev commented 1 year ago

I had exactly the same problem with my cloudconfig/userdata (error creating machine...). This is apparently already fixed in the main branch, but @avirtopeanu-ionos - I think we need a new RC to get further.

avirtopeanu-ionos commented 1 year ago

Hi, the new release candidate https://github.com/ionos-cloud/docker-machine-driver/releases/tag/v6.1.0-rc.2 should fix the issue with the cloudconfig / userdata.

mrndev commented 1 year ago

Hi Tobias, did you try the new release candidate? Did it fix the issue? Thanks, Martin

mueller-tobias commented 1 year ago

I was on vacation over the holidays. I'll do some tests later today or tomorrow.

mueller-tobias commented 1 year ago

Last week I was out of action and couldn't do the tests as planned. I'll take a look at the bug and the fixes in rc2 today and tomorrow.

mueller-tobias commented 1 year ago

The bug is fixed with v6.1.0-rc.2. I could successfully create a working Kubernetes cluster.