rancher / os

Tiny Linux distro that runs the entire OS as Docker containers
https://rancher.com/docs/os/v1.x/en/
Apache License 2.0

RancherOS sometimes leaves a defunct dockerd after ros console switch #2642

Open hpdvanwyk opened 5 years ago

hpdvanwyk commented 5 years ago

RancherOS Version: (ros os version) v1.5.0

Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.) Virtualbox

RancherOS sometimes leaves a defunct dockerd process behind after switching consoles with ros console switch <console>. The pid file at /var/run/docker.pid is also not cleaned up, so the user-docker container is unable to start. This only seems to happen once Kubernetes has been installed on RancherOS, not on a clean install.

Rebooting fixes the problem.

To reproduce in Virtualbox:

cloud-config.yml

#cloud-config
ssh_authorized_keys:
  - ssh-rsa ...
rancher:
  network:
    interfaces:
      eth0:
        dhcp: true
      eth1:
        address: 192.168.99.53/24
        mtu: 1500
        dhcp: false

Installed using the RancherOS iso with:

sudo ros install -c http://whereveryouhostit/cloud-config.yml -d /dev/sda

rancher-cluster.yml

ssh_agent_auth: true
nodes:
  - address: 192.168.99.53
    user: rancher
    role: [controlplane,worker,etcd]

services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h
  kube-controller:
    # CIDR pool used to assign IP addresses to pods in the cluster
    cluster_cidr: 10.42.0.0/16

ignore_docker_version: true

network:
    plugin: flannel
    options:
      flannel_iface: eth1
      flannel_backend_type: vxlan

Then brought up the cluster with:

rke up --config ./rancher-cluster.yml

Now ssh to the rancher machine and switch the console:

rancher@rancher:~$ sudo ros console switch ubuntu
...
Connection to 192.168.99.53 closed.
$:~/rancherostesting$ ssh rancher@192.168.99.53
...
rancher@rancher:~$ docker ps
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.38/containers/json: dial unix /var/run/docker.sock: connect: permission denied

After some debugging:

rancher@rancher:/var/log$ tail docker.log 
Error starting daemon: pid file found, ensure docker is not running or delete /var/run/docker.pid
...
rancher@rancher:/var/log$ cat /var/run/docker.pid 
765
rancher@rancher:/var/log$ ps aux | grep 765
root       765  1.6  0.0      0     0 ?        Zs   11:28   0:08 [dockerd] <defunct>
rancher@rancher:/var/log$ ps  xao pid,ppid,pgid,sid,comm | grep dockerd
    1     0     0     0 system-dockerd
  765     1   765   765 dockerd <defunct>
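
The manual checks above (read the pid file, then look up that pid's state) can be sketched as a small shell function. This is only a diagnostic sketch, not a fix and not anything RancherOS ships: the name pidfile_state and its optional second argument are hypothetical, and it assumes a Linux /proc filesystem where /proc/<pid>/stat reports the process state (Z for a zombie) right after the command name.

```shell
# Classify what a pid file points at: missing, invalid, stale, defunct, or running.
# Usage: pidfile_state PIDFILE [PROC_ROOT]
# PROC_ROOT defaults to /proc; it is a parameter only so the function can be
# exercised against a fake directory tree.
pidfile_state() {
    pidfile=$1
    proc_root=${2:-/proc}

    # No pid file at all.
    [ -f "$pidfile" ] || { echo missing; return 0; }

    # Strip whitespace and make sure only digits are left.
    pid=$(tr -d '[:space:]' < "$pidfile")
    case $pid in
        ''|*[!0-9]*) echo invalid; return 0 ;;
    esac

    # The pid file names a process that no longer exists.
    stat_file=$proc_root/$pid/stat
    [ -r "$stat_file" ] || { echo stale; return 0; }

    # The stat line looks like "765 (dockerd) Z 1 ...". The command name may
    # itself contain spaces or parentheses, so drop everything up to the last
    # ") " before reading the state field.
    state=$(sed 's/^.*) //' "$stat_file" | cut -d' ' -f1)

    if [ "$state" = Z ]; then
        echo defunct
    else
        echo running
    fi
}
```

With the state shown in this report, pidfile_state /var/run/docker.pid would be expected to print defunct, since pid 765 is still listed as a zombie. A result of stale would instead mean the pid file was left behind by a process that is gone entirely.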

AliMD commented 5 years ago

Same issue! I switched the console to Debian on all nodes (automatically, with a maintenance script) and the Kubernetes cluster went down completely! We installed K8s on RancherOS for zero downtime 😕 Is there any solution?

sudo ros os version
v1.5.2

tail /var/log/docker.log
Error starting daemon: pid file found, ensure docker is not running or delete /var/run/docker.pid

ls -lAhF /var/run/docker.pid 
-rw-r--r-- 1 root root 4 Aug 11 05:07 /var/run/docker.pid