moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0

error on running swarm [manager stopped: can't initialize raft node: WAL error cannot be repaired: unexpected EOF] #2959

Open rodoufu opened 4 years ago

rodoufu commented 4 years ago

The problem happened after the machine ran out of space. Now I can neither leave the swarm nor create new containers.

docker service ls
Error response from daemon: This node is not a swarm manager. Worker nodes can't be used to view or modify cluster state. Please run this command on a manager node or promote the current node to a manager.

manager node

docker info
Containers: 35
 Running: 14
 Paused: 0
 Stopped: 21
Images: 77
Server Version: 17.09.0-ce
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: error
 NodeID:
 Error: manager stopped: can't initialize raft node: WAL error cannot be repaired: unexpected EOF
 Is Manager: false
 Node Address: 10.10.10.62
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-327.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 46.77GiB
Name: BJ-H03-12-cm.getui
ID: 34OK:O5JK:V3PU:SMDX:6SJS:ZT76:CIZ4:AHX7:OKAT:U2SK:LFGR:7T2S
Docker Root Dir: /app/docker/dataroot
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 https://kohnnhik.mirror.aliyuncs.com/
Live Restore Enabled: false

WARNING: overlay: the backing xfs filesystem is formatted without d_type support, which leads to incorrect behavior. Reformat the filesystem with ftype=1 to enable d_type support. Running without d_type support will not be supported in future releases.
WARNING: bridge-nf-call-ip6tables is disabled

I've tried to leave the swarm but it hasn't worked:

$ docker swarm leave
Error response from daemon: context deadline exceeded
$ docker swarm init
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
$ docker swarm leave --force
Error response from daemon: context deadline exceeded

Similar and unsolved: https://github.com/docker/classicswarm/issues/2819
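
For reference, when the daemon still runs but the manager's raft state is corrupt beyond repair (as the "WAL error cannot be repaired" message indicates), the usual last resort is to move the on-disk swarm state aside and re-initialize. This is a destructive sketch, not an official fix: it discards every service, secret, and config stored in the raft log, and the `swarm` directory lives under the "Docker Root Dir" reported by `docker info` (on this host that is /app/docker/dataroot, not the default /var/lib/docker):

```shell
# Last-resort recovery sketch: wipes ALL local swarm state (services,
# secrets, configs). Wrapped in a function so nothing runs until it is
# called deliberately.
recover_swarm_state() {
  docker_root="${1:-/var/lib/docker}"   # pass the "Docker Root Dir" from `docker info`
  sudo systemctl stop docker            # the state directory must not be in use
  # Move the corrupt raft/WAL state aside instead of deleting it,
  # in case anything can be salvaged later.
  sudo mv "${docker_root}/swarm" "${docker_root}/swarm.corrupt.$(date +%s)"
  sudo systemctl start docker
  docker swarm init                     # re-create a fresh single-node swarm
}
```

After this the node starts with no swarm membership at all, so workers have to be re-joined and stacks redeployed.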

rodoufu commented 4 years ago

It may interest @foxundermoon and @stowns

Sadbot commented 4 years ago

The same problem:

$ docker info
Containers: 32
 Running: 16
 Paused: 0
 Stopped: 16
Images: 20
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: error
 NodeID: 
 Error: manager stopped: can't initialize raft node: WAL error cannot be repaired: unexpected EOF
 Is Manager: false
 Node Address: 10.ххх.ххх.ххх
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-862.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: ххх
Total Memory: хххGiB
Name: sks06mpbl001
ID: OHEB:MSQ4:YYOF:PEWL:KOCU:I3BU:XWM4:3R3Y:NI54:ZHIS:L2LW:6GC2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
HTTP Proxy: http://127.0.0.1:3128
No Proxy: localhost,127.0.0.0/8,<host....>,<another-host>
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 https://<host_registry>/
Live Restore Enabled: false

Any help?

Sadbot commented 4 years ago

Releasing disk space on the host machine and then running sudo systemctl restart docker fixed it for me.
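
Before restarting the daemon it is worth confirming the filesystem backing the Docker root actually has room again. A small sketch (the path is an assumption; use the "Docker Root Dir" from `docker info`, and note `df --output` is GNU coreutils):

```shell
# Report how full the filesystem backing the Docker root is.
# DOCKER_ROOT is an assumption; a default install uses /var/lib/docker.
DOCKER_ROOT="${DOCKER_ROOT:-/var/lib/docker}"
# Fall back to / if that directory does not exist on this machine.
[ -d "$DOCKER_ROOT" ] || DOCKER_ROOT=/
usage=$(df --output=pcent "$DOCKER_ROOT" | tail -n 1 | tr -dc '0-9')
usage="${usage:-0}"
if [ "$usage" -ge 90 ]; then
  echo "filesystem ${usage}% full - free space before restarting docker"
else
  echo "filesystem ${usage}% full - try: sudo systemctl restart docker"
fi
```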

dreamtan commented 3 years ago

The same problem:

[root@itserver4 docker-deploy]# docker info
Client:
 Debug Mode: false

Server:
Containers: 95
 Running: 38
 Paused: 0
 Stopped: 57
Images: 102
Server Version: 19.03.9
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: error
 NodeID:
 Error: manager stopped: can't initialize raft node: WAL error cannot be repaired: unexpected EOF
 Is Manager: false
 Node Address: 10.116.200.4
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-1127.8.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.16GiB
Name: itserver4
ID: EDKT:BAQ2:G2UL:JVBH:ZRRW:23BB:HVB3:PFSX:DDP2:HMYV:SCNH:ZHTZ
Docker Root Dir: /mnt/data/varlib_docker/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 https://gpkhi0nk.mirror.aliyuncs.com/
Live Restore Enabled: false

WARNING: API is accessible on http://0.0.0.0:2375 without encryption. Access to the remote API is equivalent to root access on the host. Refer to the 'Docker daemon attack surface' section in the documentation for more information: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface

dreamtan commented 3 years ago

I found the cause: the disk space was exhausted.

rodoufu commented 3 years ago

> I found the cause: disk space is exhausted

Yes, I've described it in the issue description: it happened after the machine ran out of space. But even after I released some space, it was still happening.
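
For completeness: the documented way to restart a manager from damaged or quorum-less state is `docker swarm init --force-new-cluster`, which rebuilds a one-node cluster from the state this node already has. With a truncated WAL like the one in this error it may still refuse to start, in which case only the state-wipe route is left. A sketch (the node IP argument is a placeholder):

```shell
# Documented recovery path for a broken manager: restart the swarm from
# this node's existing state as a single-manager cluster.
# Wrapped in a function so nothing runs until called deliberately.
force_new_cluster() {
  # $1 is a placeholder for this node's advertise address, e.g. 10.10.10.62
  docker swarm init --force-new-cluster \
    --advertise-addr "${1:?usage: force_new_cluster <node-ip>}"
}
```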

jory3 commented 1 year ago

Old thread, but same issue here: when the disk of my swarm manager ran out of space, I got the above-mentioned error message. Even after freeing space and rebooting, the issue persists; I have not found a solution yet.

shrinidhi-live commented 1 year ago

@jory3 Did you find a solution eventually?

jory3 commented 1 year ago

@shrinidhi-live Unfortunately not. I finally set up the cluster again and then restored a Portainer backup.

ieugen commented 11 months ago

Hi, I am also running into this issue constantly.

We have a dev cluster with 20 nodes, 3 of them managers. On one of the manager nodes we run some workloads that, for some reason, fill the disk every couple of days. When this happens, the node's state changes to Down and it can't recover. I freed disk space and restarted the whole server, but the node would not rejoin the swarm cluster.

I had to remove the node from the swarm on the manager side and also make it leave the swarm locally (cleanup on both sides, since the node still believed it was in a swarm!?). Then I joined the node to the cluster again.

This has happened to me a couple of times, so it's reproducible. I believe it has happened on more than one node with the same issue.
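
The remove-and-rejoin sequence described above can be sketched as two halves, one per host. The node name, token, and manager address are placeholders taken from `docker node ls` and `docker swarm join-token` output, not values from this thread:

```shell
# Run on a healthy manager: drop the stale entry for the broken node and
# print a fresh join command for it. $1 is the hostname from `docker node ls`.
on_manager_remove() {
  docker node rm --force "$1"
  docker swarm join-token worker   # prints the join command to run on the node
}

# Run on the broken node itself: clear its local membership state, then
# paste the join command printed by the manager above.
on_broken_node() {
  docker swarm leave --force
  # docker swarm join --token <token> <manager-ip>:2377   # placeholder
}
```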