rancher / os

Tiny Linux distro that runs the entire OS as Docker containers
https://rancher.com/docs/os/v1.x/en/
Apache License 2.0

Processes stuck when rebooting on Proxmox VE #2647

Open kingsd041 opened 5 years ago

kingsd041 commented 5 years ago

RancherOS Version: (ros os version) v1.5.0

Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.) Proxmox VE

I can see the following errors in respawn.log:

[root@autofmt-11 log]# cat /var/log/boot/respawn.log
time="2019-01-14T08:38:18Z" level=debug msg="START: [respawn -f /etc/respawn.conf] in /"
time="2019-01-14T08:38:18Z" level=info msg="respawn RancherOS
built: '2019-01-14T06:01:33Z', 4cd81db-dirty"
time="2019-01-14T08:42:09Z" level=info msg="sending SIGTERM to 1260"
time="2019-01-14T08:42:09Z" level=info msg="sending SIGTERM to 1270"
time="2019-01-14T08:42:09Z" level=info msg="sending SIGTERM to 1253"
time="2019-01-14T08:42:09Z" level=info msg="sending SIGTERM to 1239"
time="2019-01-14T08:42:09Z" level=info msg="sending SIGTERM to 1272"
time="2019-01-14T08:42:09Z" level=info msg="sending SIGTERM to 1274"
time="2019-01-14T08:42:09Z" level=info msg="sending SIGTERM to 1250"
time="2019-01-14T08:42:09Z" level=error msg="Wait cmd to exit: /sbin/agetty --noclear tty6 linux, err: signal: terminated"
time="2019-01-14T08:42:09Z" level=info msg="/sbin/agetty --noclear tty6 linux : not restarting, exiting"
time="2019-01-14T08:42:09Z" level=info msg="FINISHED: /sbin/agetty --noclear tty6 linux"
time="2019-01-14T08:42:09Z" level=error msg="Wait cmd to exit: /sbin/agetty --noclear tty3 linux, err: signal: terminated"
time="2019-01-14T08:42:09Z" level=info msg="/sbin/agetty --noclear tty3 linux : not restarting, exiting"
time="2019-01-14T08:42:09Z" level=info msg="FINISHED: /sbin/agetty --noclear tty3 linux"
time="2019-01-14T08:42:09Z" level=info msg="/usr/sbin/sshd -D : not restarting, exiting"
time="2019-01-14T08:42:09Z" level=info msg="FINISHED: /usr/sbin/sshd -D"
time="2019-01-14T08:42:09Z" level=error msg="Wait cmd to exit: /sbin/agetty --noclear tty4 linux, err: signal: terminated"
time="2019-01-14T08:42:09Z" level=info msg="/sbin/agetty --noclear tty4 linux : not restarting, exiting"
time="2019-01-14T08:42:09Z" level=info msg="FINISHED: /sbin/agetty --noclear tty4 linux"
time="2019-01-14T08:42:09Z" level=error msg="Wait cmd to exit: /sbin/agetty -n -l /usr/bin/autologin -o rancher:tty1 --noclear tty1 linux, err: signal: terminated"
time="2019-01-14T08:42:09Z" level=info msg="/sbin/agetty -n -l /usr/bin/autologin -o rancher:tty1 --noclear tty1 linux : not restarting, exiting"
time="2019-01-14T08:42:09Z" level=info msg="FINISHED: /sbin/agetty -n -l /usr/bin/autologin -o rancher:tty1 --noclear tty1 linux"
time="2019-01-14T08:42:09Z" level=error msg="Wait cmd to exit: /sbin/agetty --noclear tty2 linux, err: signal: terminated"
time="2019-01-14T08:42:09Z" level=info msg="/sbin/agetty --noclear tty2 linux : not restarting, exiting"
time="2019-01-14T08:42:09Z" level=info msg="FINISHED: /sbin/agetty --noclear tty2 linux"
time="2019-01-14T08:42:09Z" level=error msg="Wait cmd to exit: /sbin/agetty --noclear tty5 linux, err: signal: terminated"
time="2019-01-14T08:42:09Z" level=info msg="/sbin/agetty --noclear tty5 linux : not restarting, exiting"
time="2019-01-14T08:42:09Z" level=info msg="FINISHED: /sbin/agetty --noclear tty5 linux"
time="2019-01-14T08:45:28Z" level=debug msg="START: [respawn -f /etc/respawn.conf] in /"
time="2019-01-14T08:45:28Z" level=info msg="respawn RancherOS
built: '2019-01-14T06:01:33Z', 4cd81db-dirty"


lkraider commented 4 years ago

I am having the same issue. The guest receives the shutdown command but nothing happens:

proxmox: /var/log/messages

Oct  5 14:22:39 server pve-guests[920]: <root@pam> starting task UPID:server:00000E8A:0CDD36F8:5D98A72F:qmshutdown:107:root@pam:

rancher: /var/log/messages

Oct  5 14:22:39 rancher qemu-ga: info: guest-shutdown called, mode: (null)

rancher: /var/log/boot/shutdown.log

time="2019-10-05T14:22:39Z" level=debug msg="START: [shutdown -h -P +0 hypervisor initiated shutdown] in /"
time="2019-10-05T14:22:39Z" level=error msg="Sorry, can't parse '+0' as time value (only 'now' supported)"

At this point Proxmox shows "Guest Agent not running" in the VM UI and keeps waiting for the shutdown.

Rancher was installed using rancheros-proxmoxve.iso 1.5.1 and upgraded to 1.5.3 using the ros os command.

Proxmox version 5.4.

GreenTeaBalls commented 4 years ago

I have the same issue getting RancherOS to shut down when run as a VM under Proxmox.

Any chance this may be treated as a bug (and possibly be fixed earlier than an undated 1.6.0 release)?

This makes running RancherOS on Proxmox unfeasible: if my power goes out, Proxmox can't shut down RancherOS, and all my containers will eventually be killed when the UPS runs out of power...

bf8392 commented 4 years ago

I have the same issue, and I think I've found the problem: the qemu guest agent container shuts down, but it doesn't shut down the host system. I also use the apcupsd Docker image, and its developer solved the problem with a cron job and a script:

https://github.com/gersilex/apcupsd-docker

Maybe it helps solve the issue...

GreenTeaBalls commented 4 years ago

Thanks for the workaround! I have to admit that I have abandoned RancherOS for that reason and am currently just using a vanilla Debian install to host my docker containers. Proxmox manages to shut down this VM just fine.

bf8392 commented 4 years ago

I found an even easier solution: disable the qemu guest agent in the VM options. Proxmox then shuts down the machine via an ACPI event, which seems to work...

lkraider commented 4 years ago

Issue seems to be here: https://github.com/rancher/os/blob/7c84c5f7e4f624c1d0229ac92fe2dbbd9c8780a3/cmd/power/shutdown.go#L218

Should just need to update the parser to accept +0 as a valid parameter.

travisghansen commented 4 years ago

I rebuilt the openstack image variants with this to get around this issue. Others may find it useful (as I still have the issue even with the latest version). Disabling the agent is a non-option for my use case.

#cloud-config

# Giant mess of qemu-guest-agent
# https://github.com/rancher/os-services/blob/master/q/qemu-guest-agent.yml
# https://github.com/qemu/qemu/blob/master/qga/commands-posix.c#L84
# https://github.com/rancher/os/issues/2822
# https://github.com/rancher/os/issues/2647
# https://ezunix.org/index.php?title=Prevent_suspending_when_the_lid_is_closed_on_a_laptop_in_RancherOS
#
# The problem is overly complicated because of 2 things:
# - qemu-ga is hard-coded to invoke /sbin/shutdown (cannot simply create a wrapper higher in the PATH)
# - the rancher qemu-guest-agent service mounts 'volumes_from'
#    which bind mount the above path, so it's impossible to use
#    the supported image, therefore we've replaced it with a generic
#    qemu-guest-agent image
#
# Also note, due to weirdness, we simply bind mount the system-docker
# binary into the container and exec ros in *another* container to
# actually trigger the reboot

runcmd:
- sudo rm -rf /var/lib/rancher/resizefs.done

# when the qemu-guest-agent issue is fixed, all agent-related garbage below
# can simply be replaced by this..
#- sudo ros service enable qemu-guest-agent
#- sudo ros service up qemu-guest-agent

rancher:
  resize_device: /dev/sda

  services:
    qemu-guest-agent:
      image: linuxkit/qemu-ga:v0.8
      command: /usr/bin/qemu-ga
      privileged: true
      restart: always
      labels:
        io.rancher.os.scope: system
        io.rancher.os.after: console
      pid: host
      ipc: host
      net: host
      uts: host
      volumes:
      - /dev:/dev
      - /usr/bin/ros:/usr/bin/ros
      - /var/run:/var/run
      - /usr/bin/system-docker:/usr/bin/system-docker
      - /home/rancher/overlay/qemu-guest-agent/etc/qemu/qemu-ga.conf:/etc/qemu/qemu-ga.conf
      - /home/rancher/overlay/qemu-guest-agent/sbin/shutdown:/sbin/shutdown
      volumes_from:
      - system-volumes
      - user-volumes

write_files:

  # not required, just here in case I want to enable verbose for testing
  - path: /home/rancher/overlay/qemu-guest-agent/etc/qemu/qemu-ga.conf
    permissions: "0755"
    owner: root
    content: |
      [general]
      daemon=false
      method=virtio-serial
      path=/dev/virtio-ports/org.qemu.guest_agent.0
      pidfile=/var/run/qemu-ga.pid
      statedir=/var/run
      verbose=false
      blacklist=

  - path: /home/rancher/overlay/qemu-guest-agent/sbin/shutdown
    permissions: "0755"
    owner: root
    content: |
      #!/bin/sh

      ARGS=$(echo "${@}" | sed 's/+0/now/g')
      system-docker exec console ros entrypoint shutdown $ARGS

scarfacestrawberry commented 3 years ago

Still an issue. travisghansen's solution still works.

joshspicer commented 3 years ago

still seeing this as well.

I (reluctantly) just turned off the qemu agent to get around this

scarfacestrawberry commented 3 years ago

@joshspicer don't expect it to ever be fixed; RancherOS is dead. See the official documentation: https://rancher.com/docs/os/v1.x/en/