mKeRix / room-assistant

Presence tracking and more for automation on the room-level
https://www.room-assistant.io
MIT License

Resetting hci0 in Docker container is causing zombie processes #157

Closed: iicky closed this issue 4 years ago

iicky commented 4 years ago

Describe the bug

I am getting a lot of zombie processes as a result of my room-assistant Docker container. It appears that whenever a BluetoothClassicService query takes too long and hci0 is reset, the hcitool process behind that query is left as a zombie. The zombies accumulate over time until I restart the room-assistant container, at which point they are all cleaned up.

This issue is specifically impacting my deployment on my Intel NUC and is not occurring on my Raspberry Pi deploys.

To reproduce

Deploy using the docker-compose file and config below.

Relevant logs

Room Assistant logs.

docker logs -f room-assistant

[Nest] 1   - 03/22/2020, 11:44:54 AM   [BluetoothClassicService] Query of xx:xx:xx:xx:xx:xx took too long, resetting hci0
[Nest] 1   - 03/22/2020, 11:46:24 AM   [BluetoothClassicService] Query of xx:xx:xx:xx:xx:xx took too long, resetting hci0
[Nest] 1   - 03/22/2020, 11:46:35 AM   [BluetoothClassicService] Query of xx:xx:xx:xx:xx:xx took too long, resetting hci0
[Nest] 1   - 03/22/2020, 11:46:36 AM   [BluetoothClassicService] Query of xx:xx:xx:xx:xx:xx took too long, resetting hci0
[Nest] 1   - 03/22/2020, 11:55:06 AM   [BluetoothClassicService] Query of xx:xx:xx:xx:xx:xx took too long, resetting hci0

Zombie processes.

user@core:~$ ps auxwww | grep ' Z '
4 Z root      5859 28012  0  80   0 -     0 -      13:44 ?        00:00:00 [hcitool] <defunct>
4 Z root      7486 28012  0  80   0 -     0 -      13:46 ?        00:00:00 [hcitool] <defunct>
4 Z root      7672 28012  0  80   0 -     0 -      13:46 ?        00:00:00 [hcitool] <defunct>
4 Z root      7713 28012  0  80   0 -     0 -      13:46 ?        00:00:00 [hcitool] <defunct>

Relevant configuration

Docker Compose docker-compose.yaml

version: "3.1"
services:

  # Room Assistant-----------------------------------
  room-assistant:
    container_name: room-assistant
    image: mkerix/room-assistant
    network_mode: host
    ports:
      - 6425:6425
    cap_add:
      - NET_ADMIN
    volumes:
      - /var/run/dbus:/var/run/dbus
      - ./room-assistant/config:/room-assistant/config
      - /etc/localtime:/etc/localtime:ro
    restart: always

Room Assistant config local.yml

global:
  instanceName: office
  integrations:
    - homeAssistant
    - bluetoothClassic
homeAssistant:
  mqttUrl: 'mqtt://XXX.XXX.XXX.XXX:1883'
  mqttOptions:
    username: <username>
    password: <password>
cluster:
  networkInterface: eno1
  port: 6425
  peerAddresses:
    - <raspberry_pi_zero_1>:6425
    - <raspberry_pi_zero_2>:6425
bluetoothClassic:
  addresses:
    - '<mac_address_1>'
    - '<mac_address_2>' 

Expected behavior

I expect the zombie processes to not appear.

Environment

Additional context

Zombie processes are not appearing on either of the Raspberry Pis that have room-assistant installed.

mKeRix commented 4 years ago

Thanks for the bug report - I'll try to reproduce it with my Ubuntu NUC at home. In theory the processes should be hard-killed by NodeJS, which is also what I observed when I tried it. I didn't try it with this specific setup though, so maybe we need some small adaptations in the Dockerfile.

iicky commented 4 years ago

@mKeRix Awesome, thanks! Let me know if you need any additional info.

mwasowski commented 4 years ago

I have the same issue.

Docker logs

[Nest] 1   - 04/03/2020, 5:00:53 PM   [BluetoothClassicService] Query of XX:XX:XX:XX:XX:XX took too long, resetting hci0
[Nest] 1   - 04/03/2020, 5:01:23 PM   [BluetoothClassicService] Query of XX:XX:XX:XX:XX:XX took too long, resetting hci0
[Nest] 1   - 04/03/2020, 5:01:29 PM   [BluetoothClassicService] Query of XX:XX:XX:XX:XX:XX took too long, resetting hci0
[Nest] 1   - 04/03/2020, 5:01:35 PM   [BluetoothClassicService] Query of XX:XX:XX:XX:XX:XX took too long, resetting hci0

Zombie processes

root      3779  0.0  0.0      0     0 ?        Z    19:00   0:00 [hcitool] <defunct>
root      3782  0.0  0.0      0     0 ?        Z    19:00   0:00 [hcitool] <defunct>
root      3799  0.0  0.0      0     0 ?        Z    19:01   0:00 [hcitool] <defunct>
root      3802  0.0  0.0      0     0 ?        Z    19:01   0:00 [hcitool] <defunct>
root      3805  0.0  0.0      0     0 ?        Z    19:01   0:00 [hcitool] <defunct>

Environment

room-assistant version: 2.2.0
installation type: Docker
hardware: ProxmoxVE
OS: Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux

Config Docker

version: '3'
services:
  room-assistant:
    container_name: room-assistant
    image: mkerix/room-assistant
    restart: unless-stopped
    network_mode: host
    cap_add:
      - NET_ADMIN
    volumes:
      - /var/run/dbus:/var/run/dbus
      - /etc/room-assistant/config:/room-assistant/config

Room Assistant

global:
  instanceName: Home
  integrations:
    - homeAssistant
    - bluetoothClassic
homeAssistant:
  mqttUrl: 'mqtt://XXX.XXX.XXX.XXX:1883'
  mqttOptions:
    username: <user>
    password: <pass>
bluetoothClassic:
  addresses:
    - 'XX:XX:XX:XX:XX:XX'

iicky commented 4 years ago

I just tested today using a USB Bluetooth dongle on the NUC and changing the device to hci1. The result was even more zombie processes, so I don't think it is the Bluetooth device.

mwasowski commented 4 years ago

Agreed. Also, the fact that we all use different dongles and built-in chips should in theory rule out a hardware issue.

mKeRix commented 4 years ago

I also think it's unlikely that this is related to hardware. My current idea is that hcitool on Alpine Linux (which is used for the Docker images) doesn't handle SIGKILL correctly, or that NodeJS doesn't send the signal correctly. As Alpine Linux is rarely used outside of Docker, that would explain why we are only seeing the issue there.
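
For anyone following along, here is a small Python sketch of what a `<defunct>` entry means on Linux. It is not room-assistant code and does not call hcitool; the helper names `process_state` and `kill_and_observe` are made up for illustration. A child that receives SIGKILL cannot catch or ignore it, but once it dies it stays in the process table in state `Z` until its parent waits on it.

```python
import signal
import subprocess
import sys
import time
from typing import Tuple


def process_state(pid: int) -> str:
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        data = stat_file.read()
    # The state is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def kill_and_observe(timeout: float = 5.0) -> Tuple[str, int]:
    """SIGKILL a sleeping child, report its state before reaping and its return code."""
    # Stand-in for a long-running query: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(signal.SIGKILL)

    # The child is dead, but nobody has waited on it yet, so it should show up as "Z".
    deadline = time.monotonic() + timeout
    state = process_state(child.pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = process_state(child.pid)

    # wait() reaps the child; only now does its process table entry disappear.
    return_code = child.wait()
    return state, return_code


if __name__ == "__main__":
    print(kill_and_observe())
```

If the parent never calls `wait()`, or the dead process belongs to a parent that does not know about it, the entry stays around, which would look like the `[hcitool] <defunct>` lines reported above.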

iicky commented 4 years ago

I tried out a Debian image using the following Dockerfile and I am still getting the same zombie processes. I did my best to find matching or similar packages so I'm not 100% sure the image is a complete Debian replacement, but I can confirm that the zombie processes are still appearing with Debian.

FROM node:12-slim as build
ARG ROOM_ASSISTANT_VERSION=latest

RUN apt-get update && apt-get upgrade -y
RUN apt-get install -y python g++ make libusb-dev avahi-utils libavahi-compat-libdnssd-dev

RUN npm install -g --unsafe-perm room-assistant@$ROOM_ASSISTANT_VERSION

FROM node:12-slim

WORKDIR /room-assistant

RUN apt-get update && apt-get install -y bluez libusb-dev avahi-utils dmidecode libavahi-compat-libdnssd1

RUN ln -s /usr/local/lib/node_modules/room-assistant/bin/room-assistant.js /usr/local/bin/room-assistant
COPY --from=build /usr/local/lib/node_modules/room-assistant /usr/local/lib/node_modules/room-assistant

ENTRYPOINT ["room-assistant"]
CMD ["--digResolver"]

mKeRix commented 4 years ago

Thanks for checking that already @iicky - saved me some work. I reproduced the issue on a Raspi 3 with Docker today and found a fix. Expect it to be released sometime later today.

The issue arose from the way Docker manages processes, or rather from the fact that NodeJS wasn't made to run as PID 1. There is some more information in this article.
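
As background on the PID 1 point: on Linux, orphaned processes are re-parented to PID 1, and PID 1 is expected to wait on them once they exit. A process that was not written as an init usually only waits on the children it started itself, so anything else that ends up under it can linger as a zombie. The Python sketch below shows the kind of reaping loop a small init process runs. It is an illustration only, not the change shipped in 2.4.0 (this thread does not describe that change), and the function names are hypothetical.

```python
import os
import time
from typing import List, Set


def reap_exited_children() -> List[int]:
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none of them has exited yet
        reaped.append(pid)
    return reaped


def spawn_short_lived_children(count: int) -> List[int]:
    """Fork `count` children that exit immediately and become zombies until reaped."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit right away without running cleanup handlers
        pids.append(pid)
    return pids


def reap_until(expected: Set[int], timeout: float = 5.0) -> Set[int]:
    """Poll the reaper until every PID in `expected` was collected or time runs out."""
    collected: Set[int] = set()
    deadline = time.monotonic() + timeout
    while not expected <= collected and time.monotonic() < deadline:
        collected.update(reap_exited_children())
        time.sleep(0.01)
    return collected


if __name__ == "__main__":
    children = spawn_short_lived_children(3)
    print(sorted(reap_until(set(children))) == sorted(children))
```

In a container, one common way to get this behaviour without touching the application is to run a small init as PID 1, for example with `docker run --init`. Whether that matches what the fix does is an assumption on my part.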

github-actions[bot] commented 4 years ago

:tada: This issue has been resolved in version 2.4.0 :tada:

iicky commented 4 years ago

Perfect - I can confirm that there are no more zombie processes after the update. Thanks so much!

mwasowski commented 4 years ago

Same here, works like a charm, fantastic work! Thanks for the fix and the link to give us a bit more background.