netdisco / netdisco-docker

Docker images for App::Netdisco
BSD 3-Clause "New" or "Revised" License

Bug, netdisco-docker is leaving zombie processes on config file change #49

Closed hveini closed 1 year ago

hveini commented 2 years ago

Expected Behavior

netdisco-backend should not leave any zombie processes on config file change.

Current Behavior

netdisco-backend leaves zombie processes on config file change.

# docker ps --format "table {{.ID}}\t{{.Image}}\t{{.Names}}"
CONTAINER ID   IMAGE                                   NAMES
28d3722f1500   netdisco/netdisco:2.055000-backend      docker-netdisco-backend-1
b055c2f28a9f   netdisco/netdisco:2.055000-web          docker-netdisco-web-1
9427fad0bd24   netdisco/netdisco:2.055000-postgresql   docker-netdisco-postgresql-1
# ps -AF|grep '[<]defunct>'
# touch netdisco/config/deployment.yml
# ps -AF|grep '[<]defunct>'
nd2      21473 21257  0     0     0   1 10:21 ?        00:00:00 [nd2: #1 sched: ] <defunct>
nd2      21475 21257  0     0     0   1 10:21 ?        00:00:00 [nd2: #3 poll: i] <defunct>
nd2      21476 21257  0     0     0   1 10:21 ?        00:00:00 [nd2: #4 poll: i] <defunct>
nd2      21477 21257  0     0     0   3 10:21 ?        00:00:00 [nd2: #5 poll: i] <defunct>
nd2      21478 21257  0     0     0   3 10:21 ?        00:00:00 [nd2: #6 poll: i] <defunct>
nd2      21479 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #7 poll: i] <defunct>
nd2      21480 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #8 poll: i] <defunct>
nd2      21481 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #9 poll: i] <defunct>
nd2      21482 21257  0     0     0   0 10:21 ?        00:00:00 [nd2: #10 poll: ] <defunct>
nd2      21483 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #11 poll: ] <defunct>
nd2      21484 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #12 poll: ] <defunct>
nd2      21485 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #13 poll: ] <defunct>
nd2      21486 21257  0     0     0   1 10:21 ?        00:00:00 [nd2: #14 poll: ] <defunct>
nd2      21487 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #15 poll: ] <defunct>
nd2      21488 21257  0     0     0   1 10:21 ?        00:00:00 [nd2: #16 poll: ] <defunct>
nd2      21489 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #17 poll: ] <defunct>
nd2      21490 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #18 poll: ] <defunct>
nd2      21492 21257  0     0     0   1 10:21 ?        00:00:00 [nd2: #20 poll: ] <defunct>
nd2      21493 21257  0     0     0   1 10:21 ?        00:00:00 [nd2: #21 poll: ] <defunct>
nd2      21494 21257  0     0     0   1 10:21 ?        00:00:00 [nd2: #22 poll: ] <defunct>
nd2      21495 21257  0     0     0   1 10:21 ?        00:00:00 [nd2: #23 poll: ] <defunct>
nd2      21496 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #24 poll: ] <defunct>
nd2      21497 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #25 poll: ] <defunct>
nd2      21498 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #26 poll: ] <defunct>
nd2      21499 21257  0     0     0   0 10:21 ?        00:00:00 [nd2: #27 poll: ] <defunct>
nd2      21594 21257  0     0     0   3 10:21 ?        00:00:01 [nd2: #2 mgr: id] <defunct>
nd2      21646 21257  0     0     0   2 10:21 ?        00:00:00 [nd2: #19 poll: ] <defunct>
# touch netdisco/config/deployment.yml 
# ps -AF|grep '[<]defunct>'|wc -l
54
# touch netdisco/config/deployment.yml 
# ps -AF|grep '[<]defunct>'|wc -l
81

--> each time the config file changes, a new set of workers is created and the old ones are left behind as zombies (27 more per change in the output above).
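
To confirm that all the zombies hang off the same parent (21257 in the listing above), something like the following sketch may help. It assumes a POSIX shell and awk; the function name zombies_by_parent is made up for illustration and is not part of netdisco.

```shell
#!/bin/sh
# Summarise zombie processes per parent PID.
# Reads "ppid stat" pairs on stdin; a state starting with Z marks a zombie.
# zombies_by_parent is a hypothetical helper name, not a netdisco tool.
zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ }
         END { for (p in n) print p, n[p] }' | sort -n
}

# Usage on the Docker host (not executed here):
#   ps -e -o ppid= -o stat= | zombies_by_parent
```

Each output line is a parent PID followed by the number of zombies it has not yet reaped, so a count that grows by the worker total after every touch points at the backend process.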

Possible Solution

I managed to fix this by adding "$SIG{CHLD} = 'IGNORE';" to the netdisco-backend script.

I added a volume for the file in docker-compose.yml:

...
  netdisco-backend:
    image: docker.io/netdisco/netdisco:2.055000-backend
    volumes:
      - "./netdisco/nd-site-local:/home/netdisco/nd-site-local"
      - "./netdisco/config:/home/netdisco/environments"
      - "./netdisco/logs:/home/netdisco/logs"
      - "./netdisco/netdisco-backend:/home/netdisco/perl5/bin/netdisco-backend"
...

And copied/changed the file:

# docker exec docker-netdisco-backend-1 cat /home/netdisco/perl5/bin/netdisco-backend > netdisco/netdisco-backend-o
# docker exec docker-netdisco-backend-1 cat /home/netdisco/perl5/bin/netdisco-backend > netdisco/netdisco-backend
# chmod +x netdisco/netdisco-backend 
# nano netdisco/netdisco-backend
# diff netdisco/netdisco-backend netdisco/netdisco-backend-o 
8,9d7
< $SIG{CHLD} = 'IGNORE';
< 

Then I restarted the containers and tried again:

# docker-compose down && docker-compose up -d
[+] Running 4/4
 ⠿ Container docker-netdisco-backend-1     Removed 0.4s
 ⠿ Container docker-netdisco-web-1         Removed 0.3s
 ⠿ Container docker-netdisco-postgresql-1  Removed 0.3s
 ⠿ Network docker_default                  Removed 0.2s
[+] Running 4/4
 ⠿ Network docker_default                  Created 0.2s
 ⠿ Container docker-netdisco-postgresql-1  Started 0.5s
 ⠿ Container docker-netdisco-web-1         Started 1.7s
 ⠿ Container docker-netdisco-backend-1     Started 1.6s
# ps -AF|grep '[<]defunct>'
# touch netdisco/config/deployment.yml
# ps -AF|grep '[<]defunct>'
# touch netdisco/config/deployment.yml
# ps -AF|grep '[<]defunct>'

I'm not sure whether this is a good solution or whether it breaks something else. But in my case, the features I'm using seem to be working.

Steps to Reproduce

  1. deploy netdisco-docker
  2. change the "netdisco/config/deployment.yml" file
  3. look for defunct processes
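
The steps above could be scripted roughly as follows. This is only a sketch: the config path comes from this report, while the function names, the number of rounds, and the delay are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count defunct entries in `ps` output read from stdin.
# The [<] bracket keeps grep from matching its own command line.
count_defunct() {
    grep -c '[<]defunct>' || true
}

# Hypothetical reproduction loop.
#   $1 config file to touch (default taken from the report)
#   $2 number of rounds (assumed default: 3)
#   $3 seconds to wait for the backend to reload (assumed default: 5)
reproduce() {
    config=${1:-netdisco/config/deployment.yml}
    rounds=${2:-3}
    delay=${3:-5}
    i=1
    while [ "$i" -le "$rounds" ]; do
        touch "$config"
        sleep "$delay"
        printf 'round %s: %s defunct\n' "$i" "$(ps -AF | count_defunct)"
        i=$((i + 1))
    done
}

# Usage on the Docker host (not executed here):
#   reproduce netdisco/config/deployment.yml 3 5
```

If the bug is present, the defunct count should grow by roughly the number of workers in every round.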

Context

I'm using netdisco only for specific devices, and I use my own scheduler (via netdisco-do). So deployment.yml has:

...
schedule:
  discoverall: null
  macwalk: null
  arpwalk: null
  nbtwalk: null
  expire:
    when: '30 23 * * *'
discover_only:
  - 127.0.0.1
...

and the "discover_only" list is periodically changed by my own scheduler script. Each change modifies the config file, which now causes a lot of zombie processes over time. I do not use the web interface at all, but read the data through the REST API.

Environment

Config info (deployment.yml and docker env settings)

# cat docker-compose.yml|grep -vE '^\s*#|^\s*$'
version: '3.9'
services:
  netdisco-postgresql:
    image: docker.io/netdisco/netdisco:2.055000-postgresql
    volumes:
      - "./netdisco/pgdata:/var/lib/postgresql/data"
    ports:
      - "5433:5432"
    restart: always
  netdisco-backend:
    image: docker.io/netdisco/netdisco:2.055000-backend
    volumes:
      - "./netdisco/nd-site-local:/home/netdisco/nd-site-local"
      - "./netdisco/config:/home/netdisco/environments"
      - "./netdisco/logs:/home/netdisco/logs"
      - "./netdisco/netdisco-backend:/home/netdisco/perl5/bin/netdisco-backend"
    environment:
      NETDISCO_DOMAIN:  discover
      NETDISCO_DB_HOST: netdisco-postgresql
    depends_on:
      - netdisco-postgresql
    dns_opt:
      - 'ndots:0'
      - 'timeout:1'
      - 'retries:0'
      - 'attempts:1'
      - edns0
      - trustad
    restart: always
  netdisco-web:
    image: docker.io/netdisco/netdisco:2.055000-web
    volumes:
      - "./netdisco/nd-site-local:/home/netdisco/nd-site-local"
      - "./netdisco/config:/home/netdisco/environments"
    environment:
      NETDISCO_DOMAIN:  discover
      NETDISCO_DB_HOST: netdisco-postgresql
    ports:
      - "5000:5000"
    depends_on:
      - netdisco-postgresql
    dns_opt:
      - 'ndots:0'
      - 'timeout:1'
      - 'retries:0'
      - 'attempts:1'
      - edns0
      - trustad
    restart: always
# cat netdisco/config/deployment.yml|grep -vE '^\s*#|^\s*$'
database:
  name: 'netdisco'
  user: 'netdisco'
  pass: 'netdisco'
site_local_files: true
no_auth: false
community:
  - public
snmp_auth:
  - tag: v3u1
    user: netdisco
    auth:
      pass: disconet
      proto: SHA
    priv:
      pass: disconet
      proto: AES
schedule:
  discoverall: null
  macwalk: null
  arpwalk: null
  nbtwalk: null
  expire:
    when: '30 23 * * *'
expire_devices: 2
workers:
  tasks: '25'
  timeout: 600
  sleep_time: 1
  min_runtime: 0.5
  max_deferrals: 0
  retry_after: 0
snmptimeout: 300000
snmpretries: 1
path: '/netdisco/'
log: warning
dns:
  max_outstanding: 50
  hosts_file: '/etc/hosts'
  no: ["0.0.0.0/0","::/0"]
discover_only:
  - 127.0.0.1

rc9000 commented 1 year ago

Thanks @hveini for this thorough bug report and even finding a possible solution!

I also did some digging and was very puzzled by how differently restarting and general signal handling work in Docker.

Looking at the installed signal handlers in /proc, or rather their absence...

(shell in container) /home/netdisco # cat /proc/1/status
Name:   netdisco-backend
...
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000080
SigCgt: 0000000000004000   (that is binary 0100000000000000)
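
For reference, these hex masks can be decoded into signal numbers with a few lines of shell. The sketch below assumes the Linux convention that the least significant bit stands for signal 1; decode_sigmask is a made-up name, and the loop does not handle masks with the top (64th) bit set.

```shell
#!/bin/sh
# Print the signal numbers set in a /proc/<pid>/status mask (hex, no 0x prefix).
# Bit 0 (least significant) stands for signal 1, bit 1 for signal 2, and so on.
# Limitation: a mask with bit 63 set is negative in shell arithmetic and is skipped.
decode_sigmask() {
    mask=$((0x$1))
    sig=1
    out=""
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            out="${out:+$out }$sig"
        fi
        mask=$((mask >> 1))
        sig=$((sig + 1))
    done
    printf '%s\n' "$out"
}

# Usage with the values shown above:
#   decode_sigmask 0000000000004000   # SigCgt
#   decode_sigmask 0000000000000080   # SigIgn
```

Read this way, SigCgt contains only signal 15 (SIGTERM) and SigIgn only signal 8 (SIGFPE). Signal 17 (SIGCHLD on x86 and ARM Linux) appears in neither mask, which is consistent with PID 1 having no SIGCHLD handler installed.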

... I finally found these:

Long story short: to get proper signal handling in a multi-process environment like the netdisco processes, we should run an init process as PID 1 instead of running netdisco-backend as PID 1 directly. This just means adding "init: true" to both services in docker-compose.yml:

ram@cicd:/tmp/issue49 $ grep -B 3 init docker-compose.yml

  netdisco-backend:
    image: netdisco/netdisco:latest-backend
    init: true
...
  netdisco-web:
    image: netdisco/netdisco:latest-web
    init: true

With this, both processes seem to reload fine on config changes, just like when running in a non-containerized environment.
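
One way to check that the init process is actually in place is to look at the name of PID 1 inside the container. The sketch below is illustrative only: pid1_name is a made-up helper, the container name is the one from the original report, and the exact name reported for Docker's bundled init is an assumption (it is typically docker-init).

```shell
#!/bin/sh
# Extract the process name from /proc/<pid>/status content read on stdin.
# pid1_name is a hypothetical helper, not part of netdisco or Docker.
pid1_name() {
    awk '$1 == "Name:" { print $2; exit }'
}

# Usage on the Docker host (not executed here):
#   docker exec docker-netdisco-backend-1 cat /proc/1/status | pid1_name
```

Without the init setting this should print netdisco-backend, as in the /proc/1/status output above. With it, PID 1 should be the init process, and netdisco-backend should run as its child.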