Child processes aren't being completely killed before new child processes are created

morganwalker commented 6 years ago

Envconsul version

# envconsul -v
envconsul v0.7.3 (daa2947)

Configuration

vault.hcl

vault {
  address = "https://domain.com:443"
  grace = "15s"
  renew_token = true

  retry {
    enabled = true
    attempts = 5
    backoff = "250ms"
    max_backoff = "20s"
  }

  ssl {
    enabled = true
    verify = true
    server_name = "domain.com"
  }
}
exec {
  kill_signal = "SIGINT"
  kill_timeout = "10s"
  splay = "2s"

  env {
    blacklist = ["VAULT_*"]
  }
}

wait {
  min = "5s"
  max = "10s"
}
log_level = "info"
upcase = true

postgresql.hcl

secret {
  no_prefix = true
  path = "database/creds/pg-service"
  format = "PG_{{ key }}"
}

Command

/usr/local/bin/envconsul -config=/etc/envconsul/postgresql.hcl -config=/etc/envconsul/vault.hcl /opt/start.sh

Debug output

debug logs from example in question
debug logs from other apps experiencing similar behavior

Expected behavior

The original child process should have been completely terminated before spawning a new child process.

Actual behavior

The original child process is still bound to its addresses, which prevents the new child process from starting.

Steps to reproduce

We're using envconsul to spawn any process that needs postgres credentials. Our base image is built off of alpine:3.7, which installs envconsul 0.7.3 and runs:

ENTRYPOINT ["/usr/local/bin/entrypoint"]

CMD ["/bin/bash"]

where our entrypoint performs:

#!/bin/bash

if [ "$(ls -A /etc/envconsul)" ]; then
  ARGS=()
  # Quote the glob so the shell does not expand it before find sees it.
  for i in $(find /etc/envconsul -name '*.hcl' -type f); do
    echo "Loading config: $i"
    FILE=""
    DEREF=$(readlink -f "$i")
    if [[ "$?" -eq 0 ]]; then FILE=$DEREF; else FILE=$i; fi
    ARGS+=("-config=$FILE")
  done
  ARGS+=("$@")
  echo "envconsul configured, using configured credentials"
  exec /usr/local/bin/envconsul "${ARGS[@]}"
else
  echo "envconsul not configured, running without vault credentials"
  exec "$@"
fi

Our postgres-exporter image then simply runs CMD [ "/opt/start.sh" ], and when the container starts we see:

    1 root       0:00 /usr/local/bin/envconsul -config=/etc/envconsul/postgresql.hcl -config=/etc/envconsul/vault.hcl /opt/start.sh
   26 root       0:00 {start.sh} /bin/bash /opt/start.sh

We've tried adjusting the exec splay, exec kill_signal, and exec kill_timeout settings, the wait min and max, and -once, but no combination has worked so far. What do we need to do to kill the original child process so that its successor can spawn?

eikenb commented 5 years ago

Hey @morganwalker, thanks for taking the time to submit a ticket.

I know it's been a long time, sorry for that, but if you could come up with a simplified, minimal version of the scenario that triggers this issue, it would be very helpful.

Thanks.

morganwalker commented 5 years ago

@eikenb Thanks for checking in but we can go ahead and close this out.

eikenb commented 5 years ago

@morganwalker That's great!

Mind if I ask what happened? Why is it no longer an issue?

Thanks.

morganwalker commented 5 years ago

We actually slimmed down our infrastructure stack quite a bit and no longer require Vault or envconsul.

eikenb commented 5 years ago

@lopfe .. you :+1:'d this... are you still seeing it? Could you put together a simpler example to reproduce it?

eikenb commented 4 years ago

I've looked over the code pretty closely and don't see a flaw in it. It blocks to wait on the child process exiting before moving on to start the new child process.

Could the processes in these cases have forked a child process of their own that doesn't exit until after the parent process does? That could cause this issue: envconsul would be notified when the parent process exits and go on to start the new process, while the child's child (grandchild?) process would still be hanging around.

What is really needed is a minimal setup that reproduces this.
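
To illustrate the grandchild scenario raised in this comment, here is a minimal sketch. It is not envconsul's code and not the gist linked later in the thread. It assumes a supervisor that signals and waits on its direct child only. The `pidfile` scratch file and the `sleep 30` stand-in for a long-running service are hypothetical. The demo sends SIGTERM rather than the SIGINT from the config above, because background jobs started from a non-interactive shell ignore SIGINT.

```shell
#!/bin/bash
# Sketch: a supervisor that waits only for its direct child (assumed setup,
# not envconsul's implementation).

pidfile=$(mktemp)   # hypothetical scratch file used to pass the grandchild PID

# Direct child: forks a grandchild (sleep), records its PID, then waits.
# Inside the inline script, "$0" is the pidfile path passed as the first argument.
bash -c 'sleep 30 >/dev/null 2>&1 & echo $! > "$0"; wait' "$pidfile" &
child=$!

# Wait until the grandchild PID has been written.
while [ ! -s "$pidfile" ]; do sleep 0.1; done
grandchild=$(cat "$pidfile")

# Stop the direct child and block until it has exited, as a supervisor would.
# "|| true" because wait returns the child's non-zero (signal) exit status.
kill -TERM "$child"
wait "$child" 2>/dev/null || true

# The direct child is gone, but the grandchild was never signalled.
if kill -0 "$grandchild" 2>/dev/null; then
  result="grandchild still running"
else
  result="grandchild gone"
fi
echo "$result"

# Clean up the demo process and scratch file.
kill "$grandchild" 2>/dev/null || true
rm -f "$pidfile"
```

If the grandchild were a server holding a listening socket, it would still hold that socket after `wait` returned. That would match the "still bound to addresses" symptom in the original report, although nothing in the thread confirms this is what happened there.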

nvllsvm commented 2 years ago

Encountered this issue today and found a minimal way to reproduce it. https://gist.github.com/nvllsvm/2c0e0561a3e472c9a53ba3bcd3be21eb

eikenb commented 2 years ago

Thanks for the repro @nvllsvm! Marked this to be looked into for the next release (which I'll be starting work on after I finish with the current consul-esm bugfix work).

eikenb commented 2 years ago

I've taken a closer look at the repro example and it shows how a fork will trigger an early exit of the process. This behaves as it should and means that a fork/exec pattern can't trigger this issue.

This issue is built around the idea that the managed process wasn't being stopped before it was restarted, and I still don't think that is the bug. IMO the original report sounds like it might have been a case of the OS not releasing the port in time.

I'm going to close this, as I still see no problems with the behavior (no repro) and the code looks good.
