pongasoft / glu

Deployment Automation Platform
Apache License 2.0
518 stars 99 forks source link

agentctl.sh: Sanity check PID_FILE and remove at stop #278

Open sbohrer opened 9 years ago

sbohrer commented 9 years ago

PID files are inherently problematic because PIDs are reused. This means that at start up if we find a PID file and there is a running process with that PID we do nothing. Unfortunately it is possible that our process long since died, PIDs have wrapped and are now getting reused and the process we found is an unrelated process. The problem is even more likely at system start time if we find a PID file from the previous system boot. In this case the system is starting up and spawning many new processes and it is likely that the low numbered PID we received on the previous boot will be near other processes starting in parallel.

To help solve this problem we can remove stale PID files when we shut down cleanly. Additionally we can perform a basic sanity check if we find a PID file and find the PID running to compare that they at least have the command we expect. This commit performs both of these actions.

Additionally PID files are typically written to /var/run or /run which get cleared on reboots. I am not making that change since I do not know if there would be any ramifications to moving the location of the PID file.

Signed-off-by: Shawn Bohrer sbohrer@rgmadvisors.com

sodul commented 9 years ago

Yes, basename is pretty much available everywhere. I use it all the time. http://ss64.com/bash/basename.html

prep would be perfect to find a process id based on that command line argument. However at skyhighnetworks we always have 2 agents running, and I know others also have multiple agents per host: $ pgrep -f ' -Dorg.linkedin.app.name=org.linkedin.glu.agent-server' 10054 11944

The -f option makes it do a pattern match on the entire command line and it return the process ids, so there is no sed/awk/gymnastic. Unfortunately pgrep and pkill are not always present on less common unix versions, though I only have access to Linux and Mac OS X boxes where it is present.

I recommend changing this lines: if [[ "$PS_STAT" -eq "$PID" ]]; then to if [ "$PS_STAT" -eq "$PID" ]; then Since the double square brackets is a bash form. If you want to use the bash form you would rather use the more readable alternatives such as: if (( PS_STAT == PID )); then which does a real numeral test and does not require $, or: if [[ $PS_STAT == $PID ]]; then which does a text equal (quotes are optional when using double square brackets).

sbohrer commented 9 years ago

Since the double square brackets is a bash form.

I normally stay away from bashisms but I'll note this is actually a bash script (#!/bin/bash). I'm happy to change to whatever is believed to be the most readable.

ypujante commented 9 years ago

I don't have a preference personally. I will merge the changes with the next release of glu (unscheduled at this time).

ypujante commented 9 years ago

I merged your changes to the current version of glu thinking that it would work. Unfortunately it is not working. This is the output message during the shutdown of the tutorial:

### Stopping Agent...
Loading config [/Volumes/Vault/deployment/content/glu/org.linkedin.glu.packaging-all-5.5.6/tutorial/distributions/tutorial/agents/org.linkedin.glu.agent-server-agent-1-tutorialZooKeeperCluster-5.5.6/5.5.6/conf/pre_master_conf.sh]...
Loading config [/Volumes/Vault/deployment/content/glu/org.linkedin.glu.packaging-all-5.5.6/tutorial/distributions/tutorial/agents/org.linkedin.glu.agent-server-agent-1-tutorialZooKeeperCluster-5.5.6/5.5.6/conf/master_conf.sh]...
Status: PID file [/Volumes/Vault/deployment/content/glu/org.linkedin.glu.packaging-all-5.5.6/tutorial/distributions/tutorial/agents/org.linkedin.glu.agent-server-agent-1-tutorialZooKeeperCluster-5.5.6/data/logs/org.linkedin.glu.agent-server.pid] not found !!!

As you can see it says PID file not found. And of course the agent is not shutting down.

11234 AgentMain

I need to back out the merge and retest everything before I can release glu...