nook24 / statusengine

New PHP based MySQL Backend for Naemon and Nagios 4 + responsive web frontend
https://statusengine.org/
GNU General Public License v2.0
16 stars 8 forks source link

Worker/Childs are left behind if main app crashes #14

Closed dhoffend closed 8 years ago

dhoffend commented 8 years ago

When using the provided start script, several worker process are being started and forked into background without any real parent pid.

start-stop-daemon creates a pid file of the main php script which then gets closed after all worker processes are being created.

When calling the init script with the status parameter you get an error that statusengine is not running cause the pid file created by start-stop-daemon has already ended but the forked childs still live on.

When you try to stop statusengine it tells you that statusengine has already been stopped (the process to the pid file .... same as status).

dhoffend commented 8 years ago

Under certain conditions (for example reboot) the statusengine process comes up without parent pid / master process

See example here:

root      1520     1  0 17:11 ?        00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root      1521     1  0 17:11 ?        00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root      1522     1  0 17:11 ?        00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root      1523     1  0 17:11 ?        00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root      1524     1  0 17:11 ?        00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
dhoffend commented 8 years ago

Okay found a problem It seems if the mysql connection cannot be established, all the worker processes are already spawned but the main process crashes with ERROR

dhoffend commented 8 years ago

Okay, the bug is not the init scripts, it's the statusegine application leaving childs/worker behind if the main process exits

dhoffend commented 8 years ago

You can see here that statusengine is not running

# ps -ef | grep statusengine
root      4663  2063  0 17:56 pts/0    00:00:00 grep --color=auto statusengine

I'm shutting down the mysql-server and even delete the socket file (yes I know that's not nice).

# /etc/init.d/mysql stop
 * Stopping MySQL database server mysqld [ OK ]
# rm /var/run/mysqld/mysqld.sock

When we now start the statusengine application it will crash with an error that mysql connection can't be established. (Maybe there are also other cases where statusengine might crash on startup)

# /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w

Welcome to CakePHP v2.5.4 Console
---------------------------------------------------------------
App : app
Path: /opt/statusengine/cakephp/app/
---------------------------------------------------------------
Starting Statusengine version: 1.5.1...
THIS IS LEGACY MODE!
Error: Database connection "Mysql" is missing, or could not be created.
#0 /opt/statusengine/cakephp/lib/Cake/Model/Datasource/DboSource.php(260): Mysql->connect()
#1 /opt/statusengine/cakephp/lib/Cake/Model/ConnectionManager.php(105): DboSource->__construct(Array)
#2 /opt/statusengine/cakephp/lib/Cake/Model/Model.php(3524): ConnectionManager::getDataSource('legacy')
#3 /opt/statusengine/cakephp/lib/Cake/Model/Model.php(1136): Model->setDataSource('legacy')
#4 /opt/statusengine/cakephp/lib/Cake/Model/Model.php(3551): Model->setSource('instances')
#5 /opt/statusengine/cakephp/lib/Cake/Model/Model.php(1369): Model->getDataSource()
#6 /opt/statusengine/cakephp/lib/Cake/Model/Model.php(1553): Model->schema()
#7 /opt/statusengine/cakephp/app/Console/Command/StatusengineLegacyShell.php(2333): Model->create()
#8 /opt/statusengine/cakephp/app/Console/Command/StatusengineLegacyShell.php(2672): StatusengineLegacyShell->createInstance()
#9 /opt/statusengine/cakephp/app/Console/Command/StatusengineLegacyShell.php(210): StatusengineLegacyShell->forkWorker()
#10 /opt/statusengine/cakephp/lib/Cake/Console/Shell.php(440): StatusengineLegacyShell->main()
#11 /opt/statusengine/cakephp/lib/Cake/Console/ShellDispatcher.php(209): Shell->runCommand('-w', Array)
#12 /opt/statusengine/cakephp/lib/Cake/Console/ShellDispatcher.php(66): ShellDispatcher->dispatch()
#13 /opt/statusengine/cakephp/app/Console/cake.php(47): ShellDispatcher::run(Array)
#14 {main}

After the main script has crashed you can check your process list and find many orphanded statusengine processes without parent-pid (aka master) idling around.

# ps -ef | grep -v grep | grep statusengine
root      5013     1  0 17:58 pts/0    00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root      5014     1  0 17:58 pts/0    00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root      5015     1  0 17:58 pts/0    00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root      5016     1  0 17:58 pts/0    00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root      5017     1  0 17:58 pts/0    00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
nook24 commented 8 years ago

Yes thats right, if the parent process of statusengine crashs, the child processes are orphanded. You need to kill them manually :/

It looks like, that your MySQL credentials are wrong in https://github.com/nook24/statusengine/blob/master/cakephp/app/Config/database.php#L78-L86 Have you tried, if your are able to login with the given username and password?

mysql -u statusengine -p
Enter password:
mysql> use naemon
Database changed
mysql> select current_state from naemon_servicestatus LIMIT 0, 1;
+---------------+
| current_state |
+---------------+
|             3 |
+---------------+
1 row in set (0.00 sec)

mysql>

If this is working, your user has the permissions for the database. May be you are using nagios instead of naemon as database name...

Please make sure, that you are using the current database schema: https://github.com/nook24/statusengine/blob/master/cakephp/app/Plugin/Legacy/Config/Schema/legacy_schema_innodb.php You can update/create your database for Statusengine by executing:

/opt/statusengine/cakephp/app/Console/cake schema update --plugin Legacy --file legacy_schema_innodb.php --connection legacy

To debug Statusengine, you should run it in foreground mode using:

/opt/statusengine/cakephp/app/Console/cake statusengine_legacy -w

I guess you did this to get the error message you pasted above :)

At the moment Statusengine only silently catches the MySQL error "General error: 2006 MySQL server has gone away" https://github.com/nook24/statusengine/blob/master/cakephp/app/Plugin/Legacy/Model/LegacyAppModel.php#L198-L203 to reconnect if you may be restart your MySQL server. On every other MySQL error Statusengine will crash :(

Please check that there are no orphanded child processes are running on your system if you fire up Statusengine

dhoffend commented 8 years ago

In this case it was a cluster configuration where statusengine start/stop/status is controlled via pacemaker. And we found a situation where orphaned statusengine childs were left behind.

Since the init-script doesn't do any additional checks for orphaned childs and the childs themself don't care if the master connection is lost I opened this issue.

IMO the childs should be killed before a hard exit (catched exception) or the childs should kill themself if the connection to it's master process is lost. Just leaving them behind can create additional problems.

nook24 commented 8 years ago

Since the init-script doesn't do any additional checks for orphaned childs and the childs themself don't care if the master connection is lost I opened this issue.

Thats true, let me check if i can hack a quick fix for you, sounds not hard to check if the parent process is dead or not...

dhoffend commented 8 years ago

Another case would be during startup where the scripts are started in the wrong way (statusengine before MySQL).

nook24 commented 8 years ago

I patched Statusengine right now and it looks not to bad for the first try:

root@statusengine-demo:/opt/statusengine/cakephp/app# service statusengine start
Starting Statusengine

Check that everything is running and happy:

root@statusengine-demo:/opt/statusengine/cakephp/app# ps -eaf |grep status
root     10689 10656  0 18:37 pts/4    00:00:00 tailf /var/log/statusengine.log
root     10835 10707  0 18:37 pts/5    00:00:00 watch gearadmin --status
root     11266     1  7 18:40 ?        00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root     11267 11266  2 18:40 ?        00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root     11268 11266  1 18:40 ?        00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root     11269 11266  1 18:40 ?        00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root     11270 11266  1 18:40 ?        00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root     11271 11266  0 18:40 ?        00:00:00 /usr/bin/php5 -q /opt/statusengine/cakephp/app/Console/cake.php -working /opt/statusengine/cakephp/app statusengine_legacy -w
root     11279 10077  0 18:40 pts/1    00:00:00 grep --color=auto status

Kill the parent process and see what happens:

root@statusengine-demo:/opt/statusengine/cakephp/app# kill -9 11266; sleep 1; ps -eaf  |grep status
root     10689 10656  0 18:37 pts/4    00:00:00 tailf /var/log/statusengine.log
root     10835 10707  0 18:37 pts/5    00:00:00 watch gearadmin --status
root     11319 10077  0 18:40 pts/1    00:00:00 grep --color=auto status

/var/log/statusengine.log reports now:

[11270] My parent process is gone I guess I am orphaned and will exit now!
[11268] My parent process is gone I guess I am orphaned and will exit now!
[11271] My parent process is gone I guess I am orphaned and will exit now!
[11269] My parent process is gone I guess I am orphaned and will exit now!
[11267] My parent process is gone I guess I am orphaned and will exit now!

Check the init script:

root@statusengine-demo:/opt/statusengine/cakephp/app# service statusengine status
No /usr/bin/php5 found running; none killed.
Statusengine is not running

You can test this on your environment if you checkout the branch: https://github.com/nook24/statusengine/tree/issue_14

Unfortunately my init script is not the best and need to be ported to systemd i guess... But pacemaker should launch all your processes, not SysVinit.

dhoffend commented 8 years ago

Looks good so far. Now I just have to get the right starting order so it doesn't crash on startup.

Btw. The fact that the master process has died is not visible in the logfile. You only see this if you start it manually. If you start it via init script and the process dies due to unavailable database connectivity the start-stop-daemon doesn't say anything or writes anything to the log.

I would suggest if you don't catch the exception to do something with it that at least stuff gets logged so it can be reviewed if something happens.

nook24 commented 8 years ago

I add the boot order to the documentation: https://statusengine.org/documentation.php#boot-order

I will check if i can add the errors to the logfile...

nook24 commented 8 years ago

@dhoffend can you please checkout my last commit SHA: 1e7b44b27fbf226652dd32097bf4fa21f241a5fd ? I guess this is what you are looking for.

dhoffend commented 8 years ago

Looks better, but now the error output on the commandline is missing. Since it's a critical error it should be printed to STDERR.

Here's the log output

exception 'MissingConnectionException' with message 'Database connection "Mysql" is missing, or could not be created.' in /opt/statusengine/cakephp/lib/Cake/Model/Datasource/Database/Mysql.php:191
Stack trace:
#0 /opt/statusengine/cakephp/lib/Cake/Model/Datasource/DboSource.php(260): Mysql->connect()
#1 /opt/statusengine/cakephp/lib/Cake/Model/ConnectionManager.php(105): DboSource->__construct(Array)
#2 /opt/statusengine/cakephp/lib/Cake/Model/Model.php(3524): ConnectionManager::getDataSource('legacy')
#3 /opt/statusengine/cakephp/lib/Cake/Model/Model.php(1136): Model->setDataSource('legacy')
#4 /opt/statusengine/cakephp/lib/Cake/Model/Model.php(3551): Model->setSource('instances')
#5 /opt/statusengine/cakephp/lib/Cake/Model/Model.php(1369): Model->getDataSource()
#6 /opt/statusengine/cakephp/lib/Cake/Model/Model.php(1553): Model->schema()
#7 /opt/statusengine/cakephp/app/Console/Command/StatusengineLegacyShell.php(2331): Model->create()
#8 /opt/statusengine/cakephp/app/Console/Command/StatusengineLegacyShell.php(2670): StatusengineLegacyShell->createInstance()
#9 /opt/statusengine/cakephp/app/Console/Command/StatusengineLegacyShell.php(208): StatusengineLegacyShell->forkWorker()
#10 /opt/statusengine/cakephp/lib/Cake/Console/Shell.php(440): StatusengineLegacyShell->main()
#11 /opt/statusengine/cakephp/lib/Cake/Console/ShellDispatcher.php(209): Shell->runCommand('-w', Array)
#12 /opt/statusengine/cakephp/lib/Cake/Console/ShellDispatcher.php(66): ShellDispatcher->dispatch()
#13 /opt/statusengine/cakephp/app/Console/cake.php(47): ShellDispatcher::run(Array)
#14 {main}[5114] My parent process is gone I guess I am orphaned and will exit now!
[5115] My parent process is gone I guess I am orphaned and will exit now!
[5116] My parent process is gone I guess I am orphaned and will exit now!
[5117] My parent process is gone I guess I am orphaned and will exit now!
[5118] My parent process is gone I guess I am orphaned and will exit now!

I'm wondering why do you try to create a new log handler/resource? Can't you use your Console/Command/Task/LogfileTask.php logger?

For logging stuff I'm opening a different issue with other things in mind.

nook24 commented 8 years ago

I'm wondering why do you try to create a new log handler/resource? Can't you use your Console/Command/Task/LogfileTask.php logger?

My plan is to remove this class...

If you want i can enable stderr output as well

nook24 commented 8 years ago

Merged into new Version 1.6.0