smtc0097 / mysql-cacti-templates

Automatically exported from code.google.com/p/mysql-cacti-templates
GNU General Public License v2.0

Cacti poller stalls if ssh process stalls #162

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I'm not entirely sure if this is a "Better Cacti Graphs" problem or a generic 
Cacti problem.  Maybe it's both.

We are occasionally seeing an ssh process stalling on the Cacti box.  It looks 
like this:

ssh -q -o ConnectTimeout 10 -o StrictHostKeyChecking no cactiuser@192.168.1.1 
-p 22 -i /usr/local/etc/cacti/id_rsa wget   -U Cacti/1.0 -q -O - -T 5 
"http://localhost/server-status?auto"

The memcached check on this system is currently stalled as well.  I've seen 
these processes stalled for up to an hour before I caught them.  Presumably 
they would have stayed stalled longer if left alone.

The server it's trying to check does not currently have an appropriate ssh key, 
so these queries haven't been working anyway.  If I run that command myself, it 
prompts me for a password, but the prompt can't be the cause or the stall would 
happen every time.  I have only ever seen it happen with this server.

It doesn't happen immediately or reliably, but once the ssh process has 
stalled, the poller also stalls every time it reaches that host and doesn't 
process any hosts after it.  This means that the graphs for all hosts created 
after that one (i.e., those with a higher Host ID) stop updating, and an extra 
couple of processes pile up on the Cacti machine every 5 minutes.

The ss_get_by_ssh.php script that spawned the stalled ssh process has a write 
lock on the cache file and the subsequent ones have it open for writing but 
with no write lock.

My suspicion is that this is the reason for the poller stalling.  Cacti has no 
timeout for local scripts and ss_get_by_ssh.php has no timeout for getting a 
write lock on the cache file.
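
One way to avoid waiting forever on the cache file is to poll for the lock in 
non-blocking mode and give up after a deadline.  The following is only a 
minimal sketch, not code from ss_get_by_ssh.php: the helper name 
lock_with_timeout, the 10-second limit and the cache file path are assumptions 
made for illustration, and LOCK_NB behaviour should be checked on each platform 
the script has to support.

```php
<?php
// Hypothetical helper, not part of ss_get_by_ssh.php: try to take an
// exclusive lock, giving up after $timeout seconds instead of blocking.
function lock_with_timeout($handle, $timeout) {
    $deadline = microtime(true) + $timeout;
    do {
        // LOCK_NB makes flock() return immediately if the lock is held.
        if (flock($handle, LOCK_EX | LOCK_NB)) {
            return true;
        }
        usleep(100000); // wait 0.1 s before retrying
    } while (microtime(true) < $deadline);
    return false;
}

// Example use with an assumed cache file path and an assumed 10 s limit.
$cache_file = sys_get_temp_dir() . '/example_cacti_stats.txt';
$fp = fopen($cache_file, 'a+');
if ($fp !== false && lock_with_timeout($fp, 10)) {
    // ... read or refresh the cache here ...
    flock($fp, LOCK_UN);
}
if ($fp !== false) {
    fclose($fp);
}
```

If the lock cannot be obtained in time, the caller could skip the cache and 
return no data, so that the poller moves on to the next host instead of 
blocking behind the stalled process.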

Killing the ssh process (or all of them if multiple have stalled) starts 
everything working again.
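
Since killing the stalled process is what unblocks the poller, the same effect 
could in principle be had by giving the child command a hard deadline.  The 
sketch below assumes a hypothetical helper, run_with_deadline, which is not 
part of ss_get_by_ssh.php or Cacti.  It starts a command with proc_open(), 
polls it, and sends SIGKILL once the assumed time limit has passed.  The 
"exec" prefix is there so that the signal reaches the command itself rather 
than an intermediate shell; this only holds for a single simple command, not a 
pipeline.

```php
<?php
// Hypothetical helper: run $cmd and kill it if it has not finished
// within $timeout seconds. Returns the captured stdout and a flag.
function run_with_deadline($cmd, $timeout) {
    $spec = array(
        0 => array('file', '/dev/null', 'r'),  // no stdin, so no prompts
        1 => array('pipe', 'w'),               // capture stdout
        2 => array('file', '/dev/null', 'w'),  // discard stderr
    );
    $proc = proc_open('exec ' . $cmd, $spec, $pipes);
    if (!is_resource($proc)) {
        return null;
    }
    stream_set_blocking($pipes[1], false);
    $output    = '';
    $timed_out = false;
    $deadline  = microtime(true) + $timeout;
    while (true) {
        $status  = proc_get_status($proc);
        $output .= stream_get_contents($pipes[1]); // drain what is available
        if (!$status['running']) {
            break;
        }
        if (microtime(true) >= $deadline) {
            proc_terminate($proc, 9); // SIGKILL
            $timed_out = true;
            break;
        }
        usleep(100000); // poll every 0.1 s
    }
    fclose($pipes[1]);
    proc_close($proc);
    return array('output' => $output, 'timed_out' => $timed_out);
}

// Example use with a harmless command and an assumed 5 s limit.
$result = run_with_deadline('echo hello', 5);
```

A caller could treat a timed-out result like any other failed fetch, which 
would also let it release the lock on the cache file promptly.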

Reproducing the exact problem is difficult.  I can't even reliably manage it on 
the systems we have here.  I just have to wait until it happens.  Creating a 
Cacti setup using ss_get_by_ssh.php with a host with no SSH key may work.

Reproducing something that looks like this issue is easy.  I created a simple 
PHP script that opened one of the cache files with a write lock.

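<?php
  // Hold an exclusive lock on a cache file to simulate the stalled script.
  $handle = fopen("/tmp/192.168.1.1_apache_localhost__cacti_stats.txt", "r+");
  if ($handle === false) {
    die("Cannot open cache file\n");
  }
  flock($handle, LOCK_EX);
  sleep(1800);  // 30 minutes, long enough to span several poller runs
  flock($handle, LOCK_UN);
  fclose($handle);
?>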

Run this and wait for the poller to run from cron.

Replacing wget (the command the check runs) on a target system with a script 
that sleeps for 1800 seconds would also work.

Original issue reported on code.google.com by ladadad...@gmail.com on 25 Oct 2010 at 4:27

GoogleCodeExporter commented 8 years ago
I'm not sure how to address this.  Have you learned anything more about the 
problem?

I don't think that a timeout on the lock call is supported everywhere.

Original comment by baron.schwartz on 15 Jan 2011 at 6:04