olafz / percona-clustercheck

Script to make a proxy (ie HAProxy) capable of monitoring Percona XtraDB Cluster nodes properly. The clustercheck script is distributed under the BSD license.

Add flock to prevent concurrent clustercheck runs using up connections #11

Open tomgidden opened 9 years ago

tomgidden commented 9 years ago

When one of our nodes got a bit tied up due to a disk space issue, clustercheck processes started filling up the ps list, each waiting on a mysql query.

I've wrapped the whole routine using this advice from flock(1):

       (
         flock -n 9 || exit 1
         # ... commands executed under lock ...
       ) 9>/var/lock/mylockfile

As a result of this extra nesting, the majority of the file has been indented.

I've also pulled the HTTP responses out into functions to avoid repetition. The Content-Length calculations might be slightly off, as I'm not sure whether all the \r\ns should be counted, so it just uses the length of the body string.
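
Roughly, a response helper looks like this (a sketch rather than the exact code in the diff; the names here are illustrative). It sidesteps the \r\n question by measuring the body string directly, which is why the header count may be slightly off:

    # Hypothetical helper: emit a complete HTTP response, computing
    # Content-Length from the length of the body string rather than
    # counting \r\n sequences by hand (the trailing \r\n after the body
    # is not included, hence the "slightly off" caveat above).
    http_response () {
        local status="$1"
        local body="$2"
        printf 'HTTP/1.1 %s\r\n' "$status"
        printf 'Content-Type: text/plain\r\n'
        printf 'Content-Length: %s\r\n' "${#body}"
        printf 'Connection: close\r\n'
        printf '\r\n%s\r\n' "$body"
    }

    # Usage, roughly:
    #   http_response "200 OK" "Percona XtraDB Cluster Node is synced."
    #   http_response "503 Service Unavailable" "Percona XtraDB Cluster Node is not synced."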

olafz commented 9 years ago

Thanks for the pull request. I have a question, though: did you see many clustercheck processes? There is already a 10-second timeout on the execution of the mysql command, after which it exits.

If the problem is the ps list filling up, this won't solve it:

    flock -w $TIMEOUT 9 || report_fail "clustercheck is blocked up."

With or without this change, no clustercheck process should ever run for more than 10 seconds. Instead of waiting on the mysql command, it will now wait on a file lock. But the ps list still grows?
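
Just so we're looking at the same thing: the way I read the patch, that line sits inside the flock(1) subshell roughly like this (a sketch, not the exact diff; the lock file path is just an example):

    TIMEOUT=10
    (
        # Wait up to $TIMEOUT seconds for the lock, then give up instead
        # of letting clustercheck processes pile up.
        flock -w $TIMEOUT 9 || report_fail "clustercheck is blocked up."
        # ... the usual mysql check runs here, under the lock ...
    ) 9>/var/lock/clustercheck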

tomgidden commented 9 years ago

As this was a production cluster, I was in a bit of a rush and didn't stop to investigate this incidental flaw, but there were ten to twenty clustercheck processes in ps, apparently blocked on mysql commands, and they had been there a lot longer than ten seconds. At the time it was possible to connect to mysqld, but any query -- even a simple SHOW STATUS LIKE ... -- would block. Now, the fact that the node was so messed up that it blocked on such a straightforward query is a different matter entirely ;)

To be clear, the mysql commands were connecting to mysqld successfully (and instantly), so --connect-timeout was not relevant. And there was no query timeout set on those calls by default... which, admittedly, is another problem!
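
For example, one way to bound the whole query rather than just the connection would be to wrap the mysql call in coreutils timeout -- just a sketch of the idea, not something in this pull request, and the variable names are illustrative:

    # --connect-timeout only limits the connection phase; wrapping the
    # client in coreutils timeout also kills it if the query itself hangs.
    # (Credentials omitted for brevity.)
    TIMEOUT=10
    WSREP_STATUS=$(timeout $TIMEOUT mysql --connect-timeout=$TIMEOUT -nNE \
        -e "SHOW STATUS LIKE 'wsrep_local_state';" 2>/dev/null | tail -1)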

Suffice it to say, it was a fairly screwed-up situation that shouldn't have happened, but the real problem here was all those clustercheck-launched mysql clients blocking at once.

Anyway, flock -w $TIMEOUT with TIMEOUT set to 10 means a clustercheck process will wait on the flock call for up to ten seconds, then exit 1 if it fails to acquire the lock. In this scenario I'd still have one clustercheck blocking on the query, but at least I wouldn't be getting "Too many connections" from clusterchecks alone.