Closed fmgdias closed 5 years ago
Hi @fmgdias, it looks like the code that processes these arguments is simply missing. https://github.com/statusengine/module/blob/ba0fafd95ac802d002af21237bead6548feced36/src/statusengine.c#L583-L591 I will check this and come back to you.
> Is it possible to configure round-robin balancing too?
At the moment, no. Did you run into any performance issues?
I have a very large installation to deploy, with approximately 500,000 service checks.
So I ran a performance test with blobslap_client (the gearmand benchmark). I ended up bringing gearmand down when the queue hit 1,000,000 jobs. Strangely, beyond that limit the server starts consuming memory without bound, reaching 40 GB of RAM. The performance test showed a limit of 5,000 jobs/s when producing (set) and 16,000 jobs/s when consuming (get), with the timeout increasing beyond 250 ms.
I performed some balancing tests with HAProxy (L4 TCP) and set a 5 ms timeout on the client (gearman_client_set_timeout). It worked fine, because if the client aborted, HAProxy connected it to another gearmand on another port the next time.
My idea is to dispatch the jobs to 10 gearmand servers (random round robin) and connect all the workers to those 10 servers.
For this, I would need to configure the broker with multiple gearmand servers, and with the gearman_client_set_timeout option too, similar to what the original mod_gearman offers.
# Client
cd /tmp/gearmand-1.1.18/benchmark/ ;
/usr/bin/time ./blobslap_client -b -m 1024 -M 1024 -c 10 -n 100000 &
/usr/bin/time ./blobslap_client -b -m 1024 -M 1024 -c 10 -n 100000 &
/usr/bin/time ./blobslap_client -b -m 1024 -M 1024 -c 10 -n 100000 &
# Worker
gearman -w -f gb > /dev/null 2> /dev/null ;
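The dispatch idea described in this comment (pick one of several gearmand servers at random, give it a short timeout, and move on to another server if it is slow) can be sketched roughly as follows. This is only an illustration in Python, not broker code: `dispatch`, `DispatchError` and the `send` callable are hypothetical names, and `send` stands in for whatever client call actually submits the job.

```python
import random


class DispatchError(Exception):
    """Raised when no server accepted the job."""


def dispatch(job, servers, send, timeout=0.005, rng=random):
    """Try servers in random order and return the one that accepted the job.

    `send(server, job, timeout)` is a hypothetical callable standing in for
    the real client submit call. It is assumed to raise TimeoutError or
    ConnectionError when the server is slow or unreachable.
    """
    candidates = list(servers)
    rng.shuffle(candidates)  # random order spreads load across all servers
    for server in candidates:
        try:
            send(server, job, timeout)
            return server
        except (TimeoutError, ConnectionError):
            continue  # slow or unreachable: try the next server
    raise DispatchError("no gearmand server accepted the job")
```

With a 5 ms timeout, a server that is slow but not completely down is skipped for that job only. It remains a candidate for the next one.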
500k service checks - sounds like a serious project.
> The performance test showed a limit of 5,000 jobs/s when producing (set) and 16,000 jobs/s when consuming (get), with the timeout increasing beyond 250 ms.
We ran similar tests and measured how long it took Statusengine Worker to store 1,000,000 records in the database. We also noticed some performance issues with gearmand (read/get times that were too long), so we decided to use bulk messages in the queue. At the moment the bulk feature is still in a development branch, mainly because I don't have much time to test it. I think you should definitely go with the bulk-messages version of the Broker. The Statusengine Worker is already able to consume bulk messages, so you just need to swap out the Broker Module.
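To illustrate why bulk messages help: packing many records into one queue message cuts the number of gearmand set/get operations by roughly the bulk size. The sketch below shows the general idea in Python. The JSON envelope with a `messages` key and the function names are assumptions made for illustration, not necessarily the format the Statusengine bulk branch uses.

```python
import json


def make_bulk_messages(records, bulk_size=100):
    """Yield JSON strings, each carrying up to `bulk_size` records.

    The {"messages": [...]} envelope is an assumed format for this sketch.
    """
    if bulk_size < 1:
        raise ValueError("bulk_size must be at least 1")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == bulk_size:
            yield json.dumps({"messages": batch})
            batch = []
    if batch:  # flush the final, partially filled batch
        yield json.dumps({"messages": batch})


def read_bulk_message(payload):
    """Return the list of records contained in one bulk message."""
    return json.loads(payload)["messages"]
```

With a bulk size of 100, 1,000,000 records become 10,000 queue messages instead of 1,000,000 individual jobs.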
I would also recommend splitting your system into a multi-node cluster: something like 3 or 5 systems, each running a Nagios/Naemon Core, a Gearman job server, Statusengine Worker and a CrateDB database. Otherwise I guess you will struggle with memory leaks, slow restarts of the Naemon core and so on. What are your thoughts on this?
> I performed some balancing tests with HAProxy (L4 TCP) and set a 5 ms timeout on the client (gearman_client_set_timeout). It worked fine, because if the client aborted, HAProxy connected it to another gearmand on another port the next time.
If you can use HAProxy for load balancing across multiple gearman servers, does the broker then still need a round-robin feature, or just gearman_client_set_timeout?
I will fix the gearman_server_list issue and check how hard it will be to add a random round robin mode as well.
Nice to know: We are also working on a new broker module, which will use RabbitMQ. https://github.com/statusengine/broker Unfortunately it is not ready for production yet, but it may also be interesting for you.
gearman_server_list works now. I fixed this in the bulk branch (commit acfd9abc228db71f34d450db62d056f74d6d1e7b). It was just one line of code, so you can easily backport it to the version you currently use if you want.
This is the output of my tests: https://cloud.nook24.eu/index.php/s/crx3f8MY7cwKeLW
The patch you committed was great (HAProxy will no longer be needed). I think it's also important to add an option for gearman_client_set_timeout, because the server may be slow but not totally unavailable.
Now I have three questions:
1) What kind of feedback does the broker deliver to Nagios after a check is performed? I ask because if Nagios switches from one gearmand server to another, the result of a check may sit on a gearmand server that Nagios is not reading. Or is Nagios able to read from two gearmand servers simultaneously?
2) I noticed that even when gearmand is offline, Nagios continues to run host checks. I imagine this is being done by its local worker. Is it possible to disable this in Nagios? I would like the checks to be done exclusively by the worker. (Or did I misunderstand?)
3) Before this project I did not know CrateDB, and I am uncomfortable using it (for lack of knowledge on my part). I was thinking of using KairosDB (with ScyllaDB), or even InfluxDB. Is there any good reason to use CrateDB?
Thank you very much for the support. I'm hoping for the success of this project.
PS1: After that, you can close this issue
PS2: I have designed an SNMP collector, outside of Nagios, which can run up to 150,000 snmpget operations per second using only 16 CPU, 8 RAM and no database; the results are written to a text file. I think I will soon publish an article about it. Do you think it could be useful in your project?
I can add gearman_client_set_timeout as well.
However, I guess you are mixing up the Statusengine project with mod_gearman.
Statusengine is a project to store Nagios and Naemon events to different database backends. At the moment it supports MySQL, CrateDB and Redis for event data. Performance data (time series) can be stored to Graphite, CrateDB, MySQL and Elasticsearch. Statusengine has nothing to do with the execution of host or service checks.
To distribute the workload of huge Nagios/Naemon installations (and the workload on the database), Statusengine can be distributed across multiple nodes, as I mentioned earlier. See also: https://statusengine.org/getting_started/#overview
mod_gearman, on the other hand, is a tool to distribute the execution of host and service checks across different nodes. It helps to distribute the workload of your Naemon/Nagios server. See also: https://labs.consol.de/de/nagios/mod-gearman/index.html
Both use gearmand as an in-memory queue. Statusengine Broker exports Naemon events to gearmand (Statusengine Worker consumes these). mod_gearman exports Naemon checks that need to be executed to the gearmand server (mod_gearman_worker consumes these).
In addition, both projects can be used at the same time. So you can use mod_gearman to spread the workload of check execution and still use Statusengine to store the results (or only the time series data for Grafana) in a database. (I also set up systems like this from time to time.)
At the moment Statusengine does not support KairosDB, ScyllaDB or InfluxDB. The code architecture makes it easy to add new database backends, but I have personally had no use case for adding more databases.
> PS2: I have designed an SNMP collector, outside of Nagios, which can run up to 150,000 snmpget operations per second using only 16 CPU, 8 RAM and no database; the results are written to a text file. I think I will soon publish an article about it. Do you think it could be useful in your project?
Of course! I like to read and talk about cool Nagios setups.
Thanks, man!
Sorry to comment on a closed thread... it just happens to address exactly the info I was after regarding mod_gearman:
> both projects can be used at the same time. So you can use mod_gearman to spread the workload of check execution and still use Statusengine to store the results (or only the time series data for Grafana) in a database. (I also set up systems like this from time to time.)
Are there any special considerations specific to setting up mod_gearman alongside Statusengine, or are the standard [documented] mod_gearman approaches fine?
@bshaw
> Are there any special considerations specific to setting up mod_gearman alongside Statusengine, or are the standard [documented] mod_gearman approaches fine?
Just go with the default documentation.
I usually build mod_gearman from source, but you can also use the provided packages.
If you build mod_gearman from source, you need to install libnaemon and naemon-dev first.
dpkg -i libnaemon_1.0.9.ubuntu18.04.amd64.deb naemon-dev_1.0.9.ubuntu18.04.amd64.deb
You can download the required packages from the Naemon project page: http://www.naemon.org/download/ Even though this is not a Statusengine-related topic, I will try to help you with further questions.
Awesome, thanks! I think I have the mod_gearman stuff under control but will ping you (in a separate thread) if I run into any snags.
Correct me if I'm wrong, but the option "gearman_server_list" is not implemented? The manual says that it is used for failover. Is it possible to configure round-robin balancing too? It would be interesting to have these three functions:
failover: gearman_server_list = (active/passive)
duplicate: gearman_dup_server_list = (fan-out, send to all)
round-robin: gearman_balance_server_list = (random)
Source: https://statusengine.org/broker/#broker-options
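The difference between the three proposed functions can be made concrete with a small Python sketch. It is purely illustrative: only gearman_server_list is described in the manual, while the other two option names are the poster's proposals. `plan_delivery` and its mode names are hypothetical.

```python
import random


def plan_delivery(mode, servers, rng=random):
    """Return (ordered_servers, send_to_all) for one job.

    mode "failover"  ~ gearman_server_list          (active/passive)
    mode "duplicate" ~ gearman_dup_server_list      (proposed: fan-out)
    mode "balance"   ~ gearman_balance_server_list  (proposed: random)
    """
    ordered = list(servers)
    if mode == "failover":
        # Try servers in configured order and stop at the first success.
        return ordered, False
    if mode == "duplicate":
        # Deliver a copy of the job to every server.
        return ordered, True
    if mode == "balance":
        # Random first choice; the remaining servers act as fallbacks.
        rng.shuffle(ordered)
        return ordered, False
    raise ValueError(f"unknown mode: {mode!r}")
```

A caller would walk `ordered_servers` and either stop at the first server that accepts the job (`send_to_all` is false) or submit to all of them (`send_to_all` is true).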