The option gearman_server_list, Not Working !

fmgdias commented 5 years ago

Correct me if I'm wrong, but the option "gearman_server_list" does not seem to be implemented. The manual says it is used for failover. Is it possible that the configuration is through round robin balancing too? It would be interesting to have these three functions:

failover: gearman_server_list (active/passive)
duplicate: gearman_dup_server_list (fan-out, send to all)
round-robin: gearman_balance_server_list (random)

Source: https://statusengine.org/broker/#broker-options

gearman_server_list: 
A list of Gearman-Job-Servers separated by comma as failover servers.
gearman_server_list=127.0.0.1:4730,192.168.10.5:4730

gearman_dup_server_list: 
A list of Gearman-Job-Servers separated by comma. 
All records will be pushed to all servers. 
gearman_dup_server_list=127.0.0.1:4730,192.168.10.5:4730
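
As a rough illustration of how an option value like the ones quoted above could be turned into host/port pairs, here is a minimal sketch in Python. It is not the module's actual C code; the function name `parse_server_list` and the fallback to port 4730 for entries without a port are assumptions made for this example.

```python
from typing import List, Tuple

# Assumption for this sketch: entries without an explicit port fall back to 4730,
# the port used in the documentation examples above.
DEFAULT_PORT = 4730


def parse_server_list(value: str) -> List[Tuple[str, int]]:
    """Split 'host:port,host:port' into a list of (host, port) tuples."""
    servers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas and surrounding whitespace
        host, sep, port = entry.partition(":")
        servers.append((host, int(port) if sep else DEFAULT_PORT))
    return servers


print(parse_server_list("127.0.0.1:4730,192.168.10.5:4730"))
```

The sketch only handles hostnames and IPv4 addresses; IPv6 literals contain colons and would need different handling.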

nook24 commented 5 years ago

Hi @fmgdias, it looks like the code that processes this argument is simply missing: https://github.com/statusengine/module/blob/ba0fafd95ac802d002af21237bead6548feced36/src/statusengine.c#L583-L591 I will check this and get back to you.

Is it possible that the configuration is through round robin balancing too?

Not at the moment. Did you run into any performance issues?

fmgdias commented 5 years ago

I have a very large installation to deploy, with approximately 500,000 service checks.

So I ran a performance test with blobslap_client (the gearmand benchmark). I ended up bringing gearmand down when the queue hit 1,000,000 jobs. It's strange, because past that limit the server starts consuming memory without bound, reaching 40 GB of RAM. The performance test showed a limit of 5,000 jobs/s in production (set) and 16,000 jobs/s in consumption (get), increasing the timeout beyond 250 ms.

I performed some balancing tests with HAProxy (L4 TCP) and set a 5 ms timeout on the client (gearman_client_set_timeout). It worked fine, because if the client aborted, HAProxy connected the next attempt to another gearmand on another port.

My idea is to dispatch the jobs across 10 gearmands (random round robin) and connect all the workers to those 10 gearmands.

For this, I would need to configure the broker with multiple gearmand servers, plus a gearman_client_set_timeout option, similar to what the original mod_gearman offers.
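
The idea of spreading jobs randomly across several gearmands with a short client timeout can be sketched as follows. This is a minimal Python illustration, not broker code: `dispatch` and the `send` callback are hypothetical names, and `send` stands in for whatever actually submits a job to one gearmand and raises on timeout.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

Server = Tuple[str, int]
# Hypothetical transport hook: returns True if the server accepted the job and
# raises TimeoutError/OSError if the server was too slow or unreachable.
SendFn = Callable[[Server, bytes, float], bool]


def dispatch(job: bytes,
             servers: Sequence[Server],
             send: SendFn,
             timeout_s: float = 0.005,  # the 5 ms client timeout mentioned above
             rng: Optional[random.Random] = None) -> Optional[Server]:
    """Try the servers in random order; return the first one that accepts the job."""
    if rng is None:
        rng = random.Random()
    candidates = list(servers)
    rng.shuffle(candidates)  # random order per job, not a strict rotation
    for server in candidates:
        try:
            if send(server, job, timeout_s):
                return server
        except OSError:  # TimeoutError is a subclass of OSError
            continue  # slow or dead server: move on to the next candidate
    return None  # no server accepted the job


# Demo with a simulated transport: two of the ten gearmands are "slow".
servers = [("127.0.0.1", 4730 + i) for i in range(10)]
down = {servers[0], servers[1]}


def fake_send(server: Server, job: bytes, timeout_s: float) -> bool:
    if server in down:
        raise TimeoutError("simulated slow gearmand")
    return True


print(dispatch(b"payload", servers, fake_send))
```

Shuffling per job spreads the load roughly evenly over time and skips slow servers, which is the behaviour HAProxy provided in the test described above.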

# Client: three parallel blobslap_client runs
cd /tmp/gearmand-1.1.18/benchmark/
/usr/bin/time ./blobslap_client -b -m 1024 -M 1024 -c 10 -n 100000 &
/usr/bin/time ./blobslap_client -b -m 1024 -M 1024 -c 10 -n 100000 &
/usr/bin/time ./blobslap_client -b -m 1024 -M 1024 -c 10 -n 100000 &
# Worker: consume jobs for function "gb" and discard all output
gearman -w -f gb > /dev/null 2>&1
nook24 commented 5 years ago

500k service checks - sounds like a serious project.

The performance test showed a limit of 5,000 jobs/s in production (set) and 16,000 jobs/s in consumption (get), increasing the timeout beyond 250 ms.

We ran similar tests and measured how long it took Statusengine Worker to store 1,000,000 records in the database. We also noticed some performance issues with gearmand (too long read/get times), so we decided to use bulk messages in the queue. At the moment the bulk feature is still in a development branch, basically because I don't have much time to test it. I think you should definitely go with the bulk-messages version of the broker. The Statusengine Worker is able to consume bulk messages, so you just need to swap out the Broker Module.
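
To illustrate why bulk messages reduce the number of queue operations, here is a minimal Python sketch that groups individual records into batched payloads. The function name `bulk_payloads`, the batch size of 100 and the `{"messages": [...]}` JSON layout are assumptions for illustration only and may not match the wire format of the bulk branch.

```python
import json
from typing import Iterable, Iterator, List


def bulk_payloads(records: Iterable[dict], bulk_size: int = 100) -> Iterator[bytes]:
    """Group records into JSON payloads of at most bulk_size records each."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) >= bulk_size:
            # Assumed layout: one JSON object wrapping a list of records.
            yield json.dumps({"messages": batch}).encode("utf-8")
            batch = []
    if batch:  # flush the final, partially filled batch
        yield json.dumps({"messages": batch}).encode("utf-8")


# 250 hypothetical records become 3 queue jobs instead of 250.
payloads = list(bulk_payloads(({"id": i} for i in range(250)), bulk_size=100))
print(len(payloads))
```

With batching, the queue handles one job per batch rather than one per record, which is the effect the bulk feature is after.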

I would also recommend splitting your system into a multi-node cluster: something like 3 or 5 systems, each running a Nagios/Naemon Core, Gearman-Job-Server, Statusengine Worker and a CrateDB database. Otherwise I guess you will struggle with memory leak issues, slow restarts of the Naemon core and so on... What are your thoughts on this?

I performed some balancing tests with HAProxy (L4 TCP) and set a 5 ms timeout on the client (gearman_client_set_timeout). It worked fine, because if the client aborted, HAProxy connected the next attempt to another gearmand on another port.

If you can use HAProxy for load balancing across multiple gearman servers, does the broker then still need a round robin feature? Or just the gearman_client_set_timeout?

I will fix the gearman_server_list issue and check how hard it will be to add a random round robin mode as well.

Good to know: we are also working on a new broker module, which will use RabbitMQ: https://github.com/statusengine/broker Unfortunately it is not ready for production yet, but it may also be interesting for you.

nook24 commented 5 years ago

gearman_server_list works now. I fixed this in the bulk branch (acfd9abc228db71f34d450db62d056f74d6d1e7b). It was just one line of code, so you can easily backport it to the version you are currently using if you want. This is the output of my tests: https://cloud.nook24.eu/index.php/s/crx3f8MY7cwKeLW

fmgdias commented 5 years ago

The patch you committed works great (HAProxy will no longer be needed). I think it's also important to add an option for gearman_client_set_timeout, because a server may be slow without being totally unavailable.

Now I have 3 questions:

1) What kind of feedback does the broker deliver to Nagios after a check is performed? I ask because, if Nagios switches from one gearmand server to another, the result of a check may end up on a gearmand server that Nagios is not reading. Or can Nagios read from two gearmands simultaneously?

2) I noticed that even when gearmand is offline, Nagios continues to run host checks. I imagine this is being done by its local worker. Is it possible to disable this in Nagios? I would like the checks to be done exclusively by the worker. (Or did I misunderstand?)

3) Before this project I did not know CrateDB, and I am uncomfortable using it (for lack of knowledge on my part). I was thinking of using KairosDB (with ScyllaDB), or even InfluxDB. Is there any good reason to use CrateDB?

Thank you very much for the support. I'm hoping for the success of this project.

PS1: After that, you can close this issue

PS2: I have designed an SNMP collector, outside of Nagios, which can run up to 150,000 snmpget per second, using only 16 CPU, 8 RAM and no database; the results are written to a text file. I think I will publish an article about it soon. Do you think it could be useful in your project?

nook24 commented 5 years ago

I can add gearman_client_set_timeout as well.

However, I guess you are mixing up the Statusengine project with mod_gearman.

Statusengine is a project to store Nagios and Naemon events to different database backends. At the moment it supports MySQL, CrateDB and Redis for event data. Performance data (time series) can be stored to Graphite, CrateDB, MySQL and Elasticsearch. Statusengine has nothing to do with the execution of host or service checks.

To distribute the workload of huge Nagios/Naemon installations (and the load on the database), Statusengine can be distributed across multiple nodes, as I mentioned earlier. See also: https://statusengine.org/getting_started/#overview

mod_gearman, on the other hand, is a tool to distribute the execution of host and service checks across different nodes. It helps to distribute the workload of your Naemon/Nagios server. See also: https://labs.consol.de/de/nagios/mod-gearman/index.html

Both use gearmand as an in-memory queue. Statusengine Broker exports Naemon events to gearmand (Statusengine Worker consumes these). mod_gearman exports Naemon checks that need to be executed to the gearmand server (mod_gearman_worker consumes these).

In addition, both projects can be used at the same time. So you can use mod_gearman to spread the workload of check execution and still use Statusengine to store the results (or only the time series data for Grafana) to a database. (I also set up systems like this from time to time)

At the moment Statusengine does not support KairosDB, ScyllaDB or InfluxDB... The code architecture makes it easy to add new database backends, but I have personally had no use case for adding more databases...

PS2: I have designed an SNMP collector, outside of Nagios, which can run up to 150,000 snmpget per second, using only 16 CPU, 8 RAM and no database; the results are written to a text file. I think I will publish an article about it soon. Do you think it could be useful in your project?

Of course! I like to read and talk about cool Nagios setups.

fmgdias commented 5 years ago

thx man !

bshaw commented 5 years ago

Sorry to comment on a closed thread... it just happens to address exactly the info I was after in reference to mod_gearman:

both projects can be used at the same time. So you can use mod_gearman to spread the workload of check execution and still use Statusengine to store the results (or only the time series data for Grafana) to a database. (I also set up systems like this from time to time)

Are there any special considerations specific to setting up mod_gearman alongside Statusengine, or are the standard [documented] mod_gearman approaches fine?

nook24 commented 5 years ago

@bshaw

Are there any special considerations specific to setting up mod_gearman alongside Statusengine, or are the standard [documented] mod_gearman approaches fine?

Just go with the default documentation.

I usually build mod_gearman from source, but you can also use the provided packages.

If you build mod_gearman from source, you need to install libnaemon and naemon-dev first.

dpkg -i libnaemon_1.0.9.ubuntu18.04.amd64.deb naemon-dev_1.0.9.ubuntu18.04.amd64.deb

You can download the required packages from the Naemon project page: http://www.naemon.org/download/ Even though this is not a Statusengine-related topic, I will try to help you with further questions.

bshaw commented 5 years ago

Awesome, thanks! I think I have the mod_gearman stuff under control but will ping you (separate thread) if I run into any snags.
