mocdaniel / dashing-icinga2

Dashing dashboard for Icinga 2 using the REST API

Performance issue - many jobs run in parallel #36

Closed blacks7 closed 7 years ago

blacks7 commented 7 years ago

Good morning,

we split our Dashing checks across many .rb files (30 files) to isolate the code for every service in a separate file. With this setup we get performance issues on the Icinga server, with a load of 15 and higher, because all files run in parallel. Is there a way to configure Dashing and reduce the parallelism?

Here is an example of one file; every service monitor looks the same.

require './lib/icinga2'

icinga = Icinga2.new('config/icinga2.json') # fixed path
SCHEDULER.every '75s', :first_in => 0 do |job|
    icinga.run

    ### Icinga Event API
    service_fileshare_status = "OK"
    color_service_fileshare_status = "green"

    begin
        service_output = "------------ FileShare -------------------------\n"
        obj_service_fileshare = icinga.getServiceObjects([ "__name", "state", "acknowledgement" ], 'host.vars.application == "FileShare"', [ "host.name", "service.state", "service.acknowledgement" ])

        service_stats = []
        obj_service_fileshare.each do |service|
            service_stats.push({ "label" => service["attrs"]["__name"], "value" => service["attrs"]["state"] ? 'OK' : 'CRITICAL' })

            service_output +=  "  Service: " + service["attrs"]["__name"].to_s + "\n"

            if service["attrs"]["acknowledgement"].to_i == 0
                if service["attrs"]["state"].to_i != 0 && service["attrs"]["state"].to_i != 3
                    if service["attrs"]["state"].to_i == 1
                        service_fileshare_status = "HANDLE"
                        color_service_fileshare_status = "yellow"
                    else
                        service_fileshare_status = "CRIT"
                        color_service_fileshare_status = "red"
                    end
                end
            end
        end
        service_output += "  Result FileShare - Services: " + service_stats.size.to_s + " / " + service_fileshare_status.to_s + "\n"
    rescue
        service_output += "  Result FileShare - Services: Error collecting data\n"
    end

    service_output += "---------------------------------------------\n\n"

    puts service_output.to_s

    ### Icinga Dashing Event Handler
    send_event('icinga-service-fileshare-status', {
        value: service_fileshare_status.to_s,
        color: color_service_fileshare_status })
end
dnsmichi commented 7 years ago

Jobs are executed by the underlying Thin server. You'll need to investigate why these connections take so long to fetch data (e.g. by adding start and end timestamps to the logging).
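For illustration, a minimal sketch of that timing suggestion might look like this (reusing the SCHEDULER and icinga objects from the snippet above; the log prefix is made up):

SCHEDULER.every '75s', :first_in => 0 do |job|
    started = Time.now
    icinga.run

    obj_service_fileshare = icinga.getServiceObjects(
        [ "__name", "state", "acknowledgement" ],
        'host.vars.application == "FileShare"',
        [ "host.name", "service.state", "service.acknowledgement" ])

    # log how long this job spent fetching data from the API
    puts "[fileshare] fetch took #{(Time.now - started).round(2)}s"

    # ... event handling as before ...
end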

bodsch commented 7 years ago

@blacks7 Do you create more than one Icinga client instance (one per Ruby file) or only one? Do you create many schedulers (one per Ruby file) or only one?

As a programmer, I know that the Icinga library used here is not optimized for performance (I think it still has PoC status). I analyzed a couple of functions and see potential for optimization (and I will create a set of pull requests). Second, Thin is not the best choice for a Ruby-based web server; I plan to test Puma as an alternative.

And last, the hash accesses above are also not optimized: repeated lookups of the same field (e.g. service["attrs"]["__name"]) should be stored in a variable. I think a helper function exists for accessing service["attrs"]["state"], but I'm not sure.
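As a rough sketch of that point (only illustrative, not a tested patch), the loop from the opening post could read the nested hash once per iteration:

obj_service_fileshare.each do |service|
    attrs = service["attrs"]   # read the nested hash once
    name  = attrs["__name"].to_s
    state = attrs["state"].to_i

    service_stats.push({ "label" => name, "value" => state == 0 ? 'OK' : 'CRITICAL' })
    service_output += "  Service: " + name + "\n"
end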

dnsmichi commented 7 years ago

No, the library has left PoC status. It is just that I am not a good Ruby coder. If anyone wants to increase performance or propose better code style, feel free to do so. I'll guide you if I can, but won't do it myself. If someone gives my employer money for that, I could probably do it :p

blacks7 commented 7 years ago

The code you see in the opening post is the same in every file; only the services we check are different. Now we have set every file to a different interval (60s - 120s) to reduce the performance issue.

dnsmichi commented 7 years ago

Ah, now I get it. You have 30 jobs running, each with a different query and event handling. That means 30 data fetchers, possibly executed in the very same second by Dashing. You'll probably need some scheduled delays here.
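One way to add such delays (the offsets and file names below are only illustrative) would be to give each job file its own :first_in offset, so the schedulers do not all fire in the same second:

# jobs/fileshare.rb
SCHEDULER.every '75s', :first_in => '5s' do |job|
    # ... fetch and send_event as before ...
end

# jobs/database.rb
SCHEDULER.every '75s', :first_in => '15s' do |job|
    # ... fetch and send_event as before ...
end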

I doubt that this is related to this project though, as we just provide one dashboard and job runner. I've never experimented with multiple jobs being run, as I prefer to keep things pulled in just once.

30 jobs will also fire 30 API queries which fetch all service objects - this is redundant and expensive, CPU- and memory-wise.
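A rough sketch of that consolidation (the file layout, group names and widget ids are assumptions, not part of this project) would be a single job that runs icinga.run once per cycle, loops over the application groups, and emits one event per group. A single unfiltered query grouped in Ruby would cut API calls further, but that depends on library details not shown in this thread.

require './lib/icinga2'

icinga = Icinga2.new('config/icinga2.json')

SCHEDULER.every '75s', :first_in => 0 do |job|
    icinga.run

    [ "FileShare", "Database", "Mail" ].each do |app|
        services = icinga.getServiceObjects(
            [ "__name", "state", "acknowledgement" ],
            "host.vars.application == \"#{app}\"",
            [ "host.name", "service.state", "service.acknowledgement" ])

        # pick the worst state in the group (0 = OK, 1 = WARNING, 2 = CRITICAL)
        worst = services.map { |s| s["attrs"]["state"].to_i }.max || 0

        send_event("icinga-service-#{app.downcase}-status", {
            value: worst == 0 ? "OK" : (worst == 1 ? "HANDLE" : "CRIT"),
            color: worst == 0 ? "green" : (worst == 1 ? "yellow" : "red") })
    end
end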

You might dig into general dashing forums, and ask about multiple jobs and stale connections and whatnot.

bodsch commented 7 years ago

@dnsmichi I will see what I can do ... ;)

dnsmichi commented 7 years ago

@bodsch I don't see much you can do here. Having multiple jobs which fetch redundant data is a bad design, and should be changed.

dnsmichi commented 7 years ago

I'm closing this here; feel free to jump into the Dashing and Icinga community channels to discuss parallel jobs further.