shinken-solutions / shinken

Flexible and scalable monitoring framework
http://www.shinken-monitoring.org
GNU Affero General Public License v3.0

Leaking memory on shinken-receiver #1850

Open agapoff opened 8 years ago

agapoff commented 8 years ago

I have the following configuration: an arbiter, broker, scheduler, poller, reactionner and receiver in one geographical location (realm), and a scheduler, poller, reactionner and receiver in another location (not spare nodes but active ones).

Memory is leaking on the remote receiver, several gigabytes per day, until the OOM killer terminates it.

I have tried turning on direct_routing, but then the broker starts eating memory instead (it is exhausted within a couple of hours).

How can this issue be investigated or solved?

NicolasLM commented 8 years ago

I have the same issue, I've been trying to track the leak down.

From what I see, the receiver accumulates instances of shinken.brok.Brok, one per passive check received. The passive checks are processed correctly, but the broks are somehow kept in memory.
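One way to confirm this kind of accumulation is to count live instances of the suspect class via the garbage collector. A minimal sketch, using a stand-in `Brok` class rather than the real `shinken.brok.Brok` (the counting trick works the same on the real class inside a running receiver):

```python
import gc

class Brok:
    """Stand-in for shinken.brok.Brok, just to illustrate the counting trick."""
    def __init__(self, type_, data):
        self.type = type_
        self.data = data

def count_instances(cls):
    # gc.get_objects() lists every object the garbage collector tracks
    return sum(1 for obj in gc.get_objects() if isinstance(obj, cls))

# Simulate a receiver queuing one brok per passive check result and never
# draining the list: the instance count only ever grows.
pending = [Brok('service_check_result', {'n': i}) for i in range(1000)]
print(count_instances(Brok) >= 1000)  # True
```

Calling `count_instances` periodically from a debug hook shows whether the population grows without bound or gets purged.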

geektophe commented 8 years ago

Do you have enough bandwidth between your sites to deal with your check results?

If so, I'm currently working on a patch to track down memory leaks in shinken: https://github.com/naparuba/shinken/pull/1828

Perhaps you could give it a try.

agapoff commented 8 years ago

There is enough bandwidth, and iftop doesn't show enormous throughput between the hosts. However, there is still high latency between the sites.

NicolasLM commented 8 years ago

Bandwidth does not seem to be an issue here either.

While your PR is certainly a good step forward, it does not seem related to this issue.

It seems to me that the self.broks dict in Receiver is not purged correctly. Adding a small log line to display the number of broks shows:

[1462890437] WARNING: [receiver] 4007798 broks and 0 unprocessed commands

Receiver has been running for 5 minutes.
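The purge behavior being described could be modeled with a toy sketch (class and method names here are illustrative, not Shinken's actual API): broks should be handed over to the broker *and* removed from the local buffer in the same step.

```python
class ReceiverSketch:
    """Toy model of a receiver's brok buffer; names are illustrative only."""
    def __init__(self):
        self.broks = {}      # brok id -> brok, grows with each passive check
        self._next_id = 0

    def add_brok(self, brok):
        self.broks[self._next_id] = brok
        self._next_id += 1

    def get_broks(self):
        """Hand all pending broks to the caller and purge the local buffer."""
        pending = self.broks
        self.broks = {}      # without this reset the dict grows forever
        return pending

r = ReceiverSketch()
for i in range(5):
    r.add_brok({'type': 'check_result', 'n': i})
fetched = r.get_broks()
print(len(fetched), len(r.broks))  # 5 0
```

A 4-million-entry dict after 5 minutes, as in the log above, suggests the purge step either never runs or never wins against the arrival rate.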

geektophe commented 8 years ago

Indeed, that's a huge number of broks... Could you tell us how many passive checks you have, and at what frequency they are fed?

NicolasLM commented 8 years ago

3K passive checks, every 60 seconds. I have two receivers, each of them getting half of the traffic.

olivierHa commented 8 years ago

@Vitaly,

Could you paste your Shinken configuration? Is your Broker in the "All" realm?

Regards

Olivier


geektophe commented 8 years ago

@NicolasLM That's far fewer checks than I expected for such a huge number of broks. There's something weird... I'll try to reproduce this on my dev platform.

agapoff commented 8 years ago

Broker config:

define broker {
    broker_name         broker-master
    address             localhost
    port                7772
    spare               0
    manage_arbiters     1
    manage_sub_realms   1
    timeout             3
    data_timeout        120
    max_check_attempts  3
    check_interval      60
    modules             webui2, graphite2
    use_ssl             0
    hard_ssl_name_check 0
    realm               All
}

Receiver configs:

define receiver {
    receiver_name       receiver-master
    address             localhost
    port                7773
    spare               0
    timeout             3
    data_timeout        120
    max_check_attempts  3
    check_interval      60
    modules             nsca
    use_ssl             0
    hard_ssl_name_check 0
    direct_routing      0
    realm               All
}

define receiver {
    receiver_name       receiver-london
    address             <...>
    port                7773
    modules             nsca
    spare               0
    timeout             3
    data_timeout        120
    max_check_attempts  3
    check_interval      60
    realm               London
    #direct_routing     1
}

geektophe commented 8 years ago

I ran a test on my dev platform. I looped over 4000 passive check results sent to a single receiver through the ws_arbiter module. The 4000 results are pushed in 30 to 90 seconds (my dev platform is quite loaded :) ), and the memory usage is stable.

Could you tell us if the process eating memory is the receiver daemon itself, or the NSCA module?

Do you notice the same problem on both receivers, or only on the remote one?

agapoff commented 8 years ago

The memory leak is observed only on the remote receiver. The local receiver has worked fine for months.

This process is eating the memory: python /usr/sbin/shinken-receiver -d -c /etc/shinken/daemons/receiverd.ini

geektophe commented 8 years ago

The check results (external commands, in fact) are sent from the receiver to the scheduler (through the arbiter), but the broks are gathered by the broker.

I see two different possibilities:

Regarding the second possibility, it could come from the realm settings. Could you also send your realm configuration?

NicolasLM commented 8 years ago

In my case the process leaking is the NSCA module itself; I checked that through strace. It should be possible to reproduce it with a fake receiver module doing roughly:

while True:
    self.from_q.put(ExternalCommand('whatever'))
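A self-contained harness along those lines might look like the following sketch; `ExternalCommand` and `FakeModule` here are stand-ins, not the real shinken.external_command class or module API:

```python
import queue

class ExternalCommand:
    """Stand-in for shinken.external_command.ExternalCommand."""
    def __init__(self, cmd_line):
        self.cmd_line = cmd_line

class FakeModule:
    """Floods the receiver-bound queue, as the comment above suggests."""
    def __init__(self, from_q):
        self.from_q = from_q

    def flood(self, n):
        for _ in range(n):
            self.from_q.put(ExternalCommand('whatever'))

q = queue.Queue()
FakeModule(q).flood(10000)
print(q.qsize())  # 10000: if nothing consumes the queue, memory keeps growing
```

Watching the receiver's RSS while such a module runs would isolate whether the leak lives in the NSCA module or in the daemon's queue handling.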
agapoff commented 8 years ago

There is nothing special in realms config:

define realm {
    realm_name  All
    realm_members Limassol,London
    default     1
}
define realm {
    realm_name  London
    default     0
}
define realm {
    realm_name  Limassol
    default     0
}

olivierHa commented 8 years ago

Could you add this option to your realm?

broker_complete_links 1
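For reference, a sketch of the "All" realm definition from above with the option added (placement on the realm block is the assumption here; adjust to your actual config):

```
define realm {
    realm_name              All
    realm_members           Limassol,London
    default                 1
    broker_complete_links   1
}
```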

Regards

Olivier


agapoff commented 8 years ago

I have added broker_complete_links 1 and there are no notable changes.

agapoff commented 8 years ago

I have conducted an experiment: I replaced NSCA with SNMP traps, so now I receive passive checks via snmptrapd -> named pipe -> shinken-poller. Most interestingly, I am now observing the memory leak on the remote poller. It eats memory more slowly than the receiver did, but the OOM killer will surely come one fine day.
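Both ingestion paths end up producing Nagios-style external commands, which is presumably why the leak follows the daemon that receives them. A small illustrative helper for the standard PROCESS_SERVICE_CHECK_RESULT line (the function name and pipe path are not part of Shinken; only the command format is standard):

```python
import time

def passive_check_cmd(host, service, return_code, output):
    """Format a Nagios-style PROCESS_SERVICE_CHECK_RESULT external command,
    the kind of line written into the named pipe by snmptrapd glue scripts."""
    return '[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s' % (
        int(time.time()), host, service, return_code, output)

# In production this line would be written to the daemon's command pipe,
# e.g. /var/lib/shinken/rw/nagios.cmd (path depends on your setup).
print(passive_check_cmd('myhost', 'mysvc', 0, 'OK'))
```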

geektophe commented 8 years ago

How many checks does your remote poller execute under normal conditions (without the named-pipe injection)? Does it leak under this workload?

And just to be sure, did you run an iperf/mtr test between your broker and your remote nodes? The leak you describe still makes me think there is a communication issue between the services (too much latency or packet loss) preventing the broker from downloading the broks as fast as they arrive.
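Short of a proper iperf/mtr run, a crude latency sanity check can be done from Python by timing a TCP connect to the remote daemon's port (a rough probe, not a substitute for real measurement; host and port below are placeholders):

```python
import socket
import time

def tcp_rtt_ms(host, port, timeout=3.0):
    """Very rough latency probe: time a TCP connect (one round trip plus
    handshake overhead). iperf/mtr give far better numbers."""
    start = time.time()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.time() - start) * 1000.0

# Demo against a throwaway local listener; point it at your broker or
# remote receiver instead (e.g. port 7772 or 7773).
srv = socket.socket()
srv.bind(('127.0.0.1', 0))
srv.listen(1)
print('%.1f ms' % tcp_rtt_ms('127.0.0.1', srv.getsockname()[1]))
srv.close()
```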

Sorry for all these tests, but we need to eliminate the obvious options first.

NicolasLM commented 8 years ago

If it adds value: today I noticed a server where both broker and receiver were leaking at the same time.

olivierHa commented 8 years ago

Any updates/clues/hints ?

NicolasLM commented 8 years ago

Anything new about this topic? I would be pleased if I could remove the crontab that restarts the broker and receiver twice a day.

geektophe commented 8 years ago

Hi, sorry for the late answer. Could you run a test with this PR (https://github.com/naparuba/shinken/pull/1828), setting max_q_size to 1024 and results_batch to 2048 on the pollers and reactionners, and broks_batch to 2048 on the broker in the Shinken configuration?
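A sketch of what that might look like; the parameter names come from PR #1828, and exactly which file they belong in (shinken.cfg or the daemon .ini files) is an assumption to verify against the PR itself:

```
# On pollers and reactionners:
max_q_size=1024
results_batch=2048

# On the broker:
broks_batch=2048
```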

geektophe commented 8 years ago

Any news about this issue? Could you test the linked PR?