naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0
154 stars 63 forks source link

Slow startup with many hosts/services out of check period #82

Open julianbrost opened 10 years ago

julianbrost commented 10 years ago

Naemon takes multiple minutes to start with about 100 hosts and 2000 services with check_period nopowersave as defined in this config section:

define timeperiod {
        timeperiod_name                vorlesungszeit
        2014-04-01 - 2014-07-12        00:00-24:00
        2014-10-06 - 2015-01-31        00:00-24:00
        2015-04-13 - 2015-07-18        00:00-24:00
        2015-10-12 - 2016-02-06        00:00-24:00
        2016-04-11 - 2016-07-16        00:00-24:00
        2016-10-17 - 2017-02-11        00:00-24:00
}

define timeperiod {
        timeperiod_name                semesterferien
        monday                         00:00-24:00
        tuesday                        00:00-24:00
        wednesday                      00:00-24:00
        thursday                       00:00-24:00
        friday                         00:00-24:00
        saturday                       00:00-24:00
        sunday                         00:00-24:00
        exclude                        vorlesungszeit

}

define timeperiod {
        timeperiod_name                nopowersave
        monday                         08:45-20:00
        tuesday                        08:45-20:00
        wednesday                      08:45-20:00
        thursday                       08:45-20:00
        friday                         08:45-20:00
        exclude                        semesterferien
}

The debug log mostly contains the message _get_matching_timerange(), here's a short snippet of it, service foo and host bar have check_period nopowersave configured.

[1407139826.151502] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151513] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151558] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151609] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151635] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151646] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151696] [008.1] [pid=21804] Service 'foo' on host 'bar' should not be scheduled.
[1407139826.151704] [001.0] [pid=21804] check_time_against_period()
[1407139826.151712] [001.0] [pid=21804] check_time_against_period()
[1407139826.151721] [001.0] [pid=21804] check_time_against_period()
[1407139826.151730] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151772] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151781] [001.0] [pid=21804] get_next_valid_time()
[1407139826.151789] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151801] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151812] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151891] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151939] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.151982] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152024] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152068] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152111] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152155] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152199] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152241] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152280] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152292] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152335] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152412] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152456] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152499] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152542] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152584] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152628] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152659] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152691] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152722] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152765] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152875] [001.0] [pid=21804] _get_matching_timerange()
[1407139826.152944] [001.0] [pid=21804] _get_matching_timerange()

Stracing the process gives loads of stats on /etc/localtime:

stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0

GDB backtrace while startup:

#0  0x00007f53a851ab75 in *__GI___xstat (vers=<optimized out>, name=<optimized out>, buf=0x7ffff8879000) at ../sysdeps/unix/sysv/linux/wordsize-64/xstat.c:38
#1  0x00007f53a84ea550 in __tzfile_read (file=0x7f53a859888e "/etc/localtime", extra=0, extrap=<optimized out>) at tzfile.c:173
#2  0x00007f53a84e9948 in tzset_internal (always=<optimized out>, explicit=<optimized out>) at tzset.c:439
#3  0x00007f53a84e9bc0 in __tzset () at tzset.c:592
#4  0x00007f53a84e8a09 in *__GI_mktime (tp=0x7f53a859888e) at mktime.c:515
#5  0x000000000044e13e in ?? ()
#6  0x000000000044e598 in _get_next_valid_time ()
#7  0x000000000044e701 in _get_next_valid_time ()
#8  0x000000000044e83b in get_next_valid_time ()
#9  0x0000000000432e37 in init_timing_loop ()
#10 0x0000000000417043 in main ()

Naemon version is 0.8.0 with the patch from commit ad800d8a0e14901bf4adc31231693ca0fd9fa9f2 applied, running on Debian wheezy.

ageric commented 10 years ago

On 2014-08-04 10:48, Julian wrote:

Naemon takes multiple minutes to start with about 100 hosts and 2000 services with check_period nopowersave as defined in this config section:

Yes. This is a known issue. It basically boils down to the fact that get_next_valid_time() and get_next_invalid_time() iterate 300 times in order to find a suitable timerange (ie, a valid day where there actually is one). It's more or less insane, but that's the reason for it anyways.

I'm working on changing that, but coming up with a correct and clever algorithm is proving difficult, since we really don't want the daylight savings time bug to come back again.

/Andreas