quentin-st / Freebox-OS-munin

Low-dependency Freebox OS munin plugin
GNU General Public License v2.0

Can't get a continuous tracking (part2) #7

Open bousqi opened 8 years ago

bousqi commented 8 years ago

It appears that a bug is still present in the Freebox statistics tracking. This time I have no clue about the cause (no exception or error messages).

Here is the bug's behavior: after a certain period of time (from 4 days up to 1 week), some statistics get stuck. Munin always gets the same value, while the values differ when I connect to the Freebox server. Restarting the Freebox does fix the problem, but it doesn't seem to be linked to the box itself (as the values reported on its internal web server are updated).

Here are some graphs showing areas where the values are stuck:

[Graphs: freebox_xdsl-month, freebox_switch1-month, freebox_temp-month, freebox_traffic-month]

Values are updated again once the box has been restarted. This issue only affects temperature, xDSL, traffic and switch.

Manually running the plugins gives results, but not the correct ones.

$ sudo munin-run freebox-temp
cpum.value 63.0
cpub.value 62.0
hdd.value 39.0
sw.value 52.0
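A single run can't show that a value is stuck; you need to compare successive runs. Munin plugins print one `fieldname.value VALUE` line per field (with `U` meaning unknown), so a minimal sketch for parsing this output and comparing two runs could look like the following. The helpers `parse_munin_output` and `stuck_fields` are hypothetical and not part of this repository, and a value that stays identical across runs is only a hint, not proof, since a temperature can legitimately stay flat for a while.

```python
def parse_munin_output(text):
    """Parse munin plugin fetch output ("field.value N" lines) into a dict."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        # Each data line looks like "cpum.value 63.0"; skip anything else.
        if len(parts) != 2 or not parts[0].endswith(".value"):
            continue
        field = parts[0][: -len(".value")]
        # Munin uses "U" for an unknown value.
        values[field] = None if parts[1] == "U" else float(parts[1])
    return values


def stuck_fields(previous, current):
    """Return the sorted list of fields whose value is identical in two runs."""
    return sorted(f for f in current if f in previous and previous[f] == current[f])


if __name__ == "__main__":
    run = "cpum.value 63.0\ncpub.value 62.0\nhdd.value 39.0\nsw.value 52.0"
    print(parse_munin_output(run))
```

Running `munin-run` periodically and feeding consecutive outputs to `stuck_fields` would show which fields never move.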

quentin-st commented 8 years ago

I just checked my own stats, and unfortunately I'm not experiencing this issue:

[Screenshot of my own stats]

First: are you using the latest script version? A git pull should ensure you are.

Which version of Python are you using?

The graphs you're mentioning use the /rrd endpoint. It is the only endpoint where we need to specify date_start and date_end parameters, which are computed here: main.py#257
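To illustrate why the local clock matters here, below is a rough sketch of how such an RRD query body might be built. The field names (`db`, `date_start`, `date_end`, `fields`) and the db names in the comment follow my understanding of the Freebox OS API documentation and should be double-checked against it; `build_rrd_payload` is a hypothetical helper and does not mirror main.py line for line. The key point is that the naive datetimes are turned into Unix timestamps with `time.mktime`, i.e. using the local system clock and timezone settings.

```python
import datetime
import json
import time


def build_rrd_payload(db, date_start, date_end, fields=None):
    """Return a JSON body for a Freebox OS RRD query (hypothetical helper).

    date_start / date_end are naive local datetimes, converted to Unix
    timestamps with time.mktime, so a wrong local clock or timezone
    shifts the requested window.
    """
    if date_end <= date_start:
        raise ValueError("date_end must be after date_start")
    payload = {
        "db": db,  # assumed db names: "temp", "dsl", "net", "switch"
        "date_start": int(time.mktime(date_start.timetuple())),
        "date_end": int(time.mktime(date_end.timetuple())),
    }
    if fields:
        payload["fields"] = list(fields)
    return json.dumps(payload)


if __name__ == "__main__":
    end = datetime.datetime(2016, 10, 7, 14, 10)
    print(build_rrd_payload("temp", end - datetime.timedelta(minutes=5), end))
```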

Could your system clock be out of sync from time to time? I had an issue where my Raspberry Pi's time wasn't correct because of wrong NTP settings. I think that during these periods, date_start_timestamp and date_end_timestamp aren't correctly computed.

Unfortunately for us, since the plugin doesn't think it is misbehaving, there's no error log anywhere.

bousqi commented 8 years ago

I'm running the master HEAD of your repo. I thought Python 3 was my default interpreter, but my system actually uses 2.7.9. I checked with both 3.4.2 and 2.7.9, and the results are the same (stuck values). Your NTP remark is interesting. My Raspberry Pi clock seems to be correct: same date on 3 different systems (NTP synchronized). Maybe the Freebox has a clock bias (that would explain why rebooting the Freebox fixes it, and why the bias increases over time). I'm considering it, but I don't know how to verify this. Any suggestions?
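One possible way to check for a Freebox clock bias is to compare the `Date` header returned by its web server with the local clock. This is a sketch under an unverified assumption: that the Freebox web server sends a standard HTTP `Date` header. The helper `clock_offset_seconds` is hypothetical and only does the comparison; fetching the header (for example via `urllib.request.urlopen("http://mafreebox.freebox.fr/").headers["Date"]`) is left out so the snippet runs offline. It requires Python 3 (`datetime.timezone`), and the header only has one-second resolution.

```python
import datetime
from email.utils import parsedate_to_datetime


def clock_offset_seconds(date_header, local_now=None):
    """Return (remote - local) in seconds, from an HTTP Date header.

    A positive result means the remote clock is ahead of this machine.
    """
    remote = parsedate_to_datetime(date_header)
    if remote.tzinfo is None:
        # Headers such as "-0000" parse as naive; treat them as UTC.
        remote = remote.replace(tzinfo=datetime.timezone.utc)
    if local_now is None:
        local_now = datetime.datetime.now(datetime.timezone.utc)
    return (remote - local_now).total_seconds()


if __name__ == "__main__":
    local = datetime.datetime(2016, 10, 7, 12, 12, 37, tzinfo=datetime.timezone.utc)
    print(clock_offset_seconds("Fri, 07 Oct 2016 12:15:37 GMT", local))
```

Logging this offset every few hours would show whether the bias grows until the next reboot.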

bousqi commented 8 years ago

Funny thing: the graphs are also lost on the Freebox side... So it might not be an issue with fetching values, but rather the plugin breaking the tracking on the Freebox.

[Screenshot: fbox_temp]

I hadn't noticed until now because I was only checking the temperature on the first page, where the values are OK:

[Screenshot: fbox_temp2]

Freebox server version is 3.3.3 (up to date).

quentin-st commented 8 years ago

Sure! Just create a test.py file with the following content:

import datetime

now = datetime.datetime.now()  # math.ceil(time.time())
now = now.replace(second=0, microsecond=0)
date_end = now.replace(minute=now.minute - now.minute % 5)  # Round to lowest 5 minutes
date_start = now - datetime.timedelta(minutes=5)  # Remove 5 minutes from date_end

print(date_end)
print(date_start)

Make it executable with chmod and run it (with the same Python version munin uses, to be sure), then check whether the dates are correct.

bousqi commented 8 years ago

rpi-stable:/usr/local/src/munin-freebox (master) $ date
Fri Oct 7 14:12:37 CEST 2016
rpi-stable:/usr/local/src/munin-freebox (master) $ ./date.py
2016-10-07 14:10:00
2016-10-07 14:07:00

quentin-st commented 8 years ago

About your temperature screenshot: that's really weird indeed. Maybe repeated API calls break data storage on the Freebox side, so our script can't read these values correctly (or the API returns a stale value when there is none).

Could you try disabling our script for now and check whether Freebox OS's stats go back to normal?

About your script output: the dates seem to be OK.

Edit: actually, these dates are not OK. You should get this:
2016-10-07 14:10:00
2016-10-07 14:05:00

bousqi commented 8 years ago

All Freebox plugins have been removed and munin-node restarted. I'll wait a few minutes/hours to check whether the Freebox graphs come back.

quentin-st commented 8 years ago

Alright, I'm fixing the dates issue, which isn't really an issue as long as the script runs when the minute component of the current time is a multiple of 5.
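The bug in the test script is that `date_start` is computed from `now` instead of from the rounded `date_end`, so the window only spans 5 minutes when `now` already sits on a 5-minute boundary. A minimal sketch of the corrected computation follows; `compute_window` is a hypothetical name, and this is not necessarily the exact code committed to main.py.

```python
import datetime


def compute_window(now=None, step_minutes=5):
    """Return (date_start, date_end) aligned on a step_minutes boundary."""
    if now is None:
        now = datetime.datetime.now()
    now = now.replace(second=0, microsecond=0)
    # Round down to the last multiple of step_minutes (e.g. 14:12 -> 14:10).
    date_end = now.replace(minute=now.minute - now.minute % step_minutes)
    # Subtract from date_end, not from now (e.g. 14:10 -> 14:05).
    date_start = date_end - datetime.timedelta(minutes=step_minutes)
    return date_start, date_end


if __name__ == "__main__":
    start, end = compute_window(datetime.datetime(2016, 10, 7, 14, 12, 37))
    print(end)
    print(start)
```

With the 14:12:37 timestamp from the run above, this prints 2016-10-07 14:10:00 and then 2016-10-07 14:05:00, the expected pair.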

bousqi commented 8 years ago

How many plugins are enabled on your munin server? Is your Freebox a classic or an optical one? Latest firmware?

quentin-st commented 8 years ago

44 plugins, Freebox Revolution, latest firmware

(tip: I'm using Material-Freebox-OS to spare my eyes when browsing Freebox OS)

bousqi commented 8 years ago

Freebox Server (r2)?

So far, the graphs are still dead on the Freebox. I'll reboot it later. I guess I'll have to re-enable the plugins one by one to find out which one breaks the tracking on the box.

About the rrd queries: is it possible that Munin makes too many concurrent queries to the rrd API? Would it be possible to process them sequentially rather than in parallel? What would be your approach to identifying the origin of this problem?
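If concurrent queries did turn out to be the culprit, the plugins themselves could serialize their API calls even though munin's scheduling can't be changed, for example by taking an exclusive file lock shared by every freebox plugin before querying. This is only a sketch: the lock path and the `run_serialized` helper are hypothetical, nothing like it exists in the plugin today, and `fcntl` is Unix-only (fine on a Raspberry Pi).

```python
import fcntl
import os
import tempfile

# Hypothetical lock file shared by every freebox plugin process.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "freebox-munin.lock")


def run_serialized(func, *args, **kwargs):
    """Call func while holding an exclusive lock on LOCK_PATH.

    Other plugin processes calling run_serialized block until the lock
    is released, so only one of them talks to the Freebox API at a time.
    """
    with open(LOCK_PATH, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return func(*args, **kwargs)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


if __name__ == "__main__":
    print(run_serialized(lambda: "queried"))
```

Each plugin would wrap its RRD request in `run_serialized`; the lock is released automatically even if the request raises.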

quentin-st commented 8 years ago

So the stats on the Freebox are still broken. We can probably expect everything to be OK after a reboot.

I've never heard of too many concurrent queries being a problem with the rrd API, even with Freebox Stats... I'm not really sure whether munin calls each plugin sequentially or in parallel. However, since munin is responsible for this logic and we cannot override it, we have no room to maneuver here.

We don't have access to any Freebox OS logs either, so our only option seems to be opening an issue on Freebox OS's bug tracker.