pdreker / fritz_exporter

Prometheus exporter for Fritz!Box home routers
Other
144 stars 32 forks source link

Add handling for xml error when no battery stats are reported by the device #303

Closed r0ckarong closed 3 months ago

r0ckarong commented 3 months ago

As previously mentioned here:

https://github.com/pdreker/fritz_exporter/issues/245#issuecomment-1875912021

My watchtower instance automatically updated my container to the 2.4 release and it started crashing immediately. You mentioned in your comment here that you would consider just handling the parse error. This is what this patch should do.

Unfortunately, I'm not really able to test because I just can't get poetry installed on my system and I'm still struggling to get a newer python to work since the 3.6 release I have currently won't let me run the code directly.

pdreker commented 3 months ago

I'll take it. I cannot test this "in real life" either, as I don't have any Home Automation Devices on my Fritz Box.

pdreker commented 3 months ago

NB: I assumed the exporter could be built from the Dockerfile with no external dependencies, but it seems this is not the case. I'll look into a clean Docker Multistage build, so it can be built with no local requirements aside from Docker.

r0ckarong commented 3 months ago

So far no problems with the 2.4.2 container on my side. Thanks for the quick turnaround!

r0ckarong commented 3 months ago

Correction: Not all good. The container doesn't crash and I don't see any immediate problems but the metrics are not showing up in prometheus anymore. I haven't yet found out what's wrong. The configuration for the containers is identical to before and just reverting back to the 2.3 image works fine.

pdreker commented 3 months ago

Can you check, what the exporter reports back via "curl"? curl http://<EXPORTER_IP>:9787/metrics should at least show the "usual" metrics. When I try 2.4.3 (no real changes from 2.4.2, except for a broken ARMv6/7 build I get the basic stats from my box, but obviously I do not get HA statistics, as I donÄt have any HA devices on my box here.

pdreker commented 3 months ago

Also "Setting log:level: DEBUG in the YAML config might youeld more interesting output, if not already enabled.

r0ckarong commented 3 months ago

One potential candidate for the problem:

Got IndexError for index 35, stopping

2024/03/10 13:59:01,stderr,"2024-03-10 12:59:01,335    DEBUG fritzexporter.fritzcapability | Got IndexError for index 35, stopping
"
2024/03/10 13:59:01,stderr,"2024-03-10 12:59:01,321    DEBUG fritzexporter.fritzcapability | Fetching home automation device information for index 35
"
2024/03/10 13:59:00,stderr,"2024-03-10 12:59:00,778    DEBUG fritzexporter.fritzcapability | Fetching home automation device information for index 34
"
2024/03/10 13:59:00,stderr,"2024-03-10 12:59:00,218    DEBUG fritzexporter.fritzcapability | Fetching home automation device information for index 33
"
2024/03/10 13:58:59,stderr,"2024-03-10 12:58:59,639    DEBUG fritzexporter.fritzcapability | Fetching home automation device information for index 32
... [more devices came before]

Need to look at the code what this actually does.

pdreker commented 3 months ago

The code assumes, that the device have a strictly consecuitve index (1, 2, 3, 4, ... NOT 1, 2, 4, ...). There is no way to ask for how many devices there are, so the code simply enumerates the devices until it gets an IndexError and stops there.

r0ckarong commented 3 months ago

Ok so my suspicion at the moment is that there is some timing problem between the exporter and prometheus. The new version takes WAYY longer to complete the metrics listing than the old one. In 2.3 I get <5s complete times on the curl check. For 2.4.2 it takes upwards of 23 seconds for the same thing.

I already have a 60s scrape interval but nothing ever shows up in prometheus. Right now I'm at a loss what could be happening.

pdreker commented 3 months ago

You can check the prometheus webinterface under Status -> Targets, to see what Prometheus thinks of the exporter.

If it really takes 23s to fetch all information it may be worth checking the Prometheus scrape_timeout setting. IIRC this is 15s by default.

pdreker commented 3 months ago

Against my FritzBox 5590 Fiber running with HostInfo enabled takes 6.6s for 45 devices in my WiFi. 23s is an awfully long time...

If you have 35 HA devices, 23s boils down to 650ms per device (completely ignoring all other metrics). The Fritz Boxes are not known to be super fast, so this may actually be expected. The code enumerates all devices, and then sequentially reads their status info. Soooo... This may actually be OK or to be expected.

In the end I think Prometheus scrape_timeout is my best guess at the moment.

pdreker commented 3 months ago

Uuh... reading the code explains the performance :-/ Definitely something that needs to be improved...

The exporter naively walks through the devices by enumerating them by index until it fails. If it does not fail, it will parse the whole status response from TR-064 adds a metric ton of metrics (pun intended) and then calls "getdeviceinfos" with the AIN of the device via HTTP, and again adds a ton of metrics.

So it does a lot of things and uses up two remote calls to the box per HA device. I think it would be smarter to *only use the HTTP API for the HA stuff, especially since it seems, that there is a call, which returns all devices, but not all infos for all devices...

So... yes, it currently is slow. and this needs work. Unfortunately this is not something I can do quickly, as I will have to rework the HA side of things a implement a more smart XML "Parser" to extract all the information with fewer calls.

pdreker commented 3 months ago

I have opened a new issue for this: #310