ginoledesma / sunpower-pvs-exporter

A Prometheus Exporter for the SunPower PVS monitoring system
MIT License

Triggering CPU/session/memory leak on recent PVS6 firmware releases #11

Open aqua opened 2 years ago

aqua commented 2 years ago

This probably isn't a bug in the exporter itself, but late last year my PVS6 unit (2021.11, Build 60203 when I was diagnosing this) started getting unstable when monitored with the exporter: its status LED turns fault-orange after about a day of uptime, and metrics collection first becomes unstable and then fails entirely. I'm filing the issue here since this is where users of the exporter are likely to look, and because this is where it'll have to be worked around if SunPower doesn't fix it (which they have little incentive to do).

What appears to be happening is that the PVS6 starts to respond more slowly to the exporter's commands, until it eventually reaches the request timeout and appears to go dead. The sunpower_pvs_supervisor_cpu_loadavg metric rises on a roughly-exponential curve until that point:

[Screenshot: pvs6-leak — sunpower_pvs_supervisor_cpu_loadavg climbing over time until collection fails]

sunpower_pvs_supervisor_cpu_loadavg comes from the supervisor device's dl_cpu_load value. That name leaves some ambiguity as to whether it's CPU utilization or a load average, but I suspect the exporter's interpretation is correct, especially because a memory leak in the supervisor causing a gradual increase in demand paging is a much likelier bug than a genuine CPU leak, e.g. if the supervisor is doing some sort of inefficient traversal over an increasingly large internal data structure.

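As an aside, one way to pull that series out of Prometheus and inspect the shape of the curve is the range-query API. This is just a minimal sketch, assuming a local Prometheus at localhost:9090 and a 24-hour window (both assumptions, not anything from the exporter itself):

import time
import requests

PROM = "http://localhost:9090"   # assumption: local Prometheus instance
QUERY = "sunpower_pvs_supervisor_cpu_loadavg"

end = time.time()
start = end - 24 * 3600          # look at the last 24 hours

resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "300"},
    timeout=30,
)
resp.raise_for_status()

# Print timestamp/value pairs for each returned series.
for series in resp.json()["data"]["result"]:
    for ts, value in series["values"]:
        print(time.strftime("%Y-%m-%d %H:%M", time.localtime(ts)), value)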

This doesn't happen if the exporter isn't collecting, so I'm currently testing with a longer scrape interval to see whether the time to failure scales inversely with the scrape rate.

This doesn't affect SunPower's first-party monitoring consoles because their data feed is a periodic batch push from the PVS6 rather than polling, so thrashing in the PVS6's virtual memory typically only causes delays, not dropouts.

ginoledesma commented 2 years ago

Thanks for letting me know. Out of curiosity, how many panels do you have? I’ve seen similar behavior from someone who has 30+ panels.

Out of curiosity, does it resolve on its own, whether through some automatic mechanism (such as an out-of-memory kill), or does it require manual intervention like a reboot? I can imagine reducing the scrape frequency would just delay this.

Officially, SunPower has stopped “supporting” this API access, so I wouldn’t be surprised if they’re gonna disable it in a future build.

I’m looking at alternate ways to capture the data — whether tapping into the micro inverter stream itself or the connection back upstream.

aqua commented 2 years ago

16 panels. A relationship between inverter count and the memory load occurred to me too, especially because the DeviceList and Stop commands both tend to take ~4-6 seconds under normal conditions. I'm holding out a faint hope that the bug arises from fragile coding in the PVS6 service interface around session start/stop, and that a more generous timeout may avoid triggering the leak. Probably not, but if larger systems are more prone to it, it seems a possibility. I agree that if not, adjusting the polling interval is just going to extend the time to failure.

The PVS6 does not appear to recover on its own if the exporter is disabled, though I think the longest I've let it sit that way is about a day; there's either no watchdog in the unit or it's a very tolerant one.

With Prometheus on a 60s collection interval, time to failure after a PVS6 reset seems to be about 14 hours. The shortest my data shows is 10h. If a longer collection interval linearly extends the TTF, as I suspect, I'll toy with the timeouts a bit also.

jeeftor commented 2 years ago

Is there a reason we need to call start and stop ... doesn't the DeviceList call work w/out them? (not sure if that's related at all)

ginoledesma commented 2 years ago

Probably not. I think the start and stop calls are to indicate a locking-like behavior for configuring the PVS.
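
If someone wants to test that, here's a rough sketch of calling DeviceList on its own, with no bracketing Start/Stop. The host address is an assumption about how you reach the PVS6 (installer/LAN port), and the response shape follows community notes rather than anything official:

import requests

PVS_HOST = "http://172.27.153.1"   # assumption: PVS6 reached over its installer/LAN port
TIMEOUT = 240                      # generous deadline; the unit can be very slow once degraded

def device_list():
    # Fetch the device list directly, without Start/Stop session calls.
    resp = requests.get(
        f"{PVS_HOST}/cgi-bin/dl_cgi",
        params={"Command": "DeviceList"},
        timeout=TIMEOUT,
    )
    resp.raise_for_status()
    return resp.json().get("devices", [])

if __name__ == "__main__":
    for device in device_list():
        print(device.get("DEVICE_TYPE"), device.get("SERIAL"))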

jeeftor commented 2 years ago

That's my understanding as well

aqua commented 2 years ago

I let the unit run for a couple of days with polling reduced to 5-minute intervals and --timeout=240. After getting gradually slower over the first ~18h it seemed to stop getting worse, with call latencies up in the tens of seconds and loadavg around 80 (the status LED went orange). It ran that way for about another 24h, then abruptly cleaned itself up and began slowly degrading again as before.

With slow polls and very generous deadlines it's actually producing more or less useful data, albeit at low resolution and with a lot of dropouts: about once an hour GridProfileGet fails and the scan_time metric jumps to many minutes, perhaps when the PVS6 does its push to SunPower's service.

I'm giving it another shot with the start/stop calls disabled, and if that doesn't make things better I might also try dropping the Get_Comm calls (that is pretty consistently the slowest call, at around 50s once the unit is in the bad state, so perhaps the most likely to trip over a memory-management or driver bug).

[Screenshot: Screen Shot 2022-01-26 at 12:15:09 AM]

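For reference, this is roughly how the per-command latencies can be probed to see which call degrades first. The host is an assumption about the setup; the command names are just the ones discussed in this thread:

import time
import requests

PVS_HOST = "http://172.27.153.1"   # assumption: PVS6 installer-port address
COMMANDS = ["DeviceList", "Get_Comm", "GridProfileGet"]  # commands discussed above

for command in COMMANDS:
    started = time.monotonic()
    try:
        resp = requests.get(
            f"{PVS_HOST}/cgi-bin/dl_cgi",
            params={"Command": command},
            timeout=240,
        )
        status = resp.status_code
    except requests.RequestException as exc:
        status = f"error: {exc}"
    elapsed = time.monotonic() - started
    print(f"{command:16s} {elapsed:7.1f}s  {status}")
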
aqua commented 2 years ago

A couple more mostly-unhelpful observations:

I'm giving it a whirl with GridProfileGet disabled as well, and will let that run for a while to see whether the periodic self-recovery recurs or was a fluke.

aqua commented 2 years ago

I never did find a viable workaround (disabling GridProfileGet doesn't help), but while digging through installer manuals I noticed that the only documented reason for a steady orange status LED (which usually showed up once the loadavg got up there) is a failure to update the device. That got me thinking, and after forcing the PVS6 to update itself to 2022.4, build 60630, exporter scan times and loadavg have been low and stable for ~24h, which never happened before. So possibly the act of forcing an upgrade stopped the issue from triggering, or possibly SunPower fixed the underlying bug sometime in the last several months. I don't think it's just a failure to update, since the status LED never went orange when I wasn't polling for monitoring data, but conceivably it's a buggy interaction between the two.

For the record, the way an upgrade is forced is to first do a GET request to /cgi-bin/dl_cgi/firmware/new_version, which returns a short JSON blob like so:

{
    "url":  "https://fw-assets-pvs6-dev.dev-edp.sunpower.com/release-13.3/60630/fwup/fwup.lua"
}

…and then do a POST to /cgi-bin/dl_cgi/firmware/upgrade?url=THAT_URL. The actual update process looks like it A/B-swaps the root and kernel partitions and takes about five minutes.

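Put together as a script, the sequence looks roughly like this; the PVS host address is an assumption about your network path, while the two endpoints and the flow are the ones described above:

import requests

PVS_HOST = "http://172.27.153.1"   # assumption: PVS6 installer-port address

# Step 1: ask the PVS which firmware build it should be running.
new_version = requests.get(
    f"{PVS_HOST}/cgi-bin/dl_cgi/firmware/new_version", timeout=60
).json()
fw_url = new_version["url"]
print("firmware URL:", fw_url)

# Step 2: POST that URL back to the upgrade endpoint. Expect roughly five minutes
# of downtime while the unit A/B-swaps its root and kernel partitions and reboots.
resp = requests.post(
    f"{PVS_HOST}/cgi-bin/dl_cgi/firmware/upgrade",
    params={"url": fw_url},
    timeout=60,
)
print("upgrade request status:", resp.status_code)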