meshtastic / firmware

Meshtastic device firmware
https://meshtastic.org
GNU General Public License v3.0

[Bug]: config file corruption seen twice on a wm1110 board #4184

Open geeksville opened 6 days ago

geeksville commented 6 days ago

Category

Other

Hardware

Other

Firmware Version

2.3.14.c67a9dfe

Description

I bet it is not wm1110 specific. Occurred while doing hundreds of power cycles.

I bet the best way to find/fix it is to turn off our "if config read fails or is corrupted, completely factory reset and try again" behavior. Instead, on a failed read we should spin so the state can be inspected in the ICE debugger.

Notes from chat:

@thebentern re: the mystery "lora." config got toasted thing that happened to my wm1110 board: it happened again. Just FYI, I'll keep an eye on it and add more instrumentation while I'm doing my other stuff, but possibly there is some badness somewhere. Possibly I'm also just inadvertently stress testing, because I'm cycling this board through >100 power cycles in different configs (but none of that should have led us to corrupt our flash fs). I only noticed because my power.powermon_enables field also got toasted.

thebentern - Today at 6:26 PM

I covet your experienced eyes on that issue, because so far it's been elusive and seemingly random.

geeksville - Today at 6:37 PM

Hmm, rather than mystery corruption, I wonder if there is a bug in the adafruit nrf52 fake filesystem stuff. After I finish power crap (about another week?) I'll try to make a robust stress test and leave it running while ICEd.

Relevant log output

No response

geeksville commented 6 days ago

I might (hopefully will) look into this in a week or two.

geeksville commented 6 days ago

This same corruption probably exposed #4167 a couple of days ago.

geeksville commented 5 days ago

Hmm, this is less than perfect (though quite possibly unrelated to the problem):

        // brief window of risk here ;-)
        if (FSCom.exists(filename) && !FSCom.remove(filename)) {
            LOG_WARN("Can't remove old pref file\n");
        }
        if (!renameFile(filenameTmp.c_str(), filename)) {
            LOG_ERROR("Error: can't rename new pref file\n");
        }

We could eliminate this window of risk by renaming file.new to file.good, then removing file, then renaming file.good to filename (a 3-stage commit). Then at load time, if we ever see file.good existing, we know that we lost power during that window, and file.good should be used instead of file (and file should be deleted at that point).
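
A rough host-side sketch of that 3-stage commit, to make the ordering concrete. This is not the firmware's code: it assumes C++17 `std::filesystem` as a stand-in for `FSCom`, and `writeWhole`, `savePrefs`, `recoverPrefs` and the `.tmp`/`.good` suffixes are hypothetical names chosen for illustration.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem; // assumption: std::filesystem stands in for FSCom

// Hypothetical helper: write `data` to `path`, return false on any I/O error.
static bool writeWhole(const fs::path &path, const std::string &data)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << data;
    out.close();
    return !out.fail();
}

// Hypothetical 3-stage commit: tmp -> good, remove old file, good -> final name.
bool savePrefs(const fs::path &filename, const std::string &data)
{
    fs::path tmp = filename;
    tmp += ".tmp";
    fs::path good = filename;
    good += ".good";
    std::error_code ec;

    if (!writeWhole(tmp, data))
        return false;

    fs::rename(tmp, good, ec); // stage 1: a complete new copy now exists as .good
    if (ec)
        return false;

    fs::remove(filename, ec); // stage 2: a missing old file is not an error
    if (ec)
        return false;

    fs::rename(good, filename, ec); // stage 3: publish under the real name
    return !ec;
}

// Hypothetical load-time check: a leftover .good file means power was lost
// after stage 1, so it replaces whatever is stored under the real name.
// Returns true if a recovery was performed.
bool recoverPrefs(const fs::path &filename)
{
    fs::path good = filename;
    good += ".good";
    std::error_code ec;

    if (!fs::exists(good, ec))
        return false;

    fs::remove(filename, ec);
    if (ec)
        return false;

    fs::rename(good, filename, ec);
    return !ec;
}
```

The idea would be to call `recoverPrefs(filename)` before reading the file at boot. Whether the on-device filesystem's remove and rename behave the same way under power loss is exactly the open question here, so treat this only as an illustration of the ordering.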

But this might not actually be the bug, so I'll wait until I look into this and somehow make a reproducible failure.