Closed robert-hh closed 3 years ago
Continuing that test:
The core dump is related to the heartbeat RGB LED operation and is caused by a race condition between switching the LED on and off. Under heavy flash use, the "On" cycle seems to be delayed long enough to collide with the "Off" cycle. ~~The simplest cure seems to be to change lines 136 and 142 from:
led_set_color(&led_info, false, false);
to
led_set_color(&led_info, true, false);
With the change, the new LED setting is not executed before the previous one has finished. That test is running at the moment;~~ Since the crash is relatively rare, it will take another 24 hours to gain sufficient confidence in the result. Another test I did was simply to extend the "On" phase from 80 ms to 200 ms, which also ran fine over a period of 24 hours.
Edit: This simple fix did not work :-( Trying another option now. Edit 2: The LoPy4 failed with the initial code after 2.1 million cycles.
Another small code change has now kept it running for 48 hours and about 1 million cycles. Even though I am not 100% confident that the change fixes the root cause of the fault, it keeps the fault from happening. The change is again in mperror.c, in the function mperror_heartbeat_signal(). I'll open a PR for it on Monday, after letting the test run for another ~48 hours.
bool mperror_heartbeat_signal (void) {
    if (mperror_heart_beat.do_disable) {
        mperror_heart_beat.do_disable = false;
    } else if (mperror_heart_beat.enabled) {
        if (!mperror_heart_beat.beating) {
            if (mp_hal_ticks_ms_non_blocking() > mperror_heart_beat.off_time) {
                led_info.color.value = MPERROR_HEARTBEAT_COLOR;
                led_set_color(&led_info, true, false);
                mperror_heart_beat.beating = true;
                mperror_heart_beat.on_time = mp_hal_ticks_ms_non_blocking() + MPERROR_HEARTBEAT_ON_MS;
            }
        } else {
            if (mp_hal_ticks_ms_non_blocking() > mperror_heart_beat.on_time) {
                led_info.color.value = 0;
                led_set_color(&led_info, true, false);
                mperror_heart_beat.beating = false;
                mperror_heart_beat.off_time = mp_hal_ticks_ms_non_blocking() + MPERROR_HEARTBEAT_OFF_MS;
            }
        }
    }
    // let the CPU save some power
    return true;
}
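For readers who want to follow the on/off timing without the hardware, the state machine above can be approximated on a host PC. This is only a rough sketch, not the firmware code: `fake_ticks_ms()`, `fake_led_set()`, `heartbeat_step()` and `simulate_heartbeat()` are hypothetical stand-ins, the 80 ms "On" time is the value mentioned earlier in this thread, and the 4000 ms "Off" time is an assumed placeholder.

```c
#include <stdbool.h>
#include <stdint.h>

/* 80 ms "On" is the value quoted in the thread; the "Off" value is an assumption. */
#define HEARTBEAT_ON_MS   80u
#define HEARTBEAT_OFF_MS  4000u

typedef struct {
    bool enabled;
    bool beating;       /* true while the LED is in its "On" phase */
    uint32_t on_time;   /* tick after which the LED is switched off */
    uint32_t off_time;  /* tick after which the LED is switched on */
} heartbeat_t;

static uint32_t fake_now_ms;    /* simulated millisecond clock (hypothetical) */
static unsigned led_on_count;   /* number of "On" requests issued */
static unsigned led_off_count;  /* number of "Off" requests issued */

/* Hypothetical stand-in for the firmware's tick source. */
static uint32_t fake_ticks_ms(void) { return fake_now_ms; }

/* Hypothetical stand-in for the LED driver call: it only counts requests. */
static void fake_led_set(bool on) {
    if (on) {
        led_on_count++;
    } else {
        led_off_count++;
    }
}

/* One pass of the heartbeat logic, same structure as the function above. */
static void heartbeat_step(heartbeat_t *hb) {
    if (!hb->enabled) {
        return;
    }
    if (!hb->beating) {
        if (fake_ticks_ms() > hb->off_time) {
            fake_led_set(true);
            hb->beating = true;
            hb->on_time = fake_ticks_ms() + HEARTBEAT_ON_MS;
        }
    } else {
        if (fake_ticks_ms() > hb->on_time) {
            fake_led_set(false);
            hb->beating = false;
            hb->off_time = fake_ticks_ms() + HEARTBEAT_OFF_MS;
        }
    }
}

/* Advance the simulated clock one millisecond at a time and step the logic. */
static void simulate_heartbeat(heartbeat_t *hb, uint32_t duration_ms) {
    for (uint32_t i = 0; i < duration_ms; i++) {
        fake_now_ms++;
        heartbeat_step(hb);
    }
}
```

Calling `simulate_heartbeat()` on a `heartbeat_t` with `enabled` set and then reading `led_on_count` and `led_off_count` shows how many LED requests the logic issues in a given time span. With these assumed timings every "On" request is followed by an "Off" request roughly 80 ms later, which is the window in which the two transmissions could collide.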
@robert-hh: May I ask whether this last change solved the problem?
@geza-pycom
> May I ask whether this last change solved the problem?
Yes, it did.
Thanks, I am closing this issue as the corresponding PR will be part of an upcoming release.
Firmware: 1.20.2.r4, WiPy3, with a change for block-level wear leveling and an RGB LED guard. The error happens during a long-term test that writes the same short file several million times in order to verify the wear leveling change. A core dump happens at random intervals, on average about every 50_000 file open/write/close cycles: it may happen after a few thousand cycles or after 100_000 cycles. Test code:
The backtrace of a typical core dump looks like:
I decoded 6 core dumps. They all happen on a file close, and the exception always happens in the rmt_tx interrupt, which is served by Core 1. RMT is activated by the heartbeat in the main loop, which is served by Core 0. The Core 0 backtrace shows that, at the time of the heartbeat, the code is working on sending an RGB LED flash. Whether still or again is unclear.
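The suspected collision between two LED transmissions can be illustrated with a small host-side model. This is a sketch under stated assumptions, not the RMT driver: `fake_tx_t`, `fake_tx_send()`, `run_on_off_pairs()` and the 5-tick frame duration are all hypothetical. It only shows why starting a new frame while the previous one is still being sent counts as a collision, and how waiting for the previous frame avoids it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: one LED frame keeps the transmitter busy for this many ticks. */
#define TX_DURATION_TICKS 5u

typedef struct {
    uint32_t now;         /* simulated time in ticks */
    uint32_t busy_until;  /* tick at which the current frame is finished */
    unsigned collisions;  /* frames started while the previous one was active */
} fake_tx_t;

/* True while a previously started frame has not finished yet. */
static bool fake_tx_busy(const fake_tx_t *tx) {
    return tx->now < tx->busy_until;
}

/* Start a frame. With wait_for_previous set, simulated time passes until the
 * transmitter is idle; without it, a frame started while busy is a collision. */
static void fake_tx_send(fake_tx_t *tx, bool wait_for_previous) {
    if (wait_for_previous) {
        while (fake_tx_busy(tx)) {
            tx->now++;
        }
    } else if (fake_tx_busy(tx)) {
        tx->collisions++;
    }
    tx->busy_until = tx->now + TX_DURATION_TICKS;
}

/* Issue "On"/"Off" request pairs with gap_ticks between the two requests
 * and return the number of collisions seen. */
static unsigned run_on_off_pairs(unsigned pairs, uint32_t gap_ticks,
                                 bool wait_for_previous) {
    fake_tx_t tx = { 0, 0, 0 };
    for (unsigned i = 0; i < pairs; i++) {
        fake_tx_send(&tx, wait_for_previous);  /* "On" request */
        tx.now += gap_ticks;                   /* delay until the "Off" request */
        fake_tx_send(&tx, wait_for_previous);  /* "Off" request */
        tx.now += 100;                         /* long idle phase between beats */
    }
    return tx.collisions;
}
```

In this model, a gap shorter than the frame duration produces one collision per pair unless `wait_for_previous` is set. Whether the real crash follows exactly this pattern is not established by the backtraces.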