pycom / pycom-micropython-sigfox

A fork of MicroPython with the ESP32 port customized to run on Pycom's IoT multi-network modules.
MIT License
200 stars 166 forks source link

LFS: Core dumps on heavy file operations #518

Closed robert-hh closed 3 years ago

robert-hh commented 3 years ago

Firmware: 1.20.2.r4, WiPy3, with a change for block level wear leveling and RGB led guard. The error happens during a long term test, aiming at writing the same short file several million times, in order to verify the wear leveling change. At random intervals, about every 50_000 file open/write/close cycles, a core dump happens. It may happen after a few thousand cycles, or after 100_000 cycles. Test code:

import time
line = "*" * 63 + "\n"
b = 31 * line

def write(name, n):
    f = open(name, "w")
    f.write("%8d\n" % n)
    f.write(b)
    f.close()

def run1(start=1, end=10_000_000):
    try:
        f = open("logfile")
        l = f.readline()
        f.close()
        n = int(l.strip())
    except:
        n = start
    print(n, "Restart ")

    while n < end:
        print(n, "--- ")
        write("logfile", n)
        n += 1
        time.sleep(0.1)

The backtrace of a typical core dump looks like. I decoded 6 core dumps. They all happen on a file close. And the exception always happens at the rmt_tx interrupt, which is served by Core1. RMT is activated by the heartbeat in the main loop, which is served by Core 0. The backtrace of Core shows, that at the time of the heartbeat the core 0code is working on sending a RGB led flash. Whether still or again, is unclear.

Guru Meditation Error: Core  1 panic'ed (Interrupt wdt timeout on CPU1)
Core 1 register dump:
PC      : 0x4009602e  PS      : 0x00060034  A0      : 0x800952ec  A1      : 0x3ffbf4f0  
A2      : 0x3ffe2b4c  A3      : 0x00060023  A4      : 0x00000001  A5      : 0x0000cdcd  
A6      : 0xb33fffff  A7      : 0x3ffdd385  A8      : 0x0000cdcd  A9      : 0x00000000  
A10     : 0x0000039f  A11     : 0x0000abab  A12     : 0x3f40df58  A13     : 0x00000000  
A14     : 0x3f40df40  A15     : 0x01f12318  SAR     : 0x00000003  EXCCAUSE: 0x00000006  
EXCVADDR: 0x00000000  LBEG    : 0x40093a80  LEND    : 0x40093a93  LCOUNT  : 0x00000000  
Core 1 was running in ISR context:
EPC1    : 0x400f5227  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x4009602e

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x4009602e:0x3ffbf4f0 0x400952e9:0x3ffbf520 0x40088749:0x3ffbf540 0x40084b3d:0x3ffbf580 0x400f5224:0x3ffdd3b0 0x400f557a:0x3ffdd3e0 0x400f7375:0x3ffdd460 0x400f7a35:0x3ffdd520 0x400f7dc9:0x3ffdd5d0 0x400f7df5:0x3ffdd610 0x400f89c1:0x3ffdd630 0x400f9672:0x3ffdd650 0x400f9738:0x3ffdd670 0x4010c991:0x3ffdd690 0x40107759:0x3ffdd6c0 0x40103b61:0x3ffdd6e0 0x40103bf1:0x3ffdd700 0x40110a57:0x3ffdd720 0x4010786c:0x3ffdd7c0 0x40103b61:0x3ffdd830 0x401109c9:0x3ffdd850 0x4010786c:0x3ffdd8f0 0x40103b61:0x3ffdd920 0x40103bf1:0x3ffdd940 0x40110a57:0x3ffdd960 0x4010786c:0x3ffdda00 0x40103b61:0x3ffdda80 0x40103b8e:0x3ffddaa0 0x400e6f15:0x3ffddac0 0x400e71a5:0x3ffddb60 0x400e5dd3:0x3ffddb80

Core 0 register dump:
PC      : 0x40097d80  PS      : 0x00060c34  A0      : 0x80096b48  A1      : 0x3ffbb960  
A2      : 0x3ffe2b24  A3      : 0x3ffbbb28  A4      : 0x3ffbbb0c  A5      : 0x00000001  
A6      : 0x000000fe  A7      : 0x00060023  A8      : 0x3ffbbb28  A9      : 0x00000019  
A10     : 0x3ffbbb28  A11     : 0x00000019  A12     : 0xb33fffff  A13     : 0x0000cdcd  
A14     : 0x000000fe  A15     : 0x00000000  SAR     : 0x0000000a  EXCCAUSE: 0x00000006  
EXCVADDR: 0x00000000  LBEG    : 0x400f27a8  LEND    : 0x400f27eb  LCOUNT  : 0x00000000  

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x40097d80:0x3ffbb960 0x40096b45:0x3ffbb980 0x40095473:0x3ffbb9a0 0x401690f1:0x3ffbb9e0 0x400f27fa:0x3ffbba10 0x400f9bfe:0x3ffbba30 0x400d3cf1:0x3ffbba50 0x400967bd:0x3ffbba70

------------------------------------------------
Core 1 Backtrace
--------------------------------------------

BT-0: 0x4009602e is in vTaskEnterCritical (/home/peter/docs/pycom-esp-idf/components/freertos/include/freertos/portmacro.h:355).
355     __asm__ __volatile__ (
BT-1: 0x400952e9 is in xQueueGiveFromISR (/home/peter/docs/pycom-esp-idf/components/freertos/queue.c:1324).
1324            taskENTER_CRITICAL_ISR(&pxQueue->mux);
BT-2: 0x40088749 is in rmt_driver_isr_default (/home/peter/docs/pycom-esp-idf/components/driver/rmt.c:579).
579                         xSemaphoreGiveFromISR(p_rmt->tx_sem, &HPTaskAwoken);
BT-3: 0x40084b3d is at /home/peter/docs/pycom-esp-idf/components/freertos/xtensa_vectors.S:1154.
1154        dispatch_c_isr 1 XCHAL_INTLEVEL1_MASK
BT-4: 0x400f5224 is in lfs_bd_read (littlefs/lfs.c:62).
62                  memcpy(data, &rcache->buffer[off-rcache->off], diff);
BT-5: 0x400f557a is in lfs_dir_traverse (littlefs/lfs.c:754).
754             int err = lfs_bd_read(lfs,
BT-6: 0x400f7375 is in lfs_dir_compact (littlefs/lfs.c:1702).
1702                err = lfs_dir_traverse(lfs,
BT-7: 0x400f7a35 is in lfs_dir_commit (littlefs/lfs.c:1954).
1954            int err = lfs_dir_compact(lfs, dir, attrs, attrcount,
BT-8: 0x400f7dc9 is in lfs_file_rawsync (littlefs/lfs.c:2839).
2839            err = lfs_dir_commit(lfs, &file->m, LFS_MKATTRS(
BT-9: 0x400f7df5 is in lfs_file_rawclose (littlefs/lfs.c:2620).
2620        int err = lfs_file_rawsync(lfs, file);
BT-10: 0x400f89c1 is in lfs_file_close (littlefs/lfs.c:5089).
5089        err = lfs_file_rawclose(lfs, file);
BT-11: 0x400f9672 is in littlefs_close_common_helper (littlefs/vfs_littlefs.c:326).
326     int res = lfs_file_close(lfs, fp);
BT-12: 0x400f9738 is in file_obj_ioctl (littlefs/vfs_littlefs_file.c:105).
105             int res = littlefs_close_common_helper(&self->littlefs->lfs, &self->fp, &self->cfg, &self->timestamp_update);
BT-13: 0x4010c991 is in mp_stream_close (../py/stream.c:422).
422     mp_uint_t res = stream_p->ioctl(stream, MP_STREAM_CLOSE, 0, &error);
BT-14: 0x40107759 is in fun_builtin_1_call (../py/objfun.c:70).
70      return self->fun._1(args[0]);
BT-15: 0x40103b61 is in mp_call_function_n_kw (../py/runtime.c:624).
624         return type->call(fun_in, n_args, n_kw, args);
BT-16: 0x40103bf1 is in mp_call_method_n_kw (../py/runtime.c:640).
640     return mp_call_function_n_kw(args[0], n_args + adjust, n_kw, args + 2 - adjust);
BT-17: 0x40110a57 is in mp_execute_bytecode (../py/vm.c:1002).
1002                        SET_TOP(mp_call_method_n_kw(unum & 0xff, (unum >> 8) & 0xff, sp));
BT-18: 0x4010786c is in fun_bc_call (../py/objfun.c:287).
287     mp_vm_return_kind_t vm_return_kind = mp_execute_bytecode(code_state, MP_OBJ_NULL);
BT-19: 0x40103b61 is in mp_call_function_n_kw (../py/runtime.c:624).
624         return type->call(fun_in, n_args, n_kw, args);
BT-20: 0x401109c9 is in mp_execute_bytecode (../py/vm.c:919).
919                     SET_TOP(mp_call_function_n_kw(*sp, unum & 0xff, (unum >> 8) & 0xff, sp + 1));
BT-21: 0x4010786c is in fun_bc_call (../py/objfun.c:287).
287     mp_vm_return_kind_t vm_return_kind = mp_execute_bytecode(code_state, MP_OBJ_NULL);
BT-22: 0x40103b61 is in mp_call_function_n_kw (../py/runtime.c:624).
624         return type->call(fun_in, n_args, n_kw, args);
BT-23: 0x40103bf1 is in mp_call_method_n_kw (../py/runtime.c:640).
640     return mp_call_function_n_kw(args[0], n_args + adjust, n_kw, args + 2 - adjust);
BT-24: 0x40110a57 is in mp_execute_bytecode (../py/vm.c:1002).
1002                        SET_TOP(mp_call_method_n_kw(unum & 0xff, (unum >> 8) & 0xff, sp));
BT-25: 0x4010786c is in fun_bc_call (../py/objfun.c:287).
287     mp_vm_return_kind_t vm_return_kind = mp_execute_bytecode(code_state, MP_OBJ_NULL);
BT-26: 0x40103b61 is in mp_call_function_n_kw (../py/runtime.c:624).
624         return type->call(fun_in, n_args, n_kw, args);
BT-27: 0x40103b8e is in mp_call_function_0 (../py/runtime.c:598).
598     return mp_call_function_n_kw(fun, 0, 0, NULL);
BT-28: 0x400e6f15 is in parse_compile_execute (../lib/utils/pyexec.c:103).
103         mp_call_function_0(module_fun);
BT-29: 0x400e71a5 is in pyexec_file (../lib/utils/pyexec.c:560).
560     return parse_compile_execute(filename, MP_PARSE_FILE_INPUT, EXEC_FLAG_SOURCE_IS_FILENAME);
BT-30: 0x400e5dd3 is in TASK_Micropython (mptask.c:339).
339             int ret = pyexec_file(main_py);
---------------------------------------------
Core 0 backtrace
---------------------------------------------
reading backtrace from stdin
0x40097d80:0x3ffbb960 0x40096b45:0x3ffbb980 0x40095473:0x3ffbb9a0 0x401690f1:0x3ffbb9e0 0x400f27fa:0x3ffbba10 0x400f9bfe:0x3ffbba30 0x400d3cf1:0x3ffbba50 0x400967bd:0x3ffbba70
BT-0: 0x40097d80 is in vListInsert (/home/peter/docs/pycom-esp-idf/components/freertos/list.c:162).
162         pxIterator = pxList->xListEnd.pxPrevious;
BT-1: 0x40096b45 is in vTaskPlaceOnEventList (/home/peter/docs/pycom-esp-idf/components/freertos/tasks.c:2901).
2901        vListInsert( pxEventList, &( pxCurrentTCB[ xPortGetCoreID() ]->xEventListItem ) );
BT-2: 0x40095473 is in xQueueGenericReceive (/home/peter/docs/pycom-esp-idf/components/freertos/queue.c:1590).
1590                    vTaskPlaceOnEventList( &( pxQueue->xTasksWaitingToReceive ), xTicksToWait );
BT-3: 0x401690f1 is in rmt_write_items (/home/peter/docs/pycom-esp-idf/components/driver/rmt.c:819).
819     xSemaphoreTake(p_rmt->tx_sem, portMAX_DELAY);
BT-4: 0x400f27fa is in led_set_color (mods/modled.c:109).
109     for (uint32_t bit_mask = MSB_VALUE; bit_mask != 0; bit_mask >>= 1){
BT-5: 0x400f9bfe is in mperror_heartbeat_signal (util/mperror.c:142).
142                 led_set_color(&led_info, false, false);
BT-6: 0x400d3cf1 is in esp_vApplicationIdleHook (/home/robert/Downloads/MicroPython/pycom-esp-idf/components/esp32/freertos_hooks.c:51).
51          if (idle_cb[core][n] != NULL && !idle_cb[core][n]()) {
BT-7: 0x400967bd is in prvIdleTask (/home/peter/docs/pycom-esp-idf/components/freertos/tasks.c:3382).
3382                vApplicationIdleHook();
robert-hh commented 3 years ago

Continuing that test:

robert-hh commented 3 years ago

The core dump is related to the heartbeat RGB led operation, and caused by a race conditions between switching the LED on and off. On heave flash use, the "On" cycle seems to be delayed long enough to collide with the "Off" cycle. ~~The most simple cure seems to be to change lines 136 and 142 from: led_set_color(&led_info, false, false); to led_set_color(&led_info, true, false); With the change, the new led setting is not executed before the previous one is finished. That test is running at the moment;~~ Since the crash is relatively rare, it will take another 24 hours to get sufficient confidence in the result. Another test I did was simply extending the "On" phase from 80ms to 200ms, which was also fine over a period of 24 hours.

Edit: This simple fix did not work :-( Trying now another option. Edit 2: The LoPy4 failed with the initial code after 2.1 Million cycles.

robert-hh commented 3 years ago

Another small code change made it run now for 48 hours and about 1 million cycles. Even if I am not 100% confident that the change if fixing the core reason for the fault, it makes it not happen. The change is again in mperror.c to the function mperror_heartbeat_signal(). I'll make a PR request for that on Monday, after letting the test run for another ~48 hours.

bool mperror_heartbeat_signal (void) {
    if (mperror_heart_beat.do_disable) {
        mperror_heart_beat.do_disable = false;
    } else if (mperror_heart_beat.enabled) {
        if (!mperror_heart_beat.beating) {
            if (mp_hal_ticks_ms_non_blocking() > mperror_heart_beat.off_time) {
                led_info.color.value = MPERROR_HEARTBEAT_COLOR;
                led_set_color(&led_info, true, false);
                mperror_heart_beat.beating = true;
                mperror_heart_beat.on_time = mp_hal_ticks_ms_non_blocking() + MPERROR_HEARTBEAT_ON_MS;
            }
        } else {
            if (mp_hal_ticks_ms_non_blocking() > mperror_heart_beat.on_time) {
                led_info.color.value = 0;
                led_set_color(&led_info, true, false);
                mperror_heart_beat.beating = false;
                mperror_heart_beat.off_time = mp_hal_ticks_ms_non_blocking() + MPERROR_HEARTBEAT_OFF_MS;
            }
        }
    }
    // let the CPU save some power
    return true;
}
geza-pycom commented 3 years ago

@robert-hh : May I ask that did this last change solve the problem?

robert-hh commented 3 years ago

@geza-pycom

May I ask that did this last change solve the problem?

Yes, it did.

geza-pycom commented 3 years ago

Thanks, I am closing this issue as the corresponding PR will be part of an upcoming release.