lancaster-university / codal-microbit-v2

CODAL target for the micro:bit v2.x series of devices
MIT License
43 stars 52 forks source link

Data logging hangs with this sample #141

Open martinwork opened 3 years ago

martinwork commented 3 years ago

MakeCode sample: microbit-datalog.zip Press A to see (in serial): time, readings count, average time to log data, maximum time to log data, and average time for each loop iteration. The same info is sent when full.

The attached hex is built with live, so needs to be loaded into MakeCode beta!

This seems to lock up (the pixel stops flashing) quite often when built with MakeCode beta, and rarely with live. I think the only difference may be that beta is using CODAL v0.2.33.

The timings for the log data block are slightly different between beta and live, and between the initial run after a flash and subsequent runs after reset.

I'm not using WebUSB to flash, but a direct save from MakeCode to the MICROBIT drive. Lock ups seem to occur with both USB and battery power.

martinwork commented 3 years ago

Here's a beta build: microbit-datalog-beta.zip

finneyj commented 3 years ago

oooh, that's curious. Thanks for the report @martinwork.

not that much differnet in the live/beta versions. Just these patches here, designed to workaround KL27 really: https://github.com/lancaster-university/codal-microbit-v2/commit/6bd66c8013cf0109e34a828b0f4eec7ee5e945a5 https://github.com/lancaster-university/codal-microbit-v2/commit/9305a13fa4ec3007f70e5024d37f0965ee485444

I did add in better MUTEX locking of the logging APIs, as that was a little lax in earlier versions: https://github.com/lancaster-university/codal-microbit-v2/commit/d4556e34d0d0c874544e4bafd3ee545cceea1df8

The every() block isn't configured to be re-entrant anyway though right?

finneyj commented 3 years ago

How long does it normally take to exhibit this behaviour @martinwork? I haven't been able to repro it against /beta yet

martinwork commented 3 years ago

The every block currently executes the body then a pause. It has been changed to create a thread for the body, but that doesn't seem to have made it to beta yet.

It does seem to be variable, but much more common with beta. Until this morning it hadn't gone wrong for me with live. I've been trying to make it go wrong more reliably, but decided it was better to post anyway. I was originally working with a version with the serial dump in the forever loop.

Two days ago, it was often hanging as soon as the PC saw the new drive after a flash, and within a minute anyway. Yesterday it didn't seem to happen much. Today it was happening with beta after a minute or so, on USB or battery power. Live usually gets to the on-full handler.

I just flashed from beta. It reached full OK. Pressed reset. Pressed A after 6s and looked away. A few seconds later it has hung.

If it doesn't happen for you then maybe there's something specific here.... I'll keep testing!

finneyj commented 3 years ago

Thanks. I haven't been able to repro it yet, and have reset probably a couple of dozen times.

finneyj commented 3 years ago

p.s. why did we change create a background thread for the body of the every() event?

Wouldn't that get a bit sticky if the body of the code takes longer to run than the period of the event? Could lead to an explosion of background threads?

martinwork commented 3 years ago

https://github.com/microsoft/pxt-microbit/issues/4308 Yes, I was puzzled why it wasn't causing OOM, until I realised the change hasn't got to beta yet, I don't think. That was what I was interested in when I started building the sample.

martinwork commented 3 years ago

The hang is currently not quite 50/50 for me at the moment, but it has just hung on battery power again. I'll try a different micro:bit.

martinwork commented 3 years ago

Added back serial dump to forever loop and a heart for full: microbit-datalog2-beta.zip

New micro:bit. Reached full OK from the flash. Hung after 458 cycles after reset. Both on USB power.

Edit: Reset again - hang after 706 cycles.

finneyj commented 3 years ago

can you try a C++ repro? If we can get a debugger on it, we'll have more information...

finneyj commented 3 years ago

FYI - Mine just stopped too, for the first time. About 1/20 at this end I think :)

martinwork commented 3 years ago

Strange. Yes, I'll see about C++.

finneyj commented 3 years ago

Thanks. Maybe I was just born lucky...

martinwork commented 3 years ago

Here is a C++ sample. It has gone wrong once out of about 10 runs, all on USB power. As it didn't seem to be going wrong at first, I implemented MakeCode style serial routines, though I'm not sure that has had any effect. test-datalog-cpp.zip

This MakeCode version, with a sound on full, completed on first flash, but after reset failed after 3958 cycles, then after 980 cycles on the third attempt: microbit-datalog2-beta.zip

#include "MicroBit.h"

#ifndef MICROBIT_CODAL
#ifdef CODAL_CONFIG_H
#define MICROBIT_CODAL 1
#else
#define MICROBIT_CODAL 0
#endif
#endif

MicroBit uBit;

void togglePixel( int x, int y);

int writeLinePadding = 32;
const char *NEW_LINE = "\r\n";

void serialWriteString( const char *t)
{
    uBit.serial.send( t);
}

void serialWriteLine( const char *text)
{
    serialWriteString(text);
    if (writeLinePadding > 0) {
        int r = ( writeLinePadding - ( strlen(text) + strlen(NEW_LINE)) % writeLinePadding) % writeLinePadding;
        for (int i = 0; i < r; ++i)
            serialWriteString(" ");
    }
    serialWriteString( NEW_LINE);
}

void serialWriteValue( const char *name, int value)
{
    ManagedString m = name ? ManagedString( name) + ManagedString(":") + ManagedString( value) : ManagedString( value);
    serialWriteLine( m.toCharArray());
}

void serialWriteValue( const char *name, double value)
{
    serialWriteValue( name, (int) floor( 0.5 + value));
}

double deltamax = 0;
double deltaavg = 0;
double deltatotal = 0;
double delta = 0;
double now0 = 0;
double now = 0;
double mspercycle = 0;
double count = 0;
double logging = 0;

void debugserial()
{
    serialWriteValue("t", now);
    serialWriteValue("c", count);
    serialWriteValue("a", deltaavg);
    serialWriteValue("m", deltamax);
    serialWriteValue("r", mspercycle);
    serialWriteLine( "");
}

void debuginit()
{
    count = 0;
    logging = 1;
    now0 = 0;
    now = 0;
    delta = 0;
    deltamax = 0;
    deltatotal = 0;
    mspercycle = 0;
}

void debugcalc()
{
    count += 1;
    mspercycle = (now - now0) / count;
    delta = system_timer_current_time() - now;
    deltatotal += delta;
    deltaavg = deltatotal / count;
    deltamax = deltamax >= delta ? deltamax : delta;
}

////////////////////////////////////////////////////////////////

void datalogFull(MicroBitEvent e)
{
    if (logging != 0) {
        logging = 0;
        serialWriteLine("FULL");
        debugserial();
        uBit.display.printChar('F');
        uBit.audio.soundExpressions.play("giggle");
    }
}

void datalog()
{
    debuginit();

    uBit.log.clear(true);
    uBit.log.setSerialMirroring(false);
    uBit.log.setTimeStamp(TimeStampFormat::Milliseconds);

    uBit.messageBus.listen( MICROBIT_ID_LOG, MICROBIT_LOG_EVT_LOG_FULL, datalogFull);

    while (true)
    {
        now = system_timer_current_time();

        if (now0 == 0)
        {
            now0 = now;
        }
        else if ( logging != 0)
        {
          uBit.log.beginRow();

          uBit.log.logData("a", "0");
          uBit.log.logData("b", "0");
          uBit.log.logData("c", "0");
          uBit.log.logData("d", "0");
          uBit.log.logData("e", "0");
          uBit.log.logData("f", "0");
          uBit.log.logData("g", "0");
          uBit.log.logData("h", "0");
          uBit.log.logData("i", "0");
          uBit.log.logData("j", "0");

          uBit.log.endRow();

          debugcalc();
        }

        uBit.sleep(50);
    }
}

void onButtonA(MicroBitEvent e)
{
    debugserial();
}

////////////////////////////////////////////////////////////////

void togglePixel( int x, int y)
{
    uBit.display.image.setPixelValue( x, y, uBit.display.image.getPixelValue( x, y) ? 0 : 255);
}

void forever()
{
    while (true)
    {
        togglePixel( 0, 0);
        uBit.sleep(1000);
        if ( logging != 0)
        {
            debugserial();
        }
    }
}

int  main()
{
    uBit.init();

    create_fiber( forever);
    create_fiber( datalog);

    uBit.messageBus.listen( MICROBIT_ID_BUTTON_A,  MICROBIT_BUTTON_EVT_CLICK, onButtonA);

    release_fiber();
}
finneyj commented 3 years ago

cool - thanks!

When it stopped, did you manage to catch where it was in a debugger @martinwork?

martinwork commented 3 years ago

No. Wasn't in the debugger. Was beginning to think is was a MakeCode problem.

finneyj commented 3 years ago

Managed to catch something like this in a debugger @martinwork. Seems to be stuck trying to format the FLASH. Curious why though - form the last line below, it seems like the FLASH geometry is wrong.... Hopefully just a facet of the new runtime autodetect code...

image

finneyj commented 3 years ago

bah - false alarm... I was just running an old KL27 firmware image. doh.

microbit-carlos commented 2 years ago

@martinwork do you think this is still an issue with latest CODAL and DAPLink 0255 (V2.00 factory release).

martinwork commented 2 years ago

@microbit-carlos Yes, I think it still hangs, but https://github.com/lancaster-university/codal-nrf52/pull/40 resets the I2C and breaks out of the hang.

JohnVidler commented 2 years ago

@martinwork Can you re-run the tests you have for this now that we've merged in 40 in the nrf52 library please?

martinwork commented 2 years ago

I compiled a test with latest master branches. It's the i2c-serial test program, except rather than scroll HANG and stop, it says "hello", sends "HANG" to serial, and continues logging. If left running it will count the number of hangs. source i2c-serial2.zip hex test-i2c-serial2-v2.zip

V2.0 hung after 2200s, then recovered and continued after 13s. V2.2 showed the expected regular serial data loss and hung after 169s, then recovered and continued after 13s.

V2.0 with Interface Version: 0255 (1436bdcc) Typical output:

17292,56,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17308,12,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17324,12,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17340,12,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17444,100,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17468,20,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17484,12,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17500,12,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17556,52,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023

Hang after 2200s, recovered after 13s

2200974,66,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
2200998,18,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
2201030,26,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
2201062,26,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
HANG2215551,13978,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
2215586,29,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
2215622,30,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
2215650,22,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
2215758,102,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023

V2.2 with Interface Version: 0257 (c782a5ba) Typical output (including serial data loss)

17580,28,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17592,8,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17604,8,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17616,8,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17628,8,0,17764,132,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17780,12,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17792,8,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17804,8,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
17836,28,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023

Hang after 169s, recovered after 13s

169284,28,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
169300,12,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
169312,8,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
HANG183528,13709,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
183588,57,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
183605,13,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
183624,16,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
martinwork commented 2 years ago

Having left the test running on V2.0, this morning the micro:bit is stuck showing "F" with the tera term log ending

5745513,35,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
5745557,35,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
FULL 60

I guess it has got stuck because the computer went to sleep.

finneyj commented 2 years ago

Thanks @martinwork @JohnVidler

I think this is as good as we're going to get. As I recall, the I2C thread bottomed out in what appeared to be a silicon bug, where a low level nrf52 event very occasionally gets lost, and doesn't set the appropriate flags in the I2C hardware after a transaction, therefore resulting in the hang. The ~13 sec delay is the associated timeout which then recovers.

Given how rarely this bug surfaces and the fact we can recover, I think we should lower the priority on this one to 'won't fix' but keep live for visibility...

JohnVidler commented 2 years ago

Agreed, especially now that we recover in the event of a hang.

I'm going to demote this and mark as a wontfix, along with a new silicon-bug tag, as ultimately this is likely the issue.

microbit-carlos commented 2 years ago

If we really think this is a silicon bug, then it's important to report this back to Nordic so that they can provide a better workaround.

The only I2C related errata so far is the one at 400 KHz frequency: https://infocenter.nordicsemi.com/index.jsp?topic=%2Fstruct_nrf52%2Fstruct%2Fnrf52833_errata.html

@JohnVidler this is a good opportunity to get a nordic dev zone account set up, as it won't be the last time you'll have to report issues there.

Of course, this can wait until after the CODAL release, but let's keep this open until we've submit it to Nordic.