lancaster-university / codal-microbit-v2

CODAL target for the micro:bit v2.x series of devices
MIT License
43 stars 52 forks source link

Erratic output when reading multiple analog pins #81

Closed komodo108 closed 3 years ago

komodo108 commented 3 years ago

Attached is the source & compiled binaries for a minimum C++ & Makecode example demonstrating the issue.

To demonstrate the issue, P0 should be connected to +3.3V while P1 should be connected to ground. Expected output would be a stream of ~0. However, in both programs we can observe regular erratic jumps in output to ~1024.

In my testing I have found that:

martinwork commented 3 years ago

Thanks @komodo108 As you say, this happens with sensors wired to the pins, the same as with direct connections to GND and 3V or floating. P1's value appears at P0 too, but not necessarily at the same time. It still happens when serial isn't involved. With P2 added into the situation, it looks like P0 affects P1 only, P1 affects P2 only, and P2 affects P0 only, even if P2 is first read before P1.

#include "MicroBit.h"

MicroBit uBit;

void test()
{
    while(true) {
        int p0 = uBit.io.P0.getAnalogValue();
        int p1 = uBit.io.P1.getAnalogValue();
        //int p2 = uBit.io.P2.getAnalogValue();

        uBit.display.image.setPixelValue( 4, 0, p0 ? 255 : 0);
        uBit.display.image.setPixelValue( 4, 3, p1 ? 255 : 0);
        //uBit.display.image.setPixelValue( 4, 4, p2 ? 255 : 0);

        if ( p0 != 1023)
          uBit.serial.printf("P0=%d\n", p0);

        if ( p1 != 0)
          uBit.serial.printf("P1=%d\n", p1);

        //if ( p2 != 0)
        //  uBit.serial.printf("P2=%d\n", p1);

        uBit.sleep(10);
    }
}

int main() {
    uBit.init();

    create_fiber(test);
    release_fiber();
}
martinwork commented 3 years ago

It looks like the bad values are in the dma buffer, not generated by a demux error.

finneyj commented 3 years ago

Thanks both @martinwork @komodo108.

Very interesting. I too thought the demux code would be the most likely culprit... I'm afraid we definitely need to keep the demux functionality as there is only one ADC on the nrf52833 - and it's needed for the microphone input in addition to the edge connector analogue read operations.

Based on a quick code review, we are currently requesting a DMA interrupt when the next sample completes after a user request for an instantaneous sample: https://github.com/lancaster-university/codal-nrf52/blob/master/source/NRF52ADC.cpp#L215

My guess is that this is somehow stalling the DMA pipeline, such that future DMA data doesn't have the correct channel interleaving. We could try an algorithm that is less invasive...

If we reset the DMA buffers to some invalid value before use (say 0xFFFF), then when we get a request for a discrete sample, we could simply scan the DMA buffer for the last valid sample with an offset matching the relevant channel then return that value... the advantage being that we can still do that while the DMA buffer is filling up, and without needing to trigger an ADC interrupt.

martinwork commented 3 years ago

In https://github.com/lancaster-university/codal-nrf52/pull/24, memory is leaking in the extra DMA buffer reference. Copying the data stops the leak. https://github.com/lancaster-university/codal-nrf52/pull/24/files#diff-cb9288c49a8d6172ca7e0516425f419e57ddd259465939fcae183888c6634ea6R812

What am I doing wrong @finneyj?

I'm using the test below, with a huge malloc to cut the OOM time to a few seconds. It assumes P0 -> 3V, P1 & P2 -> GND

#include "MicroBit.h"

MicroBit uBit;

void test()
{
    int *reduce_free_space = (int *) malloc( 0x19600);

    while(true) {
#if DEVICE_DMESG_BUFFER_SIZE > 0
        CODAL_TIMESTAMP t0 = system_timer_current_time_us();
#endif

        int p0 = uBit.io.P0.getAnalogValue();
        int p1 = uBit.io.P1.getAnalogValue();
        int p2 = uBit.io.P2.getAnalogValue();

        reduce_free_space[0] = p0;

#if DEVICE_DMESG_BUFFER_SIZE > 0
        CODAL_TIMESTAMP t1 = system_timer_current_time_us();
        CODAL_TIMESTAMP td = t1 - t0;
        if ( td > 24)
            DMESG( "TD %d ", (int) td);
#endif

        uBit.display.image.setPixelValue( 4, 0, p0 ? 255 : 0);
        uBit.display.image.setPixelValue( 4, 3, p1 ? 255 : 0);
        uBit.display.image.setPixelValue( 4, 4, p2 ? 255 : 0);

#if DEVICE_DMESG_BUFFER_SIZE > 0
        if ( p0 < 1023)
            DMESG("P0=%d", p0);

        if ( p1 > 0)
            DMESG("P1=%d", p1);

        if ( p2 > 0)
            DMESG("P2=%d", p2);
#endif

        if ( p0 < 512)
        {
            uBit.sleep(5000);
            microbit_panic(999);
        }

        if ( p1 > 512)
        {
            uBit.sleep(5000);
            microbit_panic(998);
        }

        if ( p2 > 512)
        {
            uBit.sleep(5000);
            microbit_panic(997);
        }

        uBit.sleep(10);
    }

    while(true) {
        uBit.sleep(1000);
    }
}

void togglePixel( int x, int y)
{
    uBit.display.image.setPixelValue( x, y, uBit.display.image.getPixelValue( x, y) ? 0 : 255);
}

void forever()
{
    while (true)
    {
        togglePixel( 0, 0);
        uBit.sleep(1000);
    }
}

int main()
{
    uBit.init();

    create_fiber(test);
    create_fiber(forever);

    release_fiber();
}
finneyj commented 3 years ago

Thanks for looking into this @martinwork! I think you were on the right lines with your branch.

I think the mistake here is that you're trying to manage a ManagedBuffer... they're pretty independent beasts, and best left to do that for themselves. :)

I'd be very happy to give you a walk through the detail sometime, but essentially they are designed to be efficiently passed by value, and when you do so, they maintain their own reference count and safe garbage collection at the end. The same is true for ManagedString. In other words, just pretend they are type like you would see in say Java or C# (or any other statically type garbage collected language), and you won't go far wrong. :)

So I was thinking of an implementation a bit more like this: https://github.com/lancaster-university/codal-nrf52/commit/ed1b1b304d0335e9dda5f9c6a432e3917fffa477

I'm tracking on this feature branch: https://github.com/lancaster-university/codal-nrf52/tree/adc-live-demux

Please note I've also added this patch to the main branch codal-core to make this sort of thing a bit more efficient: https://github.com/lancaster-university/codal-core/commit/f1dd42498e8efb71a7601b155c0ce7c30f722c57

I've run a few quick tests using this program, which seem ok, but we should really test more extensively with the getAnalogueValue() and streams APIs.

void adcTest()
{
    while(1)
    {
        int a1 = uBit.io.P0.getAnalogValue();
        int a2 = uBit.io.P1.getAnalogValue();
        int a3 = uBit.io.P2.getAnalogValue();

        DMESG("[P0: %d] [P1: %d] [P2: %d]", a1, a2, a3);
        uBit.sleep(250);
    }
}

Do you fancy running your tests against this @martinwork ? Also, perhaps @JoshuaAHill could test the latest audio input pipeline against this code to make sure the streaming interfaces still work ok (I think they will). I'll be tagging a release tomorrow for @microbit-carlos. Would be good if we could get this in and pulled through to MakeCode and micropython with the other I/O related fixes. :)

martinwork commented 3 years ago

@finneyj When I first tried a simple assignment of the buffer, I was getting 030 panics.

finneyj commented 3 years ago

Thanks @martinwork.

Interesting. 030 is HEAP_ERROR rather than out of memory...

We'd have to look at that original code to be sure, but perhaps could be caused by a reference count leak due to the ADC IRQ racing a ManagedBuffer assignment? Doesn't look like it's protected: https://github.com/lancaster-university/codal-core/blob/master/source/types/ManagedBuffer.cpp#L182

nor the copy constructor: https://github.com/lancaster-university/codal-core/blob/master/source/types/ManagedBuffer.cpp#L108

Maybe that's what you were seeing?

finneyj commented 3 years ago

If so, we'll likely see it in the branch I shared above too... and disabling interrupts whilst performing this operation would fix it? https://github.com/lancaster-university/codal-nrf52/blob/adc-live-demux/source/NRF52ADC.cpp#L655

...I think. :)

martinwork commented 3 years ago

When I saw 030, I thought it must be because the buffer was being freed in an ADC interrupt half way through my buffer constructor, so I tried disabling IRQs for the assignment, then I started seeing 020. Thinking this was somehow caused by an ADC interrupt during my buffer destructor, I tried the monster implementation with a pointer, but this didn't get rid of the 020s.

I've been running tests and everything seems OK most of the time, except the first readings are 0. But sometimes I'm seeing 0, 0,1023 instead of 1023, 0, 0. Haven't figured out why yet...

Oops! I've just looked down to see 030 after a few hundred iterations of this as a fiber:

void adcTest()
{
    int *reduce_free_space = (int *) malloc( 0x19000);

    while(1)
    {
        int a1 = uBit.io.P0.getAnalogValue();
        int a2 = uBit.io.P1.getAnalogValue();
        int a3 = uBit.io.P2.getAnalogValue();

        reduce_free_space[0] = a1;

        DMESG("[P0: %d] [P1: %d] [P2: %d]", a1, a2, a3);
        uBit.sleep(250);
    }
}
martinwork commented 3 years ago

I re-cloned microbit-v2-samples, built the test, then did git checkout adc-live-demux in the libraries\codal-nrf52 folder, and re-built the test. Did I miss anything?

finneyj commented 3 years ago

nope - that sounds right @martinwork. Thanks. Sounds like we have confirmation of the bug you found!

I just pushed this onto the adc-live-demux branch....

can you pull this through and see if it helps?

/**
 * Gets the currently active DMA buffer (the one currently being filled)
 * @return the currently active DMA buffer
 */
ManagedBuffer NRF52ADC::getActiveDMABuffer()
{
    ManagedBuffer b;

    target_disable_irq();
    b = dma[activeDMA];
    target_enable_irq();

    return b;
}
finneyj commented 3 years ago

@martinwork I've been running some tests on the latest branch too. Thought I'd share in case you were looking too.

Three observations, based on running the following code, and hardware configured to have:

P0: GND P1: Vcc/2 P2: Vcc

void
printFreeHeapMemory()
{
    while(1)
    {
        int f = device_free_heap();
        DMESG("FREE HEAP MEMORY: %d", f);

        //wait for 1 minute
        uBit.sleep(60000);
    }
}

void adcTest()
{
    DMESG("ADC TEST START");

    create_fiber(printFreeHeapMemory);

    while(1)
    {
        int a1 = uBit.io.P0.getAnalogValue();
        int a2 = uBit.io.P1.getAnalogValue();
        int a3 = uBit.io.P2.getAnalogValue();

        if (a1 > 25 || a2 <487 || a2 > 537 || a3 < 998 || a3 > 1024)
            DMESG("[P0: %d] [P1: %d] [P2: %d]", a1, a2, a3);

        uBit.sleep(10);
    }
}

1) I can't make this version of the code sporadically return the wrong value. Note I've really clocked up the rate at which we read the pins too. I ran this for about 50mins, with no incorrect values.

2) I can make this version deterministically return the wrong values. :) I don't know why, but when the code start (i.e. power on/reset) sometimes it seems to put the channels in the incorrect order. They then consistently report the wrong values until reset. At which point the behaviour may or may not revert to the correct behaviour.

3) There is some odd but inconclusive memory behaviour going on. The output from the heap trace looks likes this:

Time Output
1 FREE HEAP MEMORY: 104664
2 FREE HEAP MEMORY: 104596
3 FREE HEAP MEMORY: 104596
4 FREE HEAP MEMORY: 104076
5 FREE HEAP MEMORY: 104076
6 FREE HEAP MEMORY: 104076
7 FREE HEAP MEMORY: 103036
8 FREE HEAP MEMORY: 102516
9 FREE HEAP MEMORY: 102516
10 FREE HEAP MEMORY: 102516
11 FREE HEAP MEMORY: 102516
12 FREE HEAP MEMORY: 102516
13 FREE HEAP MEMORY: 102516
14 FREE HEAP MEMORY: 102516
15 FREE HEAP MEMORY: 102516
16 FREE HEAP MEMORY: 102516
17 FREE HEAP MEMORY: 102516
18 FREE HEAP MEMORY: 102516
19 FREE HEAP MEMORY: 102516
20 FREE HEAP MEMORY: 102516
21 FREE HEAP MEMORY: 102516
22 FREE HEAP MEMORY: 102516
23 FREE HEAP MEMORY: 102516
24 FREE HEAP MEMORY: 102516
25 FREE HEAP MEMORY: 102516
26 FREE HEAP MEMORY: 102516
27 FREE HEAP MEMORY: 102516
28 FREE HEAP MEMORY: 102516
29 FREE HEAP MEMORY: 102516
30 FREE HEAP MEMORY: 102516
31 FREE HEAP MEMORY: 102516
32 FREE HEAP MEMORY: 102516
33 FREE HEAP MEMORY: 102516
34 FREE HEAP MEMORY: 102516
35 FREE HEAP MEMORY: 102516
36 FREE HEAP MEMORY: 102516
37 FREE HEAP MEMORY: 102516
38 FREE HEAP MEMORY: 102516
39 FREE HEAP MEMORY: 102516
40 FREE HEAP MEMORY: 102516
41 FREE HEAP MEMORY: 102516
42 FREE HEAP MEMORY: 102516
43 FREE HEAP MEMORY: 102516
44 FREE HEAP MEMORY: 102516
45 FREE HEAP MEMORY: 102516
46 FREE HEAP MEMORY: 102516
47 FREE HEAP MEMORY: 102516
48 FREE HEAP MEMORY: 102516
49 FREE HEAP MEMORY: 102516
50 FREE HEAP MEMORY: 102516
51 FREE HEAP MEMORY: 102516

The drop in free memory corresponds is 520 bytes. That's consistent with a 512 byte DMA buffer and associated overhead. However, notice we only seem to "lose" 4 in total. It seems to stabilise at 102516... So could be those buffers actually sitting somewhere for a reason (although I'm not sure what!)

So conclusions:

finneyj commented 3 years ago

OK.... Just made a very interesting patch that seems to resolve the issue with the out of order channels. I think this has the effect of resetting the multiplexing hardware on the ADC, so it doesn't then start out of order later. Looks weird in code though!

https://github.com/lancaster-university/codal-nrf52/commit/9f8bb9f6107da7aaada8df07f0f5d7d8c14f0a28

finneyj commented 3 years ago

nope - not that... and if you spin that loop without any delays, we're definitely burning memory very quickly: image

finneyj commented 3 years ago

OK I've traced the HEAP_ERROR @martinwork. Caused by a data race condition on the ManagedBuffer reference counters. Fixed in this patch. When running this, I don't see any memory leakage or 030 panics. https://github.com/lancaster-university/codal-core/commit/caee1151f0fdfb3a6678f9c269e7b33c9163bfea

Also, I think that this patch may also have fixed the channel reordering bug: https://github.com/lancaster-university/codal-nrf52/commit/b7808eef203bc6635ed927b56e1cb2984d42aeda

The channel reordering bug is very hard to reproduce though, so if you have any tests that might trigger it, could you try them out on the HEAD of adc-live-demux branch?

martinwork commented 3 years ago

OK, I was about to report that after https://github.com/lancaster-university/codal-microbit-v2/issues/81#issuecomment-798916876, I left it running on batteries and it was showing 020 last thing last night.

I'll try the new version.

microbit-carlos commented 3 years ago

@martinwork are you able to run your tests today? We want to create a new CODAL tag ad wondering if we are waiting for this fix to make it in or out.

martinwork commented 3 years ago

Yes, I am testing now, trying to create a test to leave running.

The first readings are still always zero and I have just seen the values out of order again (running with P0->3v, P1&2->GND): [P0: 0] [P1: 0] [P2: 0]; [P0: 0] [P1: 1023] [P2: 0]. Previously, I was seeing 0,0,1023. I could be wrong (it seems to happen when I don't expect it!) but maybe it's more common on the first run after a flash?

martinwork commented 3 years ago

I'm still re-cloning as in https://github.com/lancaster-university/codal-microbit-v2/issues/81#issuecomment-798915954. Please let me know, @finneyj, if I need to do more to get all the patches.

martinwork commented 3 years ago

It's interesting that 3 weeks ago, I could malloc 0x19600, but now I can only reserve 0x19000.

finneyj commented 3 years ago

Thanks @martin. If you have HEAD revision of codal-core and HEAD revision of codal-nrf52#adc-live-demux then you should be good.

I think the difference in memory alloc is perhaps that we're now holding onto one more DMA buffer on average? (due to the getSample API?)

finneyj commented 3 years ago

@martinwork

"The first readings are still always zero and I have just seen the values out of order again (running with P0->3v, P1&2->GND): [P0: 0] [P1: 0] [P2: 0]; [P0: 0] [P1: 1023] [P2: 0]. Previously, I was seeing 0,0,1023. I could be wrong (it seems to happen when I don't expect it!) but maybe it's more common on the first run after a flash?"

BAH! This is precisely the thing I was seeing... I thought the latest patch had fixed that. I couldn't repro on my setup.

martinwork commented 3 years ago

I'm just re-cloning, in case I was testing the wrong version.

finneyj commented 3 years ago

thanks @martinwork

martinwork commented 3 years ago

analog-problem.zip

This is my test... currently running test() rather than adcTest(). Right now it's going wrong every time.

finneyj commented 3 years ago

good to know - thanks @martinwork .

It's super deterministic. Once it latches onto the incorrect mux pattern, it sticks with it!

finneyj commented 3 years ago

p.s. heh - great minds and all that @martinwork. :)

This is the test harness I wrote last night:

void adcTest()
{
    DMESG("ADC TEST START");

    //create_fiber(printFreeHeapMemory);

    while(1)
    {
        int a1 = uBit.io.P0.getAnalogValue();
        int a2 = uBit.io.P1.getAnalogValue();
        int a3 = uBit.io.P2.getAnalogValue();

        if (uBit.buttonA.isPressed())
            DMESGF("[P0: %d] [P1: %d] [P2: %d]", a1, a2, a3);

        if (uBit.buttonB.isPressed())
            DMESGF("FREE HEAP MEMORY: %d", device_free_heap());

        uBit.display.image.setPixelValue(0,0, (a1 > 25) ? 255 : 0);
        uBit.display.image.setPixelValue(2,0, (a2 < 487 || a2 > 537) ? 255 : 0);
        uBit.display.image.setPixelValue(4,0, (a3 < 998 || a3 > 1024) ? 255 : 0);
    }
}

I'll try to repro yours on my hardware

martinwork commented 3 years ago

I'm assuming the data is out-of-order in the DMA buffer, as it used to be. There seem to be quite a few questions on the Nordic forum about this. But I noticed before that enabling and disabling channels puts enabledChannels out of sync with the data in dmaBuffer, so the next demux in EVENTS_END/STOPPED or getSample() will record junk (and maybe the next because of the END/START link?).

finneyj commented 3 years ago

Good point w.r.t. enabledChannels being out of sync for the first cycle. We should probably defer that until servicePendingEvents(). I'm not sure why that would cause us to be systematically for all following cases though. :-/

Maybe we should take a peep inside the DMA buffer, to verify that the data is indeed out of order (pretty sure it is, but there's nothing like checking!)

martinwork commented 3 years ago

I'm not sure why that would cause us to be systematically for all following cases though. :-/

Me neither! Just seem appropriate to mention it. I guess the first readings are all zero because isEnabled() fails.

finneyj commented 3 years ago

The DMA buffer is definitely out of order. I just put a DMESG into getSample() and managed to capture cases of correct and out of order output:

[D: 4]
[P0: 0] [P1: 513] [P2: 1019]

[D: 8220]
[P0: 511] [P1: 1018] [P2: 0]

The [D:] line is displaying the contents of the first 16 bit value in the DMA buffer...

martinwork commented 3 years ago

But those were on different runs, I guess? Could this be basically the same issue that was causing the value swaps, but occurring once when all the channels are added? Do you think the timer/TASKS_SAMPLE PPI link is getting going but at some point the buffering is reset?Maybe, the ADC enable, start, and PPI link configuration should be gathered in a sequence and all torn down in order to stop for configuration changes. Maybe we need a test sample that switches one input between analogue and digital?

finneyj commented 3 years ago

Yes, those were on different runs @martinwork.

I think you're right - this is the same "feature" we were seeing before, but not constrained (I think) to when channels are added/removed. I'm feeling pretty confident it's to do with when we perform a STOP/START on the ADC.

I've been thinking on similar lines. My latest test looks like this in the STOP IRQ handler:

        if(NRF_SAADC->EVENTS_STOPPED){
            NRF_SAADC->EVENTS_STOPPED = 0;

            if ((status & NRF52ADC_STATUS_PERIOD_CHANGED) || (channelsChanged > 0 && enabledChannels > 0))
            {
                status &= ~NRF52ADC_STATUS_PERIOD_CHANGED;

                timer.disable();
                NRF_SAADC->ENABLE = 0;
                setSamplePeriod(samplePeriod);
                NRF_SAADC->RESULT.PTR = NRF_SAADC->RESULT.PTR;
                NRF_SAADC->RESULT.MAXCNT = NRF52ADC_DMA_ALIGNED_SIZED(enabledChannels);  
                NRF_SAADC->ENABLE = 1;
                NRF_SAADC->TASKS_START = 1;
                while(!NRF_SAADC->EVENTS_STARTED);
                timer.enable();
            }
        }

Which really should flush everything you would think, but still has little effect!

finneyj commented 3 years ago

OK - this looks like it's going to take more time to figure out, and more testing to give ourselves confidence that we have a stable solution.

I need to get a tag out today, so i'm afraid we're going to have to schedule this one for the next release...

I'm keen to find this, so let's keep investigating though!

martinwork commented 3 years ago

I added some trace in the IRQ handler. When the STOPPED event arrives, END is also true, and STARTED becomes true by the time it gets to the re-config code, because END has already triggered START. I can't find anything that says what ENABLE=0 does when a sample task has already been triggered or how to get it to restart with the first sensor. The buffer at PTR does seem to get re-filled from the start, but with the wrong values.

It occurred to me that my PR was working other than the OOM issue, which you have now fixed, so it might be worth comparing. I've changed it back to a simple assignment of the buffer and have it on test.

This has some of the trace I used.

void NRF52ADC::irq()
{
    DMESGN( "irq ");
    DMESGN( NRF_SAADC->EVENTS_STARTED       ? "S" : " ");
    DMESGN( NRF_SAADC->EVENTS_STOPPED       ? "T" : " ");
    DMESGN( NRF_SAADC->EVENTS_END           ? "E" : " ");
    DMESGN( NRF_SAADC->EVENTS_DONE          ? "D" : " ");
    DMESGN( NRF_SAADC->EVENTS_RESULTDONE    ? "R" : " ");
    DMESG(  NRF_SAADC->EVENTS_CALIBRATEDONE ? "C" : " ");

    if (NRF_SAADC->EVENTS_END || NRF_SAADC->EVENTS_STOPPED)
    {
        DMESG( "END/STOPPED");
        // Snapshot the buffer we just received into
        int completeBuffer = activeDMA;
        int offset = 0;
        int channelsChanged = 0;

        // Flip to our other buffer.
        activeDMA = (activeDMA + 1) % 2;

#if DEVICE_DMESG_BUFFER_SIZE > 0
        //int a  = (int)NRF_SAADC->RESULT.AMOUNT;
        //int b0 = (int)*(int16_t *)&(dma[completeBuffer][0]) / 16;
        //int b1 = (int)*(int16_t *)&(dma[completeBuffer][2]) / 16;
        //if ( ( a % enabledChannels) || b0 < 1000 || b1 > 20)
        //    DMESG( "%d %d %d", a, b0, b1);
#endif

        // truncate the buffer if we didn't fill it...
        dma[completeBuffer].truncate(NRF_SAADC->RESULT.AMOUNT*2);

        // Process the buffer.
        // TODO: Consider moving this outside the interrupt context...

        for (int channel = 0; channel < NRF52_ADC_CHANNELS; channel++)
        {
            if (NRF_SAADC->CH[channel].PSELP)
                channels[channel].demux(dma[completeBuffer], offset++, enabledChannels, softwareOversample);

            // Callback 
            channelsChanged += channels[channel].servicePendingRequests();
        }

        // Indicate we've processed the interrupt
        if(NRF_SAADC->EVENTS_END)
            NRF_SAADC->EVENTS_END = 0;

        // If we've been stopped, the number of active channels may have changed. 
        // Recalculate DMA buffer size, ensure the our target DMA buffer sample count is a multiple of the channel count.
        // Recalculate OVERSAMPLE and timer settings accordingly.
        if(NRF_SAADC->EVENTS_STOPPED){
            NRF_SAADC->EVENTS_STOPPED = 0;

            if ((status & NRF52ADC_STATUS_PERIOD_CHANGED) || (channelsChanged > 0 && enabledChannels > 0))
            {
                DMESG( "CHANGED with STARTED = %d", (int) NRF_SAADC->EVENTS_STARTED);
                status &= ~NRF52ADC_STATUS_PERIOD_CHANGED;

                timer.disable();
                NRF_SAADC->ENABLE = 0;
                NRF_SAADC->RESULT.PTR = NRF_SAADC->RESULT.PTR;
                NRF_SAADC->RESULT.MAXCNT = NRF52ADC_DMA_ALIGNED_SIZED(enabledChannels);  
                setSamplePeriod(samplePeriod);
                NRF_SAADC->ENABLE = 1;
                NRF_SAADC->TASKS_START = 1;
                while(!NRF_SAADC->EVENTS_STARTED);
                timer.enable();
            }
        }

        NRF_SAADC->EVENTS_RESULTDONE = 0;

        // Stop sampling if no channels are active.
        if (enabledChannels == 0)
            NRF_SAADC->ENABLE = 0;
    }

    if (NRF_SAADC->EVENTS_STARTED)
    {
        DMESG( "STARTED");
        int nextDMA = (activeDMA + 1) % 2;
        dma[nextDMA] = allocateDMABuffer();
        NRF_SAADC->RESULT.PTR = (uint32_t) &dma[nextDMA][0];
        NRF_SAADC->EVENTS_STARTED = 0;
    }
}
finneyj commented 3 years ago

Thanks @martinwork.

Fully agree - I'm also pretty confident that's where the issue resides...

Definitely worth looking at your earlier branch. This bug is so elusive we need all the data points we can get!

finneyj commented 3 years ago

Hmm... I have an idea.

Currently, channels get added whilst the ADC is running, but then the ADC cycles itself to resync. I wonder if this is causing the desynchronization in the DMA buffer...

I've worked up a patch onto the HEAD of the adc-live-demux channel. This does seemingly work on both my tests and yours... How does it look to you @martinwork ?

martinwork commented 3 years ago

That makes sense to me. I tried something similar on another attempt to make it work.

Either it's getting late or I just entered the twilight zone...

With this source: analog-problem.zip codal.json.zip

I built these hex files: analog-problem-hex.zip

Connected P0->3V; P1->GND; P2->GND.

Running adcTest() (as in the source zip), it seems solid, except the first values are zero. I didn't leave it running for long.

Running test(), I always get panic 999 because the first value tested is wrong.

Running test() without the debug code that calls system_timer_current_time_us(), it did panic 999 immediately once on start-up after a flash, but usually just has slightly variable pin values. But after a few minutes running, I just got a panic 997 because P2 is 4092!

Terminal logs...

Running adcTest() - solid

relocate vtor to 0x0 -> 0x20002200 0x20002200
[P0: 0] [P1: 0] [P2: 0]
[P0: 0] [P1: 1023] [P2: 0]
[P0: 0] [P1: 1023] [P2: 0]
[P0: 0] [P1: 1023] [P2: 0]
[P0: 0] [P1: 1023] [P2: 0]
[P0: 0] [P1: 1023] [P2: 0]

Running test() - panic 999

relocate vtor to 0x0 -> 0x20002200 0x20002200
P0=0
P1=1023

Running test with no time calls (initial flash and after a reset) - variable values

relocate vtor to 0x0 -> 0x20002200 0x20002200
P0=1021
P0=1021
P1=1
P2=1
P0=1021
P2=1
P0=1021
P2=1
P2=1
P0=1022
P2=1
P2=3
relocate vtor to 0x0 -> 0x20002200 0x20002200
P2=2
P2=1
P0=1022
P0=1022
P0=1022
P2=2
P2=1
P0=1021
P2=1
P1=1
P0=1018
P0=1022
P0=1022

After a few minutes running:

P0=1016
P0=1019
P0=1021
P2=2
P0=1022
P2=1
P2=1
P1=2
P2=3
P2=4
P2=1
P2=5
P2=4092
finneyj commented 3 years ago

Thanks @martinwork.

That looks promising! if it's just the first values that are wrong, that would be real progress. Do you think we've reached the point where we can conclude that yet?

I've seen the occasional crazy high value too - very rare, but there! I think it's caused by a negative sample. i.e. the data is below the negative rail. The ADC used signed 2s complement values for the samples, so a crazy high value kinda makes sense. However, I think i'd trapped that in the code - really not sure why it doesn't work:

https://github.com/lancaster-university/codal-nrf52/blob/adc-live-demux/source/NRF52ADC.cpp#L245

unless it's a 14bit 2s complement number?? Surely Nordic aren't that evil...

martinwork commented 3 years ago

When I said "the first value tested is wrong"... the first values are always zero, so not tested. In test(), the rest are consistently wrong. In test() without calls to calls system_timer_current_time_us(), I got a different high value: 4094.

My last comment was meant to say that minor variations in the code, like calling system_timer_current_time_us() or modifying the DMESG calls are causing different behaviour for me. Could this be my compiler set-up? I'm going to try the docker build.

I tested my PR with test() (minus the memory limitation). There can't have been high values on the grounded pins or swapped values. It would have triggered panics. It just eventually ran out of memory after many hours.

Here's a new test. For me, the analogue values are usually wrong, and micro:bit stops responding when it switches P2 from analogue to digital.

void adcTest2()
{
    int count = -1;

    while(1)
    {
        int a1 = uBit.io.P0.getAnalogValue();
        int a2 = uBit.io.P1.getAnalogValue();
        int a3 = count >= 100 ? uBit.io.P2.getAnalogValue() : uBit.io.P2.getDigitalValue();

        DMESG("[P0: %d] [P1: %d] [P2: %d] [count: %d]", a1, a2, a3, count);

        count = ( count + 1) % 200;

        uBit.sleep(0);
    }
}
finneyj commented 3 years ago

ah, ok @martinwork. You got my hopes up there! if the values are still consistently wrong in at least one of your tests, then we still have the same problem... sigh.

I'm running out of ideals about how to reset this ADC fully!

finneyj commented 3 years ago

Maybe if we

martinwork commented 3 years ago

Yes, I'm thinking, from outside the IRQ handler: stop the ADC, wait for it to stop, change channels or other parameters, start the ADC. The ADC can't be fully stopped without disabling the END/START and timer/SAMPLE PPI links.

I've just experimented with using getEnabledChannel from https://github.com/martinwork/codal-nrf52/blob/e5dd1f4b7cd92b1775eb25de74f1b10d8ab283e9/source/NRF52ADC.cpp#L696. The values were correct. Maybe it helps to wait for the ADC to get fully restarted with the new channel configuration, before asking for another change. But that's really just a hack, the waiting is slow, and it doesn't fix what happens when a channel is switched from analogue to digital. With some of the last few commits from adc-live-demux reverted, switching a pin to digital didn't crash, but the remaining analogue channels became swapped again.

I reckon this same problem will keep coming back as long as the configuration is changed inside the IRQ handler with the PPI links running.

finneyj commented 3 years ago

Agree - let's refactor using a "scorched earth" policy...

Do you fancy this one, or shall I?

martinwork commented 3 years ago

Let me! I've got other stuff I have been neglecting, but I'll aim to get this done over the weekend, OK?

finneyj commented 3 years ago

Go for it!

martinwork commented 3 years ago

@finneyj I've made a branch off your adc-live-demux branch, with one commit https://github.com/martinwork/codal-nrf52/commit/a7268d64c09929d9081a918845a1082794e7d26d

I think this fixes the large/negative value problem: https://github.com/martinwork/codal-nrf52/blob/a7268d64c09929d9081a918845a1082794e7d26d/source/NRF52ADC.cpp#L205

This is to ensure the first getSample() has a valid value to return: https://github.com/martinwork/codal-nrf52/blob/a7268d64c09929d9081a918845a1082794e7d26d/source/NRF52ADC.cpp#L750

But this test, which repeatedly randomizes the analogue channels, will eventually see a round of swapped values. analog-problem.zip

And he first values read after a start can be wrong, as if a settle time is needed.

I haven't done anything with setGain. Does a change in gain need to restart the sampling?

finneyj commented 3 years ago

Thanks @martinwork.

Branch looks good. I'd be happy to merge that into adc-live-demux if you like? Let me know.

Disappointing it still doesn't completely resolve the problem though. I don't know if changes to gain require a restart - I didn't think so (as it doesn't change the number of channels in the ADC, or the MUX ordering). I guess we could try it though.

martinwork commented 3 years ago

I've added a new commit. With this change, my test has been running for an hour now - much longer than before. I'll create a PR to your adc-live-demux.

finneyj commented 3 years ago

oooh - great! What did you change @martinwork ?