Complete lockup and possible loss of all presets when I2C occurs during preset save

csboling commented 4 years ago

First reported here: https://llllllll.co/t/kria-freezing-when-saving-presets/34269 though I am quite surprised this has not been seen more commonly.

Steps to reproduce:

I have only Teletype and Ansible on the I2C bus. Adding more devices it seems to be possible to reproduce this at slower clock speeds.
Program some sequence in Kria and save it to the first preset slot. Power cycle to confirm that the sequence was saved to flash.
In TT script 1, I have:
```
KR.CLK 1
X KR.CV 1
```
I would guess other scripts can reproduce the issue, this was just the first thing I isolated that could reliably reproduce the error.
Feed a clock into TT input 1. Attempt to save Ansible presets, gradually increasing the clock speed. At high clock speeds, Ansible will crash, though TT keeps running. Upon restart, sometimes Ansible loads the existing preset, but in some cases all presets are lost. The module takes a while to start up in the latter case -- consistent with the amount of time to set the "fresh" flash state after first reflashing the firmware.

csboling commented 4 years ago

Perhaps a simple protection could be stopping I2C listening for the duration of a preset write (or even for all flash writes?), though I can't yet account for why this can cause flash corruption, and specifically corrupting the "fresh" sentinel byte at the start of nvram.

I actually am not certain how best to "stop listening" for i2c, there doesn't seem to be an obvious libavr32 routine for this. Maybe can bind a reserved address like 0x03? Probably cleaner to just disable all interrupts.

scanner-darkly commented 4 years ago

a flag would have to be added to any functions that are responsible for i2c, maybe even on libavr32 level. i can take a dig. and yeah regardless, disabling all interrupts is a good idea since we have a mechanism for that.

amazing find and great that it’s reproducible. thank you for investigating this!

csboling commented 4 years ago

Meant to add that it does not matter whether TT clocking for Kria is enabled or not. I have also tried:

wrapping the preset commit code here in irqs_pause / irqs_resume
commenting out all the code in the kria handler

and these did not work. However changing the follower address to 0x03 (reserved as "for future use" in the I2C spec) for the duration of the write does appear to fix the crash.

Possible explanation: I seem to recall reading that I2C interrupts were configured with top priority / possibly unmaskable status, so irqs_pause would not actually cover them? If too many i2c interrupts occur (un-acked? should always just ack the interrupt and put a message in the event queue but I'm not immediately finding the ISR code), perhaps the processor hard faults, and if this occurs during a flash write perhaps the whole block of flash could get corrupted? Since this preset save procedure touches parts of flash close to the beginning, this could wind up trashing the sentinel byte.

I can also confirm that the same crash behavior is seen if switching apps while hammering the i2c bus, because the current active app is saved after each app change -- this state is also stored quite close to the front of nvram.

So a possible patch, at least until another resolution can be found, is:

diff --git a/src/ansible_grid.c b/src/ansible_grid.c
index 35168e1..70ed55a 100644
--- a/src/ansible_grid.c
+++ b/src/ansible_grid.c
@@ -377,6 +378,12 @@ void grid_keytimer(void) {

                    // WRITE PRESET

+                   // listen on reserved address to disable i2c during save
+                   // https://github.com/monome/ansible/issues/86
+                   if (!leader_mode) {
+                       init_i2c_follower(0x03);
+                   }
+
                    switch (ansible_mode) {
                    case mGridMP:
                        flashc_memset8((void*)&(f.mp_state.preset), preset, 1, true);
@@ -419,6 +426,8 @@ void grid_keytimer(void) {
                    flashc_memset32((void*)&(f.kria_state.clock_period), clock_period, 4, true);
                    flashc_memset32((void*)&(f.kria_state.sync_mode), kria_sync_mode, sizeof(kria_sync_mode), true);

+                   ii_follower_resume();
+
                    monomeFrameDirty++;
                }
            }

and similar protections in other spots that do flash writes for other apps, where:

void ii_follower_resume(void) {
    if (!leader_mode) {
        switch (ansible_mode) {
        case mArcLevels:
            init_i2c_follower(II_LV_ADDR);
            break;
        case mArcCycles:
            init_i2c_follower(II_CY_ADDR);
            break;
        case mGridKria:
            init_i2c_follower(II_KR_ADDR);
            break;
        case mGridMP:
            init_i2c_follower(II_MP_ADDR);
            break;
        case mGridES:
            init_i2c_follower(ES);
            break;
        case mMidiStandard:
            init_i2c_follower(II_MID_ADDR);
            break;
        case mMidiArp:
            init_i2c_follower(II_ARP_ADDR);
            break;
        case mTT:
            init_i2c_follower(f.state.i2c_addr);
            break;
            default:
            break;
        }
    }
}

Here is a test build, I wrapped all the large save routines and mode changes for each operating mode but I did not wrap every single flash write. It seems to work okay with the I2C traffic generating script given above. I suspect that the problem has to do with I2C flooding during either long-running flash writes or flash that writes close to the start of nvram. ansible.zip

scanner-darkly commented 4 years ago

amazing work!!

However changing the follower address to 0x03 (reserved as "for future use" in the I2C spec) for the duration of the write does appear to fix the crash.

very strange - but if it indeed fixes the issue, that's great!

I seem to recall reading that I2C interrupts were configured with top priority / possibly unmaskable status

what i did was lower priority of UI interrupts here: https://github.com/monome/libavr32/commit/9adf6b79d34814f9cfb7aeaf2a08542f98b94db9

the idea was that other ISRs interrupting i2c would corrupt the bus, so made sense to give them lower priority than that of I2C interrupts. from what i can tell, I2C priority is configured here: https://github.com/monome/libavr32/blob/main/conf/conf_twi.h#L49

and other priorities are configured here: https://github.com/monome/libavr32/blob/main/conf/conf_tc_irq.h#L5

the common function to disable interrupts is here: https://github.com/monome/libavr32/blob/main/src/interrupts.c

so indeed it appears it would not mask I2C interrupts. might be interesting to try changing I2C priority to 2 or disabling the I2C level as well and see if it makes any difference. looks like there is a function here that might be the proper way to do it: https://github.com/monome/libavr32/blob/main/asf/avr32/drivers/twi/twi.c#L349

if (!leader_mode)

should the same fix be applied to leader devices? change them to follower mode temporarily and then change back?

I suspect that the problem has to do with I2C flooding

this also reminded me that i added a buffer for i2c reads https://github.com/monome/libavr32/blob/main/src/i2c.c#L58 - and i wonder if some edge conditions cause out of bounds buffer updates which could possibly explain corruption.. i don't see any way this could happen (unless there are some race conditions maybe?) but take a look and see if anything looks off.

csboling commented 4 years ago

might be interesting to try changing I2C priority to 2 or disabling the I2C level as well and see if it makes any difference.

Didn't have any luck with this, still getting the same crash symptoms.

should the same fix be applied to leader devices? change them to follower mode temporarily and then change back?

I've experimented with the same circumstances when saving and not produced an error in leader mode, I think if I'm reading this right that setting either leader or follower mode replaces the TWI interrupt handler so receiving I2C is disabled in that case anyway.

i wonder if some edge conditions cause out of bounds buffer updates which could possibly explain corruption

Yeah, I don't see how. Even if it went out of bounds it seems really unlikely that it would be able to interfere with flash.

I think I will go ahead and PR the change that resolves the problem for the moment, and we should do a new release, have a few little fixes and improvements, including this problem that makes the USB disk functionality pretty much unusable in v3.0.0.

scanner-darkly commented 4 years ago

sounds good - one thing i want to do before we do a release is add support for disting ex - will do it this week!

tehn commented 4 years ago

i'm not sure why this was closed, i apologize

tehn commented 4 years ago

augh i should not be doing github right now, sorry!

monome / ansible

Complete lockup and possible loss of all presets when I2C occurs during preset save #86