Closed csboling closed 4 years ago
Perhaps a simple protection could be stopping I2C listening for the duration of a preset write (or even for all flash writes?), though I can't yet account for why this can cause flash corruption, and specifically corrupting the "fresh" sentinel byte at the start of nvram.
I actually am not certain how best to "stop listening" for i2c, there doesn't seem to be an obvious libavr32 routine for this. Maybe can bind a reserved address like 0x03? Probably cleaner to just disable all interrupts.
a flag would have to be added to any functions that are responsible for i2c, maybe even on libavr32 level. i can take a dig. and yeah regardless, disabling all interrupts is a good idea since we have a mechanism for that.
amazing find and great that it’s reproducible. thank you for investigating this!
Meant to add that it does not matter whether TT clocking for Kria is enabled or not. I have also tried:
irqs_pause
/ irqs_resume
and these did not work. However changing the follower address to 0x03 (reserved as "for future use" in the I2C spec) for the duration of the write does appear to fix the crash.
Possible explanation: I seem to recall reading that I2C interrupts were configured with top priority / possibly unmaskable status, so irqs_pause
would not actually cover them? If too many i2c interrupts occur (un-acked? should always just ack the interrupt and put a message in the event queue but I'm not immediately finding the ISR code), perhaps the processor hard faults, and if this occurs during a flash write perhaps the whole block of flash could get corrupted? Since this preset save procedure touches parts of flash close to the beginning, this could wind up trashing the sentinel byte.
I can also confirm that the same crash behavior is seen if switching apps while hammering the i2c bus, because the current active app is saved after each app change -- this state is also stored quite close to the front of nvram.
So a possible patch, at least until another resolution can be found, is:
diff --git a/src/ansible_grid.c b/src/ansible_grid.c
index 35168e1..70ed55a 100644
--- a/src/ansible_grid.c
+++ b/src/ansible_grid.c
@@ -377,6 +378,12 @@ void grid_keytimer(void) {
// WRITE PRESET
+ // listen on reserved address to disable i2c during save
+ // https://github.com/monome/ansible/issues/86
+ if (!leader_mode) {
+ init_i2c_follower(0x03);
+ }
+
switch (ansible_mode) {
case mGridMP:
flashc_memset8((void*)&(f.mp_state.preset), preset, 1, true);
@@ -419,6 +426,8 @@ void grid_keytimer(void) {
flashc_memset32((void*)&(f.kria_state.clock_period), clock_period, 4, true);
flashc_memset32((void*)&(f.kria_state.sync_mode), kria_sync_mode, sizeof(kria_sync_mode), true);
+ ii_follower_resume();
+
monomeFrameDirty++;
}
}
and similar protections in other spots that do flash writes for other apps, where:
void ii_follower_resume(void) {
if (!leader_mode) {
switch (ansible_mode) {
case mArcLevels:
init_i2c_follower(II_LV_ADDR);
break;
case mArcCycles:
init_i2c_follower(II_CY_ADDR);
break;
case mGridKria:
init_i2c_follower(II_KR_ADDR);
break;
case mGridMP:
init_i2c_follower(II_MP_ADDR);
break;
case mGridES:
init_i2c_follower(ES);
break;
case mMidiStandard:
init_i2c_follower(II_MID_ADDR);
break;
case mMidiArp:
init_i2c_follower(II_ARP_ADDR);
break;
case mTT:
init_i2c_follower(f.state.i2c_addr);
break;
default:
break;
}
}
}
Here is a test build, I wrapped all the large save routines and mode changes for each operating mode but I did not wrap every single flash write. It seems to work okay with the I2C traffic generating script given above. I suspect that the problem has to do with I2C flooding during either long-running flash writes or flash that writes close to the start of nvram. ansible.zip
amazing work!!
However changing the follower address to 0x03 (reserved as "for future use" in the I2C spec) for the duration of the write does appear to fix the crash.
very strange - but if it indeed fixes the issue, that's great!
I seem to recall reading that I2C interrupts were configured with top priority / possibly unmaskable status
what i did was lower priority of UI interrupts here: https://github.com/monome/libavr32/commit/9adf6b79d34814f9cfb7aeaf2a08542f98b94db9
the idea was that other ISRs interrupting i2c would corrupt the bus, so made sense to give them lower priority than that of I2C interrupts. from what i can tell, I2C priority is configured here: https://github.com/monome/libavr32/blob/main/conf/conf_twi.h#L49
and other priorities are configured here: https://github.com/monome/libavr32/blob/main/conf/conf_tc_irq.h#L5
the common function to disable interrupts is here: https://github.com/monome/libavr32/blob/main/src/interrupts.c
so indeed it appears it would not mask I2C interrupts. might be interesting to try changing I2C priority to 2 or disabling the I2C level as well and see if it makes any difference. looks like there is a function here that might be the proper way to do it: https://github.com/monome/libavr32/blob/main/asf/avr32/drivers/twi/twi.c#L349
if (!leader_mode)
should the same fix be applied to leader devices? change them to follower mode temporarily and then change back?
I suspect that the problem has to do with I2C flooding
this also reminded me that i added a buffer for i2c reads https://github.com/monome/libavr32/blob/main/src/i2c.c#L58 - and i wonder if some edge conditions cause out of bounds buffer updates which could possibly explain corruption.. i don't see any way this could happen (unless there are some race conditions maybe?) but take a look and see if anything looks off.
might be interesting to try changing I2C priority to 2 or disabling the I2C level as well and see if it makes any difference.
Didn't have any luck with this, still getting the same crash symptoms.
should the same fix be applied to leader devices? change them to follower mode temporarily and then change back?
I've experimented with the same circumstances when saving and not produced an error in leader mode, I think if I'm reading this right that setting either leader or follower mode replaces the TWI interrupt handler so receiving I2C is disabled in that case anyway.
i wonder if some edge conditions cause out of bounds buffer updates which could possibly explain corruption
Yeah, I don't see how. Even if it went out of bounds it seems really unlikely that it would be able to interfere with flash.
I think I will go ahead and PR the change that resolves the problem for the moment, and we should do a new release, have a few little fixes and improvements, including this problem that makes the USB disk functionality pretty much unusable in v3.0.0.
sounds good - one thing i want to do before we do a release is add support for disting ex - will do it this week!
i'm not sure why this was closed, i apologize
augh i should not be doing github right now, sorry!
First reported here: https://llllllll.co/t/kria-freezing-when-saving-presets/34269 though I am quite surprised this has not been seen more commonly.
Steps to reproduce:
I would guess other scripts can reproduce the issue, this was just the first thing I isolated that could reliably reproduce the error.