reticulatedpines / magiclantern_simplified

A Git based version of Magic Lantern, for those unwilling or unable to work using Mercurial. The vast majority of branches have been removed, with those thought to be important brought in individually and merged.
GNU General Public License v2.0

DIGIC8: Crashes related to STATE_OBJECT_HOOKS and EVF_STATE #45

Open kitor opened 2 years ago

kitor commented 2 years ago

Both R and RP (untested on M50) have random crashes related to the EvfCap task. Recently @coon42 got a nice trace [1] that sent us to stateobj_lv_spy() in state-object.c. Disabling state object use (implemented in a89d71f97c620dd93a2a24098b7c0de58da59445) mitigates the issue, but this needs further investigation.

stateobj_lv_spy() is the replacement state transition function that we install in EVF_STATE. Quick static analysis yielded nothing, except that we might be oversimplifying: the real state transition function has a couple of checks that, if I'm not mistaken, we have not implemented.
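To make the concern concrete, here is a minimal host-side sketch of the hook pattern. It is not the code from state-object.c and does not reflect the real DryOS layout: the struct, `real_dispatch`, `spy_dispatch`, `install_spy` and `demo_init` are all hypothetical names invented for illustration. The point it shows is that a replacement that delegates to the saved original keeps whatever checks the original performs, whereas a replacement that re-implements the transition has to reproduce them itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a state object; NOT the real DryOS structure. */
#define NUM_STATES 2
#define NUM_INPUTS 2

struct state_object;
typedef int (*dispatch_fn)(struct state_object *self, int input, int arg);
typedef int (*action_fn)(int arg);

struct transition { action_fn action; int next_state; };

struct state_object {
    dispatch_fn dispatch;          /* the pointer a hook would replace */
    int current_state;
    struct transition matrix[NUM_STATES][NUM_INPUTS];
};

/* Stand-in for the firmware's own transition function, including the
 * kind of validity checks a naive replacement could forget. */
static int real_dispatch(struct state_object *self, int input, int arg)
{
    if (input < 0 || input >= NUM_INPUTS)
        return -1;                 /* check 1: input in range */
    struct transition *t = &self->matrix[self->current_state][input];
    if (t->action == NULL)
        return -1;                 /* check 2: transition defined */
    int ret = t->action(arg);
    self->current_state = t->next_state;
    return ret;
}

static dispatch_fn saved_dispatch;
static int spy_calls, last_old_state, last_new_state;

/* Replacement: observe, but delegate so the original checks still run
 * and the original return value reaches the caller unchanged. */
static int spy_dispatch(struct state_object *self, int input, int arg)
{
    int old_state = self->current_state;
    int ret = saved_dispatch(self, input, arg);
    spy_calls++;
    last_old_state = old_state;
    last_new_state = self->current_state;
    return ret;
}

static void install_spy(struct state_object *obj)
{
    /* Installing twice would make the spy call itself forever. */
    assert(obj != NULL && obj->dispatch != spy_dispatch);
    saved_dispatch = obj->dispatch;
    obj->dispatch = spy_dispatch;
}

static int act_ok(int arg) { (void)arg; return 0; }

/* Tiny two-state object for trying the sketch on a PC. */
static void demo_init(struct state_object *obj)
{
    memset(obj, 0, sizeof(*obj));
    obj->dispatch = real_dispatch;
    obj->matrix[0][0] = (struct transition){ act_ok, 1 };
    obj->matrix[1][1] = (struct transition){ act_ok, 0 };
}
```

After `demo_init()` and `install_spy()`, calling `obj.dispatch(&obj, 0, 0)` moves the demo object from state 0 to state 1 and the spy records both states; an out-of-range input is still rejected because the original function does the checking.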


7968.524 in menu_open
8090.038 [LVEVFC] ERROR SendEventEvfDev : [12][e005edfb]
8090.081 [STARTUP] ERROR ASSERT : LiveView::EvfCapState.c
8090.106 [STARTUP] ASSERT : Task = EvfCap
8090.109 [STARTUP] ASSERT : Core 0
8090.114 [STARTUP] ASSERT : Line 370
8090.120 [STARTUP] < StackDump >
8090.123 [STARTUP] SP: 0x00213D44

[DM] FROM Write Complete!!!
     3667:  38111.302 SHUTDOWN REASON 1

kitor commented 2 years ago

Looks like we still see some similar crashes on RP. Requires more in-depth testing.

reticulatedpines commented 2 years ago

Possibly useful for diagnosis, a library for producing much more detailed stack information:

Might want a separate ticket, depending on how hard it is to integrate.

kitor commented 2 years ago

Confirmed on other Digic 8 models. Not tested on Digic X, as it doesn't run LV overlays yet.

Disabling state object use (mentioned in the first post) did not fix the issue; it was just a fluke, due to the randomness of whatever unknown condition triggers the crash.

In general, all cases are related to some vsync callback timeout.

reticulatedpines commented 2 years ago

Hmm, improved stack traces might not help much with a callback timeout. Still worth a try.

Perhaps a better debugging approach would be a minimal ML with only logging facilities (dm_set_store more stuff at level 3 or 1?), and compare with / without, to try and see what is different.
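When comparing two such logs, the leading timestamps will differ on every run, so it helps to ignore them. A small sketch of that, assuming lines shaped like the trace in the first post (`8090.038 [LVEVFC] ERROR ...`); the helper names `skip_timestamp` and `same_message` are made up for this example and it only handles the plain `seconds.millis` prefix:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Return a pointer past a leading "digits.digits " timestamp, or the
 * line itself if it does not start with one. */
static const char *skip_timestamp(const char *line)
{
    assert(line != NULL);
    const char *p = line;
    while (*p == ' ' || *p == '\t')
        p++;

    const char *whole = p;
    while (isdigit((unsigned char)*p))
        p++;
    if (p == whole || *p != '.')
        return line;               /* no integer part, or no dot */
    p++;

    const char *frac = p;
    while (isdigit((unsigned char)*p))
        p++;
    if (p == frac || *p != ' ')
        return line;               /* no fractional part or separator */

    while (*p == ' ')
        p++;
    return p;
}

/* Two log lines carry the same message if they match once the
 * timestamps are ignored. */
static int same_message(const char *a, const char *b)
{
    return strcmp(skip_timestamp(a), skip_timestamp(b)) == 0;
}
```

Lines without a timestamp, such as `[DM] FROM Write Complete!!!`, are returned unchanged, so they still compare against each other as-is.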

Alternatively, it may simply be that we hold a lock for too long, or otherwise do too much processing in a window that is important for DryOS. That sounds boring but is easy to investigate: pare back ML until the crash disappears, so we can work out which area we're being too demanding in. Quite plausibly our RGB / YUV buffer code.