Closed ghost closed 8 years ago
OK so as I understand it, net_midi.c is a doubly linked list. The active midi ops in there make a circular linked list structure. The error condition 'error unlinking midi operator' should only ever fire if net_midi_list_remove is called twice on the same midi op.
That should surely never happen - possibly gives another clue to what might be going on...
oy sorry, i didn't see this earlier. yeah it really looks like ops are being removed twice from the list... this only happens in the op deinit function. could deinit be called twice somehow?
the list managemetn stuff itself is awfully simple. and yeah, it doesn't need to be doubly linked really.
Just added two commits with some more debug - now I'm getting a crash like this:
chunks per message: 1 , op class: 47 , size: 96 ; allocating... op creation failed; too many inputs in network. chunks per message: 1 , op class: 6 , size: 84 ; allocating... chunks per message: 1 , op class: 46 , size: 68 ; allocating... op creation failed; too many inputs in network. chunks per message: 1 , op class: 6 , size: 84 ; allocating... op creation failed; too many inputs in network. chunks per message: 1 , op class: 20 , size: 128 ; allocating... op creation failed; too many inputs in network. chunks per message: 1 , op class: 6 , size: 84 ; allocating... requested push of midi op already in list - bailing op creation failed; too many inputs in network.
that 'bail on midi op already in list' was silent in the original code. Needs some more debug on entering/leaving net_midi_list_push & net_midi_list_remove...
yeah.. something funny is happening when add_op
fails... and sequence of events is something like
op_init
, which calls net_midi_list_push
op_deinit
which should unlink it from the list againadd_op
should exit early before doing anything else(sorry i know this is obvious, just going through it)
next time a midi op is made, it reuses the chunk address? and it's already linked? that's the part that's weird.
Very weird! In fact I'm going to try printing the hex address of new midi ops - most obvious possibility would be that the new op pool code is playing up, allocating the same memory chunk twice...
It doesn't look like deinit is called twice...
With the mem address printed on every entry to midi_list_push & midi_list_remove:
chunks per message: 1 , op class: 6 , size: 84 ; allocating... entered net_midi_list_push D0042944 net_midi_list_push exited normally op creation failed; too many inputs in network. entered net_midi_list_remove D0042944 net_midi_list_remove exited normally chunks per message: 1 , op class: 16 , size: 88 ; allocating... op creation failed; too many inputs in network. chunks per message: 1 , op class: 49 , size: 72 ; allocating... op creation failed; too many inputs in network. chunks per message: 1 , op class: 5 , size: 96 ; allocating... setting monome grid focus, op pointer: 0xD0042944 , value: 1 warning! requested focus, but no handler was set. bad device type maybe? op creation failed; too many inputs in network. setting monome grid focus, op pointer: 0xD0042944 , value: 0 chunks per message: 1 , op class: 6 , size: 84 ; allocating... entered net_midi_list_push D0042944 requested push of midi op already in list - bailing op creation failed; too many inputs in network. entered net_midi_list_remove D0042944
If I didn't know better, I'd say there must be a bug in these four lines: https://github.com/rick-monster/aleph/blob/net-use-malloc/apps/bees/src/net_midi.c#L90-L94 The reason being (again going through it step by step)
Obviously this conclusion makes no sense, those 4 lines look correct!
Gonna write a debug function to print out the entire contents of ml, call it on entry/exit from push/remove midi_op, try & see what's going wrong there...
think I figured out the corner case that was causing this bug!
https://github.com/rick-monster/aleph/commit/0680f00bac21ca697b052d6e7c635310c5c90c19
So the problem with the original code is, what happens if you de-init ml.top when the list is not of length 1? the op to be deleted (ml.top) gets 'squashed out' of the doubly linked list, then ml.top is still left pointing to the disembodied op linked list element that just got deleted!
According to my reasoning on this - this bug must also have been present in the old code before arbitrary op deletion/insertion. (but not dead sure about that) Whatever - anyway now the bug seems to be fixed! This stress test:
Has been running now for over 5 minutes, whereas it would crash out before https://github.com/rick-monster/aleph/commit/0680f00bac21ca697b052d6e7c635310c5c90c19 after only 30s!
yep, that's a bug alright! sorrrry...
Well I somehow doubt that's the only remaining corner-case crash in BEES - there are a lot of lines in there!
Seeing a crash specific to operator insertion/deletion stress test with ops that use net_midi/net_hid (see https://github.com/rick-monster/aleph/issues/21 & https://github.com/monome/aleph/pull/267).
Reading through the code now in net_midi.c - seems like a linked list-ish thing. Also seems possible that algorithm assumes operators are allocated last-in-first-out, and seems plausible to me this may blow up with the observed error message 'error unlinking midi operator' in case the allocation order is broken when de-allocating...
@catfact do you remember writing the code for midi ops? Haven't fully understood the code in net_midi.c yet - does the above explanation hold water?
Seems HID ops also have net_hid.c, similar-ish code - makes sense that both would be blowing up in the same way...