monome / norns

norns is many sound instruments.
http://monome.org
GNU General Public License v3.0
629 stars 145 forks source link

occasional segfault on USB device removal #920

Closed tehn closed 4 years ago

tehn commented 4 years ago

seen a few times now. matron locks up or crashes on grid cable removal.

will add debug info when i get it.

may be related to https://github.com/monome/norns/pull/886 which touched a lot of device management (?)

tehn commented 4 years ago

running build/matron/matron

grid added: 1   monome 128 m1000464 m1000464
grid added: 2   monome 128 m1000464 m1000464
grid added: 3   monome 128 m1000464 m1000464
grid added: 4   monome 128 m1000464 m1000464
grid added: 5   monome 128 m1000464 m1000464
grid added: 6   monome 128 m1000464 m1000464
grid added: 7   monome 128 m1000464 m1000464
grid added: 8   monome 128 m1000464 m1000464
grid added: 9   monome 128 m1000464 m1000464
grid added: 10  monome 128 m1000464 m1000464
Segmentation fault

time to add a bunch of debug prints to the device system

tehn commented 4 years ago

to confirm, this only happens on removal.

trying to identify the segfault location, it seems the whole removal process completes--- through weaver and monome device closure. the segfault occurs momentarily after completion of everything.


i just rolled back to 6551cba63bc3f91de605f780ac7517f5aa3948ed which is before https://github.com/monome/norns/pull/886 was merged, which is the last thing to touch the device code.

i did 100 usb plug cycles without a segfault. previously my max was about 30, sometimes it happened within 1-5.

@ryanlaws @artfwo @catfact could you help re-review this PR for potential problems? https://github.com/monome/norns/pull/886

ryanlaws commented 4 years ago

Unfortunately I do not have a grid to test with. Can you repro with HID? I will test with that when I get home.

tehn commented 4 years ago

just replicated with an HID keyboard after 20 insert/removes

ryanlaws commented 4 years ago

Thank you. Won't be able to look at this for ~8 hours.

tehn commented 4 years ago

no rush--- thanks for your time!

ryanlaws commented 4 years ago

I'm unable to repro :slightly_frowning_face:

Well... what are you thinking? Check whether dn == NULL before checking its d property?

ryanlaws commented 4 years ago

Also what are your HID repro steps? This is what's not working for me:

  1. Open hid-demo/hid-keyboard-test
  2. Plug in keyboard (Leopold Otaku 101), wait for lights to come on
  3. Type, confirm keystrokes visible
  4. Fiddle with system menu if keys don't show up
  5. Unplug
  6. Repeat 2-5

I'm consistently able to type and nothing is observably wrong on the norns. I have probably tried 100+ times now but I'm not counting.

One thing that might be interesting is the fact that the norns seems to think my keyboard is 2 devices for some reason - USB Keyboard and USB Keyboard 2, and it seems like only one of the 2 devices will work at a time. It seems to favor the latter. I've only seen one time where USB Keyboard was working, then I unplugged and it worked again. But probably 60-80% of the time USB Keyboard 2 is working while the other isn't. I wonder if being recognized as multiple devices is somehow saving the keyboard/norns from segfaulting, and the times that USB Keyboard is working are the times when, if the keyboard had only provided 1 device, it would have segfaulted.

Anyway, I will keep an eye on this but I have had no luck reproducing this so far. I'm eager to hear whether anyone else is experiencing this.

tehn commented 4 years ago

repro: plug unplug over and over until segfault

wait for detection prior to unplug

ryanlaws commented 4 years ago

Something just dawned on me. My local build is at 8810899 (immediately after the merge). Can you repro at that point?

It would seem weird for it to happen due to something that didn't touch the device code. It looks like 6ee310b touched some device code too but looks a lot less likely culprit than the lines you pointed out.

ryanlaws commented 4 years ago

Repro'd after 10 tries with e22003b (master HEAD)... so that shoots down my theory about my odd 2-port keyboard. Very curious whether you can repro with 8810899. Again 6ee310b (and anything following it) seems a very unlikely culprit but this is what I'm observing. I'll try popping one commit at a time until I get to where I can't repro...

ryanlaws commented 4 years ago

Do you have other devices connected? It seems to be more unstable when there are more devices plugged in, or especially when there's a few devices connected and I disconnect them and then replug the HID.

tehn commented 4 years ago

i was testing with only a wifi nub. but i found this problem originally across many devices testing with grid alone.

i won't be able to try again until about 8-10 hours from now.

with debug prints i was able to very consistently see that it failed at the lines i indicated: https://github.com/monome/norns/pull/886/files#diff-184a47d97157ed4bd2678ff7549963c9R138

with a few more minutes debugging i can identify the exact line and trace it back... but if i figured it might be in the node sorting. can't see why there would be a race condition, but i thought at some point it got more stable when i added all these prints...

ryanlaws commented 4 years ago

I've been backing out a commit at a time. So far it took a very long time to repro with a44e8f5 - only once did I get it within 10 tries, only other was ~60 and another at ~125. One time I got the following:

failed to open hid device: /dev/input/event0 dev_list_add: error allocating device data failed to open hid device: /dev/input/event1 dev_list_add: error allocating device data

But everything kept ticking along.

The time it took >100, the keyboard handling stopped working and logged errors on every replug, so there might be another bug here:

lua: /home/we/norns/lua/core/hid.lua:149: no entry for hid 4 lua: /home/we/norns/lua/core/hid.lua:149: no entry for hid 6 lua: /home/we/norns/lua/core/hid.lua:149: no entry for hid 8 lua: /home/we/norns/lua/core/hid.lua:149: no entry for hid 10 lua: /home/we/norns/lua/core/hid.lua:149: no entry for hid 12

Re-selecting the app didn't fix it, either.

So I'm not sure if #921 might have introduced something?

As far as those lines you mentioned, I attempted to add a boneheaded fix with 6cc6f99 prior to starting this commit backout exercise, but it actually seemed to make things worse. I'm not sure whether this was the kind of fix you had in mind - I'm sure there's a better way.

catfact commented 4 years ago

thanks @ryanlaws and sorry for the trouble

if we think it's something that came in later i would first try the very next commit adc782c3a49aad061abd988c40da19a993f60b4b

it's not obvious, but this is the first time the async command calls are actually used.

921 itself seems an unlikely culprit - simply detaches the system_cmd worker thread. (but of course it's possible.)

does this seem like a failed allocation? (presumably followed down the line by freeing NULL - the immediate cause?)

i'll take a stab at bisecting also when i have the time. valgrind might help here.

ryanlaws commented 4 years ago

Thanks @catfact. Yeah, I think you're right, #921 seems very unlikely, and it's possible it was just a fluke that the segfaults happened to calm down when I went to the commit prior to that (or I accidentally dropped 2 commits, including adc782c, instead of 1). I don't know my way around valgrind at all, but I'll keep you posted on results if I muster up the courage to try it.

This bug is a bit scary to test too because I can feel the USB ports' lifespans deteriorating with every replug.

catfact commented 4 years ago

ok well don't worry about it then - @tehn and i are better equipped to abuse our hardware :)

tehn commented 4 years ago

@ryanlaws happy to replace any hardware for you when needed, have no fear

catfact commented 4 years ago

@tehn caught this trace

#0  __GI_strcmp (p1=0x6d80dcd0 "/dev/snd/midiC1D0", p2=<optimized out>) at strcmp.c:39
#1  0x00015358 in dev_lookup_path (node_head=0x6d80ede8, path=<optimized out>)
    at ../matron/src/device/device_list.c:39
#2  dev_remove_node (node=<optimized out>, event_remove=<optimized out>, dn=0x6d80ede8)
    at ../matron/src/device/device_list.c:139
#3  dev_list_remove (type=type@entry=DEV_TYPE_MIDI, node=node@entry=0x6d80dcd0 "/dev/snd/midiC1D0")
    at ../matron/src/device/device_list.c:155
#4  0x00015834 in handle_device (dev=dev@entry=0x6d80b688)
    at ../matron/src/device/device_monitor.c:231
#5  0x00015c34 in watch_loop (p=<optimized out>) at ../matron/src/device/device_monitor.c:194
#6  0x76f89fc4 in start_thread (arg=0x6f3fb250) at pthread_create.c:458
#7  0x76adf038 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76
   from /lib/arm-linux-gnueabihf/libc.so.6
catfact commented 4 years ago
tehn commented 4 years ago

for clarity, crash above happened with USB midi device (emu tab 1x1)

ryanlaws commented 4 years ago

Does this seem like a race condition?

catfact commented 4 years ago

kinda, but i'm not immediately grokking how that's possible. the crash comes from the watch_loop thread. is there somewhere else where the device list entry could be de-initialized from main thread??

ryanlaws commented 4 years ago

This is super anecdotal and might be a red herring, but I've seen some odd MIDI behaviors on RPi in the past. Specifically, with a Novation Launchpad Mini I saw the MIDI handler "sleep" so that after a period of inactivity, there would be a lag on sent or received messages. They may have even gotten dropped. I usually worked around this by adding regular "keep alive" messages - a blinking beat pulse worked the treat, as long as the tempo wasn't too slow.

Anyway, perhaps there's some "ready signal" we could watch for?

ngwese commented 4 years ago

after further testing i'm rather confident that PR #992 does indeed fix the this problem. i'm going to close this out. if problems persist we can reopen (or create a new ticket).