todbot / blink1

Official software for blink(1) USB RGB LED by ThingM
https://blink1.thingm.com/

Kernel crash on Linux #198

Open riquito opened 9 years ago

riquito commented 9 years ago

It has happened 4 times, and only while I was testing the blink1, so I'm confident that's the cause. The gist contains the content of /var/log/messages during the latest crash.

https://gist.github.com/riquito/5c48037b6929bacafdf7

It may be linked to unplugging the device, but I'm not sure. I'll try to see if I can reproduce it reliably (not that I have fun crashing my pc :-P)

edit: Fedora 21, x86_64, kernel 3.17.4-301

ellson commented 9 years ago

Me too. Multiple crashes, multiple Fedora-21 machines. Hard crash of kernel requiring reboot. Sometimes on first use of blink1. Semi-reliable crash on first use after removing and reinserting blink1 in different USB port.

kernel-3.17.7-300.fc21.x86_64

(How did riquito get /var/log/messages? Isn't it in journalctl now?)

journalctl shows:

Jan 09 01:48:25 mldt kernel: thingm 0003:27B8:01ED.0006: hidraw2: USB HID v1.01 Device [ThingM blink(1) mk2] on usb-0000:00:1d.0-1.2/input0
Jan 09 01:48:25 mldt mtp-probe[2853]: checking bus 2, device 7: "/sys/devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2"
Jan 09 01:48:25 mldt mtp-probe[2853]: bus: 2, device: 7 was not an MTP device
Jan 09 01:48:28 mldt kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000009
Jan 09 01:48:28 mldt kernel: IP: [] free_pid+0x7d/0x160
Jan 09 01:48:28 mldt kernel: PGD f9318d067 PUD f93011067 PMD 0
Jan 09 01:48:28 mldt kernel: Oops: 0002 [#1] SMP
Jan 09 01:48:28 mldt kernel: Modules linked in: bnep bluetooth rfkill fuse xt_CHECKSUM ipt_MASQUERADE tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT xt_co
Jan 09 01:48:28 mldt kernel: tpm_infineon tpm_tis snd_seq_device tpm snd_pcm snd_timer snd mei_me lpc_ich mei mfd_core soundcore shpchp binfmt_misc mxm_wmi e1000e ptp pps_core wmi
Jan 09 01:48:28 mldt kernel: CPU: 0 PID: 2866 Comm: b1t Tainted: P OE 3.17.7-300.fc21.x86_64 #1
Jan 09 01:48:28 mldt kernel: Hardware name: MSI MS-7760/X79A-GD45 Plus (MS-7760), BIOS V17.7 12/20/2013
Jan 09 01:48:28 mldt kernel: task: ffff880f930a89d0 ti: ffff8800b5cb4000 task.ti: ffff8800b5cb4000
Jan 09 01:48:28 mldt kernel: RIP: 0010:[] [] free_pid+0x7d/0x160
Jan 09 01:48:28 mldt kernel: RSP: 0018:ffff8800b5cb7e40 EFLAGS: 00010002
Jan 09 01:48:28 mldt kernel: RAX: 0000000000000001 RBX: ffff880fd198cc80 RCX: ffff880fd198cc80
Jan 09 01:48:28 mldt kernel: RDX: ffff88103ff921b8 RSI: 0000000000001492 RDI: ffffffff81c0a100
Jan 09 01:48:28 mldt kernel: RBP: ffff8800b5cb7e60 R08: 0000000000000046 R09: ffff8800b5cb7e58
Jan 09 01:48:28 mldt kernel: R10: 0000000000000000 R11: 000000000000001a R12: 0000000000000046
Jan 09 01:48:28 mldt kernel: R13: ffffffff81c4b9c0 R14: 0000000000000000 R15: ffff880fe4fb2940
Jan 09 01:48:28 mldt kernel: FS: 0000000000000000(0000) GS:ffff88103fc00000(0000) knlGS:0000000000000000
Jan 09 01:48:28 mldt kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 09 01:48:28 mldt kernel: CR2: 0000000000000009 CR3: 0000000f93012000 CR4: 00000000001407f0
Jan 09 01:48:28 mldt kernel: Stack:
Jan 09 01:48:28 mldt kernel: ffff880f930a89d0 ffff880faa079b00 0000000000000000 0000000000000000
Jan 09 01:48:28 mldt kernel: ffff8800b5cb7e70 ffffffff810b30d1 ffff8800b5cb7e80 ffffffff810b3610
Jan 09 01:48:28 mldt kernel: ffff8800b5cb7ed8 ffffffff810981a2 ffff880f930a8db0 0000000000000000
Jan 09 01:48:28 mldt kernel: Call Trace:
Jan 09 01:48:28 mldt kernel: [] __change_pid+0x71/0x80
Jan 09 01:48:28 mldt kernel: [] detach_pid+0x10/0x20
Jan 09 01:48:28 mldt kernel: [] release_task+0x222/0x480
Jan 09 01:48:28 mldt kernel: [] do_exit+0x75b/0xaa0
Jan 09 01:48:28 mldt kernel: [] SyS_exit+0x17/0x20
Jan 09 01:48:28 mldt kernel: [] system_call_fastpath+0x16/0x1b
Jan 09 01:48:28 mldt kernel: Code: 83 c6 01 44 3b 73 04 0f 87 8b 00 00 00 49 63 ce 48 c1 e1 05 48 01 d9 48 8b 41 40 48 8b 51 48 4c 8b 69 38 48 85 c0 48 89 02 74 04 <48> 89 50 08 48 b8
Jan 09 01:48:28 mldt kernel: RIP [] free_pid+0x7d/0x160
-- Reboot --

ellson commented 9 years ago

(( The previous comment is missing data. Apparently I can't paste text containing < or > . Also, the kernel guys won't like the tainted kernel from this machine. I can grab this again from different host, if needed? ))

todbot commented 9 years ago

Hi! What is the exact distro and Linux version you are using? What are the exact command-line commands you are using to control the blink(1)? Are you using a pre-compiled blink1-tool, or something else? If you are using a pre-compiled binary, what is the download URL of that binary?

ellson commented 9 years ago

mldt:~$ cat /etc/redhat-release
Fedora release 21 (Twenty One)
mldt:~$ uname -a
Linux mldt 3.17.7-300.fc21.x86_64 #1 SMP Wed Dec 17 03:08:44 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

mldt:commandline (master)$ git pull
Already up-to-date.
mldt:commandline (master)$ make EXEFLAGS=
Building for OS=linux BLINK1_VERSION=v1.95-linux-x86_64
cc -shared -o libblink1.so pkg-config libusb-1.0 --libs -lrt -lpthread -ldl -DUSE_HIDAPI -I./hidapi/hidapi pkg-config libusb-1.0 --cflags -fPIC -std=gnu99 -g -DBLINK1_VERSION=\"""v1.95"-linux-"x86_64""\" ./hidapi/libusb/hid.o blink1-lib.o pkg-config libusb-1.0 --libs -lrt -lpthread -ldl

I was just using simple commands like:

blink1-tool --on
blink1-tool --off

remove and reinsert blink1 in a different USB port

blink1-tool --on

ellson commented 9 years ago

On a different host (work) same distro and os (not-tainted) I tried running under gdb.

I got no useful information when it crashed. It's just a total instantaneous system lockup.

The only visible oddity in journalctl was this error message when the blink1 was removed, but the crash didn't occur until the device was reinserted and a command sent, none of which made it into the logs.

Jan 09 16:14:11 work kernel: usb 1-1.3: USB disconnect, device number 3
Jan 09 16:14:11 work systemd-udevd[23533]: error opening USB device 'descriptors' file
-- Reboot --

todbot commented 9 years ago

This seems like an issue with the USB drivers in Fedora, or somehow HIDAPI (the library we use to talk to blink(1)) or libusb (the library HIDAPI uses) is tickling some problem further down the software stack.

Is there a different distro (something other than Fedora 21) you can try? Or, is there a list of changes between Fedora 18 (the last I tried) and Fedora 21?

hprid commented 9 years ago

I experience a crash too. I'm not using Fedora but Debian Jessie:

henning@henning-laptop:~$ uname -a
Linux henning-laptop 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt2-1 (2014-12-08) x86_64 GNU/Linux

Here is the dmesg output: https://gist.github.com/hprid/2f2f3063abc3c2bf16de#file-dmesg-blink1-txt-L33

I can't reproduce it exactly, but it has something to do with unplugging/replugging the device.

hprid commented 9 years ago

After some more research I found that the issue seems to be fixed in newer kernel versions with commit 67a97845830f79584c9db8849ac723e5d2d57f65, which is not present in Debian Jessie. After rebuilding the kernel with this patch I can no longer reproduce the issue.

riquito commented 9 years ago

Nice catch, @hprid. The patch should be available from kernel 3.17 onwards, if I'm not mistaken.

todbot commented 9 years ago

Thanks for the patch info, that's very interesting.

I think that kernel patch only applies if you're using the blink(1) kernel driver. At least one person commenting on this issue was using the userspace blink1-tool.
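A quick way to see whether the kernel driver is in play is to look for its module in the loaded-module list. This is only a sketch: it assumes the driver is built as a loadable module named `hid_thingm` (a driver built into the kernel would not show up here), and the function name and its optional file argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether the ThingM HID kernel module appears to be loaded.
# Assumption: the driver is a loadable module named "hid_thingm".
# The optional argument (path to a modules list) is hypothetical and only
# exists so the check can be tried against a saved copy of /proc/modules.
thingm_module_loaded() {
    modules_file="${1:-/proc/modules}"
    # /proc/modules lists one module per line, with the module name first.
    grep -q '^hid_thingm ' "$modules_file" 2>/dev/null
}

if thingm_module_loaded; then
    echo "hid_thingm is loaded; the kernel driver may be handling the device"
else
    echo "hid_thingm not found in the module list"
fi
```

The journalctl excerpt posted earlier in this thread has a `thingm` line at plug-in time, which suggests the kernel driver can bind to the device even when only blink1-tool is being used.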

If anyone using blink1-tool is having a crash issue (which for the life of me I can't see how a userspace prog should crash Linux nowadays), I would like to enlist them to try two tests:

  1. If you haven't already, compile blink1-tool on your own system by checking out the repo and doing cd blink1/commandline && make. I've seen USB-based binaries "mostly work" across distros but act flakey. Maybe this is a version of that flakiness.
  2. Try using a hidraw build of blink1-tool instead of libusb. Do this with cd blink1/commandline && make clean && make USBLIB_TYPE=HIDDATA.

The (2) above is more of a work-around than a solution but may at least not tickle the bug since it's using a different low-level USB API.
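The two tests above can be sketched as a small script. It assumes the repo is already checked out at ./blink1 and uses `make -C`, which is equivalent to the `cd ... && make` form. The `DRY_RUN` variable and the function names are illustrative only; with the default `DRY_RUN=1` the script just prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the two test builds, assuming the repo is checked out at ./blink1.
# DRY_RUN and the function names are hypothetical helpers for this example.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Test 1: plain local build with the default USB library.
build_default() {
    run make -C blink1/commandline
}

# Test 2: clean rebuild using the HIDDATA backend instead of the default.
build_hiddata() {
    run make -C blink1/commandline clean
    run make -C blink1/commandline USBLIB_TYPE=HIDDATA
}

build_default
build_hiddata
```

Run it with `DRY_RUN=0` to actually build; each build overwrites the previous blink1-tool binary, so test one before building the other.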

hprid commented 9 years ago

The blink1-tool triggered the kernel crash for me. I also tried recompiling blink1-tool, and that build crashed the kernel too. The easiest way to crash the kernel in a reproducible way (before applying the mentioned patch) was:

while true; do  ./blink1-tool --list; done

and replugging the blink1 several times. Just replugging it several times without running blink1-tool doesn't crash the kernel.

Haven't tried a hidraw build, but can test it tomorrow if you like.

hprid commented 9 years ago

@riquito The patch isn't in 3.17, last commit regarding the patched file on 3.17.8 is e4aecaf2f53bc6635b484ee2f1b8a1e4c73e7997 (Tue Jun 3 13:29:38 2014 +0200). First kernel version with the patch is 3.18.
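For anyone wanting to check their own machine, here is a rough sketch that compares a kernel release string against 3.18. It only looks at the upstream major.minor number, so it is an assumption-laden heuristic: a distro kernel with an older number could still carry the fix as a backport (as with a locally rebuilt kernel). The function name is made up for this example.

```shell
#!/bin/sh
# Sketch: is a kernel release string at least 3.18?
# Assumption: only the upstream major.minor number is compared; backported
# fixes in older distro kernels are not detected.
kernel_at_least_3_18() {
    release="${1:-$(uname -r)}"
    # "3.17.7-300.fc21.x86_64" -> major=3, minor=17
    major=${release%%.*}
    rest=${release#*.}
    minor=${rest%%[!0-9]*}
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 18 ]; }
}

if kernel_at_least_3_18; then
    echo "running kernel is 3.18 or newer"
else
    echo "running kernel is older than 3.18"
fi
```

Passing a release string as the first argument, for example `kernel_at_least_3_18 "3.16.0-4-amd64"`, checks that string instead of the running kernel.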

aonach commented 9 years ago

I think this is the same issue (although for me it always crashes).

I'm using Debian Wheezy (kernel 3.16). I have the udev rules set up.

Following todbot's instructions from above when I do:

1 - The build compiles and I can run ./blink1-tool. However, whenever I send a command to the device, such as a simple ./blink1-tool --on, my whole system freezes completely. Note that it does (mostly) send the command to the blink light, but the only means of recovery is to turn the machine off at the power button. I've tried this a number of times and it is repeatable.

2 - When I do `make USBLIB_TYPE=HIDDATA` it compiles but I can't send anything to the blink light. See the errors encountered below for various commands.

joseph@pixel:~/dev-home/blink1/commandline$ ./blink1-tool --list
blink(1) list:
id:0 - serialnum:
(Listing not supported in HIDDATA builds)

joseph@pixel:~/dev-home/blink1/commandline$ ./blink1-tool --on
set dev:0 to rgb:0xff,0xff,0xff over 300 msec
Error sending message: error sending control message: Device or resource busy

joseph@pixel:~/dev-home/blink1/commandline$ ./blink1-tool --on
set dev:0 to rgb:0xff,0xff,0xff over 300 msec
Error sending message: error sending control message: Device or resource busy

joseph@pixel:~/dev-home/blink1/commandline$ sudo ./blink1-tool --on
[sudo] password for joseph:
set dev:0 to rgb:0xff,0xff,0xff over 300 msec
Error sending message: error sending control message: Device or resource busy

joseph@pixel:~/dev-home/blink1/commandline$ sudo ./blink1-tool --on -v
deviceId[0] = 0
cached list:
0: serial: '' ''
openById: 0
set dev:0 to rgb:0xff,0xff,0xff over 300 msec
Error sending message: error sending control message: Device or resource busy

Any and all suggestions welcome. I can pull out logs if you tell me what to look for and/or where to go. This is a very vanilla Debian Wheezy installation.

Thanks, Joseph