Interrupt collision between smsc95xx and USB storage drivers under heavy load

benosteen commented 12 years ago

Steps to reproduce:

1) Have lots of files on a USB drive, plugged in and mounted.
2) Begin a download of a large file (100 MB+ is suggested) to that USB drive.
3) During the download, try to access large numbers of files (suggestions to follow).

This will at some indeterminate point freeze the system with kernel panics from the USB storage driver - "... not syncing: Fatal exception in interrupt" and kernel errors from the ethernet driver : "kevent may have dropped the interrupt."

Suggested means to replicate step 3)

If rootfs is on USB, apt-get install'ing a group of packages, apt-cache search and so on are good ways to uncover this collision. Otherwise, searching or grepping through a reasonable number of files on the USB is enough (find . | xargs grep -i "foo") for example.

It is hard to capture this error, as the kern.log doesn't sync the errors to disc, and the errors flash by too fast on tty to see them with any clarity.

Recreated with latest kernel + UAS built in and new modules and with kernel modules from 13/04 - with rootfs on USB and with the stock rootfs on SD. Having the rootfs on SD makes it more difficult to simulate the type of storage demand required to replicate the bug however.
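A minimal sketch of step 3, for anyone who wants to script the read load rather than type it by hand. The function name read_stress, the default pass count and the /mnt/usb path in the usage note are placeholders, not values from this report; the find | xargs grep pattern is the one suggested above.

```shell
# read_stress DIR [PASSES]: grep through every file under DIR, PASSES times
# (default 10). Prints the number of completed passes.
# Hypothetical helper; only the find | xargs grep pattern comes from the report.
read_stress() {
    dir=$1
    passes=${2:-10}
    i=0
    while [ "$i" -lt "$passes" ]; do
        # -print0 / -0 keep file names with spaces intact.
        find "$dir" -type f -print0 | xargs -0 grep -i "foo" >/dev/null 2>&1
        i=$((i + 1))
    done
    echo "$i"
}
```

Typical use would be to start the large download to the drive in the background (for example with wget) and then run something like `read_stress /mnt/usb 100` while it is in progress.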

Hexxeh commented 12 years ago

I'm seeing this issue too. Possible regression seeing as I don't recall having this problem before, despite having downloaded the same file before. With the latest files, it happens every time I download the file in question.

shirro commented 12 years ago

You might want to put a serial tty on there to capture the errors.

Someone put a screenshot up in the forum of what sounds like the same issue: http://www.raspberrypi.org/forum/troubleshooting/external-hard-drive-kernel-panic#p67994

popcornmix commented 12 years ago

I have seen a kernel panic from the dwc_otg driver when copying files from the network. Strangely the same experiment doesn't fail at work (or on the machine of the colleague who knows this driver best). I had serial connected so got a call stack. Not sure if this is the same issue.

Need a test case that can be made to fail on colleague's setup.

[ 528.407851] Unable to handle kernel paging request at virtual address 88ad3e90
[ 528.415071] pgd = c6944000
[ 528.417771] [88ad3e90] *pgd=00000000
[ 528.421347] Internal error: Oops: 5 [#1]
Entering kdb (current=0xc78c6e40, pid 850) Oops: (null)
due to oops @ 0xc0225fb4
Pid: 850, comm: ifplugd
CPU: 0 Not tainted (3.1.9+ #224)
PC is at memcpy+0x114/0x330
LR is at DWC_MEMCPY+0x18/0x1c
pc : [<c0225fb4>] lr : [<c02cf8f0>] psr: 60000193
sp : c7acdb44 ip : 00000002 fp : c7acdb5c
r10: 88ad3e92 r9 : c798f908 r8 : c79f6620
r7 : c798f8c0 r6 : 0000ffff r5 : c68e2560 r4 : c79890c0
r3 : 893bc77f r2 : 3859066a r1 : 88ad3e90 r0 : ffdd0000
Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
Control: 00c5387d Table: 06944008 DAC: 00000015
[<c000f6b4>] (show_regs+0x0/0x58) from [<c007a8cc>] (kdb_dumpregs+0x38/0x60)
r4:c05dbff8 r3:00000001
[<c007a894>] (kdb_dumpregs+0x0/0x60) from [<c007d63c>] (kdb_main_loop+0x56c/0x7ac)
r6:00000005 r5:c7acdaf8 r4:c05dc204 r3:893bc77f
[<c007d0d0>] (kdb_main_loop+0x0/0x7ac) from [<c007ff50>] (kdb_stub+0x280/0x3f8)
[<c007fcd0>] (kdb_stub+0x0/0x3f8) from [<c0076678>] (kgdb_handle_exception+0x160/0x64c)
[<c0076518>] (kgdb_handle_exception+0x0/0x64c) from [<c00147e0>] (kgdb_notify+0x3c/0x74)
[<c00147a4>] (kgdb_notify+0x0/0x74) from [<c03b8810>] (notifier_call_chain+0x54/0x94)
r6:00000000 r5:00000000 r4:fffffffc r3:c00147a4
[<c03b87bc>] (notifier_call_chain+0x0/0x94) from [<c03b88a4>] (atomic_notifier_call_chain+0x28/0x30)
r8:c78c6e40 r7:00000005 r6:c04898cc r5:c7acdaf8 r4:c7acc000 r3:ffffffff
[<c03b887c>] (atomic_notifier_call_chain+0x0/0x30) from [<c03b88ec>] (notify_die+0x40/0x4c)
[<c03b88ac>] (notify_die+0x0/0x4c) from [<c00123a4>] (die+0xb0/0x364)
[<c00122f4>] (die+0x0/0x364) from [<c0018bd4>] (__do_kernel_fault+0x74/0x94)
[<c0018b60>] (__do_kernel_fault+0x0/0x94) from [<c03b8440>] (do_page_fault+0xa4/0x36c)
r8:c7acc000 r7:00000005 r6:c79944e0 r5:88ad3e90 r4:c7acdaf8 r3:c7acdaf8
[<c03b839c>] (do_page_fault+0x0/0x36c) from [<c03b87b4>] (do_translation_fault+0xac/0xb4)
[<c03b8708>] (do_translation_fault+0x0/0xb4) from [<c0008340>] (do_DataAbort+0x40/0xa8)
r7:00000005 r6:c057f404 r5:88ad3e90 r4:00000005
[<c0008300>] (do_DataAbort+0x0/0xa8) from [<c03b68dc>] (__dabt_svc+0x3c/0x60)
Exception stack(0xc7acdaf8 to 0xc7acdb40)
dae0: ffdd0000 88ad3e90
db00: 3859066a 893bc77f c79890c0 c68e2560 0000ffff c798f8c0 c79f6620 c798f908
db20: 88ad3e92 c7acdb5c 00000002 c7acdb44 c02cf8f0 c0225fb4 60000193 ffffffff
r8:c79f6620 r7:c7acdb2c r6:ffffffff r5:60000193 r4:c0225fb4
[<c02cf8d8>] (DWC_MEMCPY+0x0/0x1c) from [<c02c4c1c>] (assign_and_init_hc+0x250/0x58c)
[<c02c49cc>] (assign_and_init_hc+0x0/0x58c) from [<c02c5c5c>] (dwc_otg_hcd_select_transactions+0x11c/0x18c)
[<c02c5b40>] (dwc_otg_hcd_select_transactions+0x0/0x18c) from [<c02c8fbc>] (dwc_otg_hcd_handle_sof_intr+0xb4/0xe4)
[<c02c8f08>] (dwc_otg_hcd_handle_sof_intr+0x0/0xe4) from [<c02ca428>] (dwc_otg_hcd_handle_intr+0xd4/0x120)
r6:00000008 r5:c798f8c0 r4:00000008 r3:00000000
[<c02ca354>] (dwc_otg_hcd_handle_intr+0x0/0x120) from [<c02c7cfc>] (dwc_otg_hcd_irq+0x1c/0x28)
r7:00000000 r6:00000001 r5:60000193 r4:c797ddc0
[<c02c7ce0>] (dwc_otg_hcd_irq+0x0/0x28) from [<c02a5c54>] (usb_hcd_irq+0x48/0xc0)
[<c02a5c0c>] (usb_hcd_irq+0x0/0xc0) from [<c0080da8>] (handle_irq_event_percpu+0x68/0x258)
r6:0000004b r5:0000004b r4:c79825e0 r3:c02a5c0c
[<c0080d40>] (handle_irq_event_percpu+0x0/0x258) from [<c0080fd0>] (handle_irq_event+0x38/0x48)
[<c0080f98>] (handle_irq_event+0x0/0x48) from [<c0082c2c>] (handle_level_irq+0x90/0x108)
r4:c0586edc r3:00020000
[<c0082b9c>] (handle_level_irq+0x0/0x108) from [<c00806ec>] (generic_handle_irq+0x3c/0x50)
r4:c0593dac r3:c0082b9c
[<c00806b0>] (generic_handle_irq+0x0/0x50) from [<c000efcc>] (handle_IRQ+0x40/0x94)
[<c000ef8c>] (handle_IRQ+0x0/0x94) from [<c0008470>] (asm_do_IRQ+0x18/0x1c)
r6:f200b200 r5:60000113 r4:c029b0f4 r3:c057ae94
[<c0008458>] (asm_do_IRQ+0x0/0x1c) from [<c03b6938>] (__irq_svc+0x38/0xc0)
Exception stack(0xc7acdcf0 to 0xc7acdd38)
dce0: 00000004 00000114 00000840 c058b720
dd00: c7a4cb80 c7a4cb80 00000114 00000840 00000001 00000000 bed6e9d0 c7acdd6c
dd20: c7acdd70 c7acdd38 c029b210 c029b0f4 60000113 ffffffff
[<c029b0c8>] (smsc95xx_write_reg+0x0/0xe0) from [<c029b210>] (smsc95xx_mdio_read+0x68/0xe0)
r7:00000001 r6:c7a4cb98 r5:00000001 r4:c7a4cb80
[<c029b1a8>] (smsc95xx_mdio_read+0x0/0xe0) from [<c029a440>] (mii_link_ok+0x40/0x50)
r8:c7acc000 r7:c03e731c r6:bed6e9f0 r5:0000000a r4:c7a4cc14
[<c029a400>] (mii_link_ok+0x0/0x50) from [<c029cfcc>] (usbnet_get_link+0x50/0x5c)
r4:c7a4c800 r3:c029b1a8
[<c029cf7c>] (usbnet_get_link+0x0/0x5c) from [<c031df00>] (dev_ethtool+0x2010/0x25e4)
[<c031bef0>] (dev_ethtool+0x0/0x25e4) from [<c0319e9c>] (dev_ioctl+0x5b4/0x8e4)
[<c03198e8>] (dev_ioctl+0x0/0x8e4) from [<c0302a64>] (sock_ioctl+0xa0/0x280)
[<c03029c4>] (sock_ioctl+0x0/0x280) from [<c00f7518>] (do_vfs_ioctl+0x8c/0x590)
r7:00000007 r6:c7523380 r5:bed6e9d0 r4:bed6e9d0
[<c00f748c>] (do_vfs_ioctl+0x0/0x590) from [<c00f7a64>] (sys_ioctl+0x48/0x70)
[<c00f7a1c>] (sys_ioctl+0x0/0x70) from [<c000e000>] (ret_fast_syscall+0x0/0x48)
r7:00000036 r6:01787008 r5:00000007 r4:bed6ead8

benosteen commented 12 years ago

We may have two separate bugs then, as that doesn't look that familiar.

I'll reconnect the UART and see if I can recreate the USB heavy load one.

(Hexxeh pointed out on IRC that current draw could be a factor, I agree but I can only measure this if I power it via the GPIO pins - does this skip the polyfuse?)

Hexxeh commented 12 years ago

Pretty sure I recall reading somewhere that it /does/ indeed bypass the polyfuse.

shirro commented 12 years ago

My Pi is in transit but I thought I would grab the Debian image and run ksymoops to investigate some of the stacktraces being posted only to discover there is no System.map on the image.

It would be REALLY handy to have a System.map included with the default image. Otherwise we all have to compile our own kernels and trigger the crashes ourselves to debug these things.

popcornmix commented 12 years ago

@shirro Good point. I'll include System.map with next github update.

I think this is the map from latest github firmware. http://dl.dropbox.com/u/3669512/stable/System.map.git

I think this is the map from latest debian firmware. http://dl.dropbox.com/u/3669512/stable/System.map.deb

(I believe they are the same code, but were built on different machines, so the offsets are slightly different)

popcornmix commented 12 years ago

Okay the screenshot has: c022bd90 T DWC_MEMCPY at top of stack so looks like the same panic as my one.

If you use kernel_debug.img (from github) instead of kernel.img you should get stacktrace with function names.

shirro commented 12 years ago

ksymoops looks to be well deprecated since the 2.4 days since the kernel usually prints out the symbols these days. I must be getting old. Perhaps we need that on by default? I just grepped the number out of a pastebin mozzwald put on irc and it is DWC_MEMCPY as well. Perhaps having the html docs in there will not be such a bad thing after all :-)

http://pastebin.com/u4C98Tfq
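For traces that only show raw addresses, grepping System.map by hand works when the address is exactly a symbol start. A rough sketch of a lookup that also handles addresses inside a function is below. The name symbol_for_addr is made up for illustration, and it assumes the map file is sorted with fixed-width lowercase hex addresses, which is how System.map is normally laid out. Note that the addresses only match if the map comes from the same build as the crashing kernel.

```shell
# symbol_for_addr ADDR MAPFILE: print the last symbol whose address is <= ADDR.
# Hypothetical helper. Assumes MAPFILE is sorted by address and every address
# has the same width, so plain string comparison orders them correctly.
symbol_for_addr() {
    addr=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    # Appending "" forces awk to compare as strings, never as numbers.
    awk -v a="$addr" '($1 "") <= (a "") { sym = $3 }
                      END { if (sym != "") print sym }' "$2"
}
```

With the map popcornmix linked, `symbol_for_addr c022bd90 System.map.deb` would be expected to print DWC_MEMCPY if the screenshot came from that build.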

abishur commented 12 years ago

This thread

http://www.raspberrypi.org/forum/troubleshooting/kernel-panic-on-concurrent-network-and-usb-storage

has a screenshot of the kernel panic. I've uploaded two incidents of the panic, where I was transferring data to or from a USB-attached hard drive on the Pi.

popcornmix commented 12 years ago

Can anyone rule out a 5V power supply issue? E.g. use a 10W iPad charger with a high quality USB cable, with measured ~5V at the board, and still observe the issue. I don't think it is this, but it is something that needs ruling out.

abishur commented 12 years ago

I'm using a 5V 1A HTC charger with high quality usb cable. Does that count?

popcornmix commented 12 years ago

If you've measured the voltage between TP1 and TP2 then yes...

abishur commented 12 years ago

4.75V at full load (two usb devices, ethernet, and hdmi), and error still occurs

mozzwald commented 12 years ago

Here is boot log up to kernel panic while trying to download to USB hard drive: http://pastebin.com/u4C98Tfq

My current setup is:

Latest debian image on Transcend 4GB Class 6 card
Raspi power 5V 1.5A supply, Voltage never goes below 4.84V on the test points and current draw averages 400mA to 500mA.
USB Powered hub w/ 5V 2.1A supply
USB Powered 2.5" SATA HDD on hub
USB Optical mouse on hub
USB Keyboard on pi

benosteen commented 12 years ago

I use a 5V 1A supply, and measured voltage between TP1 and TP2 is around the 4.84V mark before, during and after. Fluctuates by 10mV or so during load. Unfortunately, I think some of the voltage drop is in the cable itself - 0.20V+ - direct voltage at the adapter is around 5.1V but I only measured that very early on.

What would be the sort of voltage drop that would be worrying? 4V? 4.5V? 4.6V?

abishur commented 12 years ago

swapped for another charger, got 4.8 across tp1/tp2 and error still occurred

popcornmix commented 12 years ago

Well I believe USB quotes 5% so 4.75V is the limit. I would expect 4.84V to be fine, so I think this isn't (5V) power related.

The guy who knows most about this driver (although this driver is written by Synopsys, so no one at Broadcom knows much about it) is going to try and reproduce this with an external USB drive. Hopefully he'll be able to see it fail.

I've seen the failure at home (copying from NFS mounted drive over network - no USB hard drive involved). But running exactly the same test on work's network didn't fail (and the driver guy couldn't reproduce it). Perhaps the USB drive is a better way of provoking it.

shirro commented 12 years ago

I added symbols to the oops from @mozzwald https://gist.github.com/2471526

larsth commented 12 years ago

To completely rule out PSU issues, maybe add an extra capacitor, so the voltage is more stable - 220 uF should be ok, and not trigger the fuse (i guess).

4.8 volts is a voltage drop of 200 mV, which is -4%, and that could be close to the edge of a +/- 5% limit.

Think: a relatively long thin wire on the RPi PCB to the BCM2835 + a large current when the oscillator creates a clock impulse = the BCM2835 creates a relatively large voltage drop over the wire, so the 4.8 volts at the power connector now becomes maybe 4.6 volts at the BCM2835, which is too low.
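The tolerance arithmetic in the last few comments can be checked with a small shell sketch. It assumes a 5000 mV nominal rail and the 5% figure popcornmix quotes; the function names are made up for illustration, and it only looks at steady-state readings such as those taken at TP1/TP2.

```shell
# Nominal USB supply and the 5% tolerance quoted in the thread, in millivolts.
NOMINAL_MV=5000
TOLERANCE_PCT=5

# deviation_tenths MV: deviation from nominal in tenths of a percent
# (for example -40 means -4.0%).
deviation_tenths() {
    echo $(( ($1 - NOMINAL_MV) * 1000 / NOMINAL_MV ))
}

# in_tolerance MV: succeeds if MV is within TOLERANCE_PCT of nominal.
in_tolerance() {
    dev=$(deviation_tenths "$1")
    # ${dev#-} strips a leading minus sign, giving the absolute value.
    [ "${dev#-}" -le $((TOLERANCE_PCT * 10)) ]
}
```

By this measure the 4840 mV and 4800 mV readings reported above pass, 4750 mV sits exactly on the limit, and the hypothetical 4600 mV at the BCM2835 would fail.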

mozzwald commented 12 years ago

Added 220 uF capacitor to input power as suggested by @larsth, problem persists. Also, tried changing power source of pi to be 5V 2.1A and USB hub to be 5V 1.5A, problem persists.

asb commented 12 years ago

Someone on the forums claims that constantly dropping caches works around the issue:

while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done &

http://www.raspberrypi.org/forum/troubleshooting/kernel-panic-on-concurrent-network-and-usb-storage/#p68752

shirro commented 12 years ago

It might work but it doesn't mean it is the solution. If I am reading the code correctly the usb driver does a memcpy to align some data to an 8 byte boundary if DMA is enabled and sometimes it is accessing memory it should not. It tests for allocation failure so perhaps the length is wrong. Needs some printk I think. I think you could load the usb driver as a module with a parameter to disable dma and that would stop this code ever being executed but that wouldn't really be an answer either. We have the source so there is no real need to guess.

rewolff commented 12 years ago

Two things....

Measuring 4.84V or even 5.1V on the RPI test points is not a guarantee of "no power supply issues". The multimeter is way too slow to notice sudden short drops in power. Suppose the USB charger has a "bug" that drops power for a millisecond every 10 seconds? The multimeter will not notice. Of course, with a full millisecond of no power the RPI will reset. (The capacitor will hold out for about 0.3 ms.) As a charger this wouldn't matter. The product would still work fine for charging cellphone batteries, so this "bug" might go unnoticed. Of course the above scenario is exaggerated. A full reset would be more obvious to RPI users. A slightly more realistic scenario would be that the RPI suddenly needs a bigger current and that the power supply takes a few ms to react to the higher current draw.

That said, it is VERY unlikely that such an issue would result in the observed effects. The crashes seem to be coming from the SAME routine every time.

popcornmix commented 12 years ago

Can anyone confirm whether:

while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done &

helps? Whilst not a solution, it is a very useful piece of data if it does work around the problem

benosteen commented 12 years ago

I'll have a go - just dd'ing the debian image fresh to my SD.

benosteen commented 12 years ago

13/04 debian image, freshly dd'd to a 2Gb SD

mount'd a USB stick with a sizeable collection of files (a rootfs)
started the drop_caches loop
wget large_file_from_github
find /mountpoint/of/usb | xargs grep "foo"

Same sort of kernel error http://www.flickr.com/photos/ben_on_the_move/6967016434/in/photostream

(Also, the serial logging of kernel panics seems to be at 115200baud, regardless of cmdline.txt settings. Is this set somewhere else? I can capture the bootup, but I was using 9600 to do so.)

popcornmix commented 12 years ago

From Gray (not directly in response to you, but this question has been asked before):

An awful lot of what is printed during the boot sequence is output by the kernel during initialization - i.e. during the set-up of devices that are later used to support the operating system implemented on the root file system.

One of the classes of devices that need to be set-up are terminal (tty) devices - so it kind-of follows that the thing being output to during this kernel initialization process isn't really a tty device. The kernel calls it (well them actually) a 'console'. The kernel command line allows you to identify and set the baud rate for these consoles and kernel output goes to them all (e.g. to the HDMI framebuffer console and to the UART console).

Each console normally ends up being presented as a separate tty device in /dev.

Once the operating system gets hold of the devices the kernel has left it, it configures them and uses them as it sees fit. In our case we do the standard thing of running a shell on just about any tty we can find. This is implemented in the file that controls what we do when control is first passed to the operating system - /etc/inittab.

In /etc/inittab each tty is read by a program 'getty' in its own process. This explains why, once you get to a log-on prompt, [1] the baud rate might change; and [2] the output is no longer the same as it is on other console/ttys. (You may have noticed that you can log on separately to a shell over the UART and a different one over the HDMI/keyboard.)

So, in short, edit /etc/inittab and change /sbin/getty -L ttyAMA0 115200 vt100 to /sbin/getty -L ttyAMA0 9600 vt100 if you want the operating system to run at 9600 baud.
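The inittab change described above could be scripted roughly as follows. This is a sketch only: set_getty_baud is a made-up helper name, and it assumes the getty line has exactly the form quoted above. Try it on a copy of the file first.

```shell
# set_getty_baud FILE BAUD: rewrite the ttyAMA0 getty line in FILE to use BAUD.
# Hypothetical helper; matches only the "/sbin/getty -L ttyAMA0 <n> vt100" form.
set_getty_baud() {
    file=$1
    baud=$2
    tmp=$(mktemp) || return 1
    sed "s|\(/sbin/getty -L ttyAMA0 \)[0-9][0-9]*\( vt100\)|\1${baud}\2|" "$file" > "$tmp" &&
        cat "$tmp" > "$file"   # overwrite in place so ownership and mode are kept
    rm -f "$tmp"
}
```

For example, `set_getty_baud /etc/inittab 9600` run as root, followed by a reboot or `telinit q` so that init rereads the file.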

mozzwald commented 12 years ago

while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done &

This actually makes the problem worse for me. Running it and then trying to download a file to the USB device causes a kernel panic instantly. Without it the file will download for a while before the kernel panic.

larsth commented 12 years ago

@shirro

memcpy? Where?

If a device driver in kernel space uses plain C memory copying from user space, instead of using the copy_from_user(9) function, then you may have found the bug we are searching for.

Very long list of where you can find the "copy_from_user" word in the kernel : http://lxr.free-electrons.com/ident?i=copy_from_user

I know that a large part of the USB stuff is in user space (AFAIK), but some of it is of course in kernel space.

popcornmix commented 12 years ago

The fault is the length passed to DWC_MEMCPY is garbage. When I added some logging the length was 3349608928. It seems the URB is getting corrupted somewhere...

larsth commented 12 years ago

@popcornmix Any possibility that it could be a pointer to an int, used as an int?
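One quick way to sanity-check the pointer idea is to print the logged length in hex and see whether it falls in the kernel's address range. The sketch below assumes the usual 3G/1G split, where kernel virtual addresses start at 0xc0000000, which is consistent with the addresses in the traces in this thread. The function names are made up for illustration.

```shell
# to_hex N: print a decimal value as 32-bit hex.
to_hex() {
    printf '0x%08x\n' "$1"
}

# looks_like_kernel_addr N: succeeds if N is at or above 0xc0000000
# (3221225472), the assumed start of kernel virtual addresses.
looks_like_kernel_addr() {
    [ "$1" -ge 3221225472 ]
}

to_hex 3349608928   # the length popcornmix logged; prints 0xc7a6f9e0
```

0xc7a6f9e0 sits in the same region as the stack and data pointers in the trace above (for example sp : c7acdb44), so the value does look like a kernel pointer rather than a length. That is suggestive only; it does not show where the corruption happens.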

shirro commented 12 years ago

Probably @larsth but as you may have guessed I don't know much about kernel internals but I know enough to recognise a likely buffer overflow which looks to be confirmed. The driver would never get into mainline, I know that much. It has a compatibility layer to ease porting and there is a macro called dwc_memcpy for a function DWC_MEMCPY which wraps a call to a memcpy and that is as far as I went down the rabbit hole. I am due to get a Pi any day now. Hopefully there will still be some bugs left. I found a few other DWC drivers referenced on an OpenWRT mailing list and it looks like some of them are considerably simpler. Since the OTG functionality isn't available on this hardware anyway I wonder if one of those other drivers wouldn't be better?

asb commented 12 years ago

Hi @shirro could you give pointers to any dwc drivers you saw? I'm pretty sure I've only seen other releases of the synopsys code for dwc2. A recent discovery is that the upstream Samsung s3c-hsotg is in fact an instantiation of the dwc. Even better, Samsung devs have been generalising it so it could be used with other versions and reasonably renamed to 'dwc2'. The code is probably a far better starting point, but unfortunately only supports peripheral mode at the moment.

See the discussion at http://thread.gmane.org/gmane.linux.usb.general/61676

shirro commented 12 years ago

The OpenWRT dev list seems to refer to several DWC drivers from different places over time and lots of patches. The one I posted on IRC I will put here for everyone: http://permalink.gmane.org/gmane.comp.embedded.openwrt.devel/12602 - there is a link to the source SztupY has taken out of the Samsung Cyanogen Android source. There is also a Fritzbox link. And I have seen mention of others on those lists. If the s3c-hsotg has a good rep I might try and give it a go when I get my Pi. Is there an official repo for it somewhere? I am guessing I probably need to grab it out of an Android kernel source?

asb commented 12 years ago

@shirro Check out the replies in the thread I linked to. There are refs there. s3c-hsotg is upstream (but device only afaik), and there is a recent patchset against it on the linux-usb mailing list. The other potential starting point is APM's version of the dwc code (they got permission to replace the license with GPL, and do at least meet kernel coding style). http://article.gmane.org/gmane.linux.usb.general/53348

It would be fantastic if you were able to help look in to some of these issues.

narensankar commented 12 years ago

We actually tried in the past to look at porting other DWC drivers to the pi. But in general the problem is that in every SOC the DWC logic is hooked up differently and once we hack up one of the alternative drivers to match our logic, it makes it impossible to update from the "official" Synopsys sources. Our official support is from Synopsys and if we break it they won't come rushing to our help.

benosteen commented 12 years ago

Logically, I guess the next series of questions are:

1 - Are Synopsys aware of this showstopper bug?
2 - Would they acknowledge the problem as being theirs?
3 - Is fixing the bug part of the support they will offer?

shirro commented 12 years ago

Sorry if it is redundant but I want to add a me too. Just got mine and was doing a git clone and copying a video from an sshfs mount to usb storage and I got the exact same error.

asb commented 12 years ago

For those who are able to reproduce, does adding vm.min_free_kbytes = 12288 to /etc/sysctl.conf or smsc95xx.turbo_mode=N to /boot/cmdline.txt alleviate the issue?

pepedog commented 12 years ago

Shoot. I was confident I had vm.min_free_kbytes=8192 (archlinuxarm), it's commented out. Restored it, rebooted, yes still have problem. I thought it was gone until 197Mb downloaded and got a panic. I tried these fixes back in November to fix an older issue, how they vanished I don't know, used 12288 figure as well. Can't understand this, sure my arch install of a month ago didn't have this problem. Will try to revert.

Edit: the drop cache loop stopped the panic.
2012-04-26 10:34:53 (1.65 MB/s) - `debian6-19-04-2012.zip.2' saved [464583238/464583238]
Sum checks fine too.

asb commented 12 years ago

Potentially relevant is that the behaviour of smsc95xx/usbnet isn't exactly stellar. It can generate an unbounded number of skbs. See Ming Lei's comments at https://bugs.launchpad.net/ubuntu/+source/linux-ti-omap4/+bug/690370

I suppose https://bugs.launchpad.net/ubuntu/+source/linux-ti-omap4/+bug/746137 also has some relevance (using the vm.min_free_kbytes workaround). There may well be multiple issues in play here.

popcornmix commented 12 years ago

Some promising news. narensankar (he's a colleague who's done a lot of work on 2835 kernel drivers) has sent me a patch and it's fixed my panic! I've just pushed it, so if you're building your own kernel, then please test. I'll commit prebuilt files tomorrow. As asb says, I don't think this is the only issue with this driver, but hopefully a significant one has gone.

mozzwald commented 12 years ago

Compiled and tested kernel with the new patch. No more kernel panics. On the other hand, ethernet performance is not great. I am using vm.min_free_kbytes = 12288 in /etc/sysctl.conf and smsc95xx.turbo_mode=Y in /boot/cmdline.txt

My compiled version is here if anyone would like to try it: md5 34185ad45c130ac124b1d8ea6477b81c http://mozzwald.com/raspi/linux-raspi-usb-fix_20120427.tar.bz2

narensankar commented 12 years ago

Consistently dropped?

By how much?

mozzwald wrote: "Compiled and tested kernel with the new patch. No more kernel panics. On the other hand, ethernet performance dropped quite a bit."

benosteen commented 12 years ago

@mozzwald Was this with or without the turbo=N boot parameter?

mozzwald commented 12 years ago

Speed problems were because I was d/ling to slow USB flash drive. Speed is ok when d/ling to USB hard drive.

pepedog commented 12 years ago

I have removed all mods to cmdline.txt and the drop cache loop; with the new kernel and firmware I don't see this problem now. Before this, though, I had a panic occur overnight when nothing was happening, maybe just cron. First time ever that happened.

rewolff commented 12 years ago

FYI, I tried reproducing the bug for a short time yesterday, but didn't succeed (i.e. before applying the patch). Although I do seem to remember having seen it once or twice when I didn't WANT to reproduce it. I'll be running a patched kernel from now on.

jmattsson commented 12 years ago

This is looking good! I have >36h uptime after applying the patch, whereas before I'd see a few crashes a day! Thanks!

raspberrypi / firmware

Interrupt collision between smsc95xx and USB storage drivers under heavy load #9