raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/
Other
11.08k stars 4.96k forks source link

mmcblk0 error -110 errors (regression?) #280

Closed oerik closed 11 years ago

oerik commented 11 years ago

On 3.8.8+ kernel with an class 10 4Gb card (unknown brand) gives me mmcblk0 errors, the 3.6.11 version didn't give me these errors (was running 2 weeks non-stop with 3.6.11 kernel). A second rpi with an Transcend SDTRHC8GB is working just fine. I'm going the replace the SD card anyway. Note, the 4Gb card doesn't support fstrim, the 8Gb does.

[13288.492420] mmc0: final write to SD card still running [13298.479242] mmc0: Timeout waiting for hardware interrupt - cmd12. [13298.480375] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900 [13298.930974] mmc0: final write to SD card still running [13308.919721] mmc0: Timeout waiting for hardware interrupt - cmd12. [13308.920845] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900 [15090.374963] mmc0: final write to SD card still running [15100.403772] mmc0: Timeout waiting for hardware interrupt - cmd12. [15100.404897] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900 [15100.856441] mmc0: final write to SD card still running [15110.884252] mmc0: Timeout waiting for hardware interrupt - cmd12. [15110.885382] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900

hvenzke commented 11 years ago

@popcornmix can do us an favor an made an DEBUG flagged kernel-firmware avalible that i.e has mutch more sd-card/OTG/USB2/.../ Hardware relalated DEBUG enabled. we have to find the holy shit BUG(s)

ottuzzi commented 11 years ago

Actually, on my system, I have not usb devices attached but I have a remote drive mounted via davfs that probably is causing quite a bit of network traffic: if I'm not mistaken even the ethernet port is attached to usb... isn't it?

P33M commented 11 years ago

Interestingly with the fiq_split branch I only get the more benign error:

[    5.089267] input: BTC  USB Keyboard as /devices/platform/bcm2708_usb/usb1/1-1/1-1.2/1-1.2.5/1-1.2.5.5/1-1.2.5.5:1.0/input/input1
[    5.157636] hid-generic 0003:046E:6000.0002: input,hidraw1: USB HID v1.10 Keyboard [BTC  USB Keyboard] on usb-bcm2708_usb-1.2.5.5/input0
[   15.219809] hid-generic 0003:046E:6000.0003: usb_submit_urb(ctrl) failed: -1
[   15.238818] hid-generic 0003:046E:6000.0003: timeout initializing reports
[   15.256451] input: BTC  USB Keyboard as /devices/platform/bcm2708_usb/usb1/1-1/1-1.2/1-1.2.5/1-1.2.5.5/1-1.2.5.5:1.1/input/input2
[   15.287568] hid-generic 0003:046E:6000.0003: input,hidraw2: USB HID v1.10 Device [BTC  USB Keyboard] on usb-bcm2708_usb-1.2.5.5/input1
[   23.046281] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null)
[   23.600613] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null)
[   30.162990] mmc0: missed completion of cmd 18 DMA (512/512 [1]/[1]) - ignoring it
[   30.180470] mmc0: DMA IRQ 6 ignored - results were reset
[   34.529606] smsc95xx 1-1.1:1.0 eth0: link up, 100Mbps, full-duplex, lpa 0xCDE1
[   38.824104] Adding 102396k swap on /var/swap.  Priority:-1 extents:1 across:102396k SS

hexxeh doesn't have the vanilla 3.8 branch available on rpi-update and my build machine is ... poorly. I will report back with a test on 3.8.

My gut feeling is that the hid probing is taking place with preemption disabled - though I have not yet traced the states of the two concurrent systems (sd card read and usbhid probe)

ghollingworth commented 11 years ago

I believe this is the same problem as the dequeue problem, basically when you get some kind of error the system tries to dequeue a transaction. This then turns off a whole bunch of interrupts whilst it waits for the channel halted interrupt (which should happen when you disable an active channel)

But I never saw this work properly and instead the channel never halts… (we basically lose a channel) it might explain the same problem.

Are you sure we're not seeing a channel halted interrupt that's being ignored because of the changes a while back to get rid of the channel halted messages? This is easy to reproduce with a stty –F /dev/ttyUSB0

Gordon

From: P33M notifications@github.com<mailto:notifications@github.com> Reply-To: raspberrypi/linux reply@reply.github.com<mailto:reply@reply.github.com> Date: Tuesday, 21 May 2013 15:46 To: raspberrypi/linux linux@noreply.github.com<mailto:linux@noreply.github.com> Subject: Re: [linux] mmcblk0 error -110 errors (regression?) (#280)

I'm going to chime in on this thread because I believe I have a minimal failure case that produces this exact result.

Because my MS Sidewinder keyboard was pretty much unusable with the Pi until the FIQ got split transaction support, I bought the cheapest keyboard on ebuyer - it's a:

Bus 001 Device 006: ID 046e:6000 Behavior Tech. Computer Corp. Device Descriptor: bLength 18 bDescriptorType 1 bcdUSB 1.10 bDeviceClass 0 (Defined at Interface level) bDeviceSubClass 0 bDeviceProtocol 0 bMaxPacketSize0 8 idVendor 0x046e Behavior Tech. Computer Corp. idProduct 0x6000 bcdDevice 1.80 iManufacturer 1 BTC iProduct 2 USB Keyboard iSerial 0 bNumConfigurations 1 Configuration Descriptor: bLength 9 bDescriptorType 2 wTotalLength 59 bNumInterfaces 2 bConfigurationValue 1 iConfiguration 0 bmAttributes 0xa0 (Bus Powered) Remote Wakeup MaxPower 98mA Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber 0 bAlternateSetting 0 bNumEndpoints 1 bInterfaceClass 3 Human Interface Device bInterfaceSubClass 1 Boot Interface Subclass bInterfaceProtocol 1 Keyboard iInterface 0 HID Device Descriptor: bLength 9 bDescriptorType 33 bcdHID 1.10 bCountryCode 0 Not supported bNumDescriptors 1 bDescriptorType 34 Report wDescriptorLength 69 Report Descriptors: \ UNAVAILABLE Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x81 EP 1 IN bmAttributes 3 Transfer Type Interrupt Synch Type None Usage Type Data wMaxPacketSize 0x0008 1x 8 bytes bInterval 10 Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber 1 bAlternateSetting 0 bNumEndpoints 1 bInterfaceClass 3 Human Interface Device bInterfaceSubClass 0 No Subclass bInterfaceProtocol 0 None iInterface 0 HID Device Descriptor: bLength 9 bDescriptorType 33 bcdHID 1.10 bCountryCode 0 Not supported bNumDescriptors 1 bDescriptorType 34 Report wDescriptorLength 50 Report Descriptors: \ UNAVAILABLE Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x82 EP 2 IN bmAttributes 3 Transfer Type Interrupt Synch Type None Usage Type Data wMaxPacketSize 0x0003 1x 3 bytes bInterval 10 Device Status: 0x0000 (Bus Powered)

The great thing (or annoying depending on perspective) about this keyboard is that it will reliably knacker my SD card with the same -110 "timeout waiting for hardware interrupt" messages. It does this when hid-generic is probing the device and seems associated with a failed usb_submit_urb call. This is fatal if I leave the device plugged in while booting - invariably my filesystem ends up corrupt.

Typical log of broken behaviour:


[ 183.029459] usb 1-1.2.5.4: New USB device found, idVendor=046e, idProduct=6000
[ 183.029491] usb 1-1.2.5.4: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 183.029507] usb 1-1.2.5.4: Product: USB Keyboard
[ 183.029522] usb 1-1.2.5.4: Manufacturer: BTC
[ 183.044149] input: BTC USB Keyboard as /devices/platform/bcm2708_usb/usb1/1-1/1-1.2/1-1.2.5/1-1.2.5.4/1-1.2.5.4:1.0/input/input0
[ 183.045032] hid-generic 0003:046E:6000.0001: input,hidraw0: USB HID v1.10 Keyboard [BTC USB Keyboard] on usb-bcm2708_usb-1.2.5.4/input0
[ 193.042827] hid-generic 0003:046E:6000.0002: usb_submit_urb(ctrl) failed: -1
[ 193.042890] hid-generic 0003:046E:6000.0002: timeout initializing reports
[ 193.045838] input: BTC USB Keyboard as /devices/platform/bcm2708_usb/usb1/1-1/1-1.2/1-1.2.5/1-1.2.5.4/1-1.2.5.4:1.1/input/input1
[ 193.047830] hid-generic 0003:046E:6000.0002: input,hidraw1: USB HID v1.10 Device [BTC USB Keyboard] on usb-bcm2708_usb-1.2.5.4/input1

Typical log if I break my filesystem:

[ 2.642966] usb 1-1: New USB device found, idVendor=0424, idProduct=9512
[ 2.657571] usb 1-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 2.672700] hub 1-1:1.0: USB hub found
[ 2.684576] hub 1-1:1.0: 3 ports detected
[ 2.972843] usb 1-1.1: new high-speed USB device number 3 using dwc_otg
[ 3.103354] usb 1-1.1: New USB device found, idVendor=0424, idProduct=ec00
[ 3.130742] usb 1-1.1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 3.165726] smsc95xx v1.0.4
[ 3.247827] smsc95xx 1-1.1:1.0: eth0: register 'smsc95xx' at usb-bcm2708_usb-1.1, smsc95xx USB 2.0 Ethernet, b8:27:eb:97:5a:1f
[ 3.362897] usb 1-1.2: new high-speed USB device number 4 using dwc_otg
[ 3.493519] usb 1-1.2: New USB device found, idVendor=1a40, idProduct=0201
[ 3.520923] usb 1-1.2: New USB device strings: Mfr=0, Product=1, SerialNumber=0
[ 3.550104] usb 1-1.2: Product: USB 2.0 Hub [MTT]
[ 3.583667] hub 1-1.2:1.0: USB hub found
[ 3.610593] hub 1-1.2:1.0: 7 ports detected
[ 3.912836] usb 1-1.2.5: new high-speed USB device number 5 using dwc_otg
[ 4.043619] usb 1-1.2.5: New USB device found, idVendor=1a40, idProduct=0201
[ 4.072730] usb 1-1.2.5: New USB device strings: Mfr=0, Product=1, SerialNumber=0
[ 4.106510] usb 1-1.2.5: Product: USB 2.0 Hub [MTT]
[ 4.128830] hub 1-1.2.5:1.0: USB hub found
[ 4.140887] hub 1-1.2.5:1.0: 7 ports detected
[ 4.179084] udevd[154]: starting version 175
[ 4.432928] usb 1-1.2.5.2: new low-speed USB device number 6 using dwc_otg
[ 4.578315] usb 1-1.2.5.2: New USB device found, idVendor=045e, idProduct=0039
[ 4.598046] usb 1-1.2.5.2: New USB device strings: Mfr=1, Product=3, SerialNumber=0
[ 4.613582] usb 1-1.2.5.2: Product: Microsoft 5-Button Mouse with IntelliEye(TM)
[ 4.628633] usb 1-1.2.5.2: Manufacturer: Microsoft
[ 4.669080] input: Microsoft Microsoft 5-Button Mouse with IntelliEye(TM) as /devices/platform/bcm2708_usb/usb1/1-1/1-1.2/1-1.2.5/1-1.2.5.2/1-1.2.5.2:1.0/input/input0
[ 4.743270] hid-generic 0003:045E:0039.0001: input,hidraw0: USB HID v1.10 Mouse [Microsoft Microsoft 5-Button Mouse with IntelliEye(TM)] on usb-bcm2708_usb-1.2.5.2/input0
[ 4.902870] usb 1-1.2.5.4: new low-speed USB device number 7 using dwc_otg
[ 5.048773] usb 1-1.2.5.4: New USB device found, idVendor=046e, idProduct=6000
[ 5.092670] usb 1-1.2.5.4: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 5.151879] usb 1-1.2.5.4: Product: USB Keyboard
[ 5.182719] usb 1-1.2.5.4: Manufacturer: BTC
[ 5.234606] input: BTC USB Keyboard as /devices/platform/bcm2708_usb/usb1/1-1/1-1.2/1-1.2.5/1-1.2.5.4/1-1.2.5.4:1.0/input/input1
[ 5.300862] hid-generic 0003:046E:6000.0002: input,hidraw1: USB HID v1.10 Keyboard [BTC USB Keyboard] on usb-bcm2708_usb-1.2.5.4/input0
[ 15.376203] hid-generic 0003:046E:6000.0003: usb_submit_urb(ctrl) failed: -1
[ 15.417770] hid-generic 0003:046E:6000.0003: timeout initializing reports
[ 15.453369] input: BTC USB Keyboard as /devices/platform/bcm2708_usb/usb1/1-1/1-1.2/1-1.2.5/1-1.2.5.4/1-1.2.5.4:1.1/input/input2
[ 15.493088] hid-generic 0003:046E:6000.0003: input,hidraw2: USB HID v1.10 Device [BTC USB Keyboard] on usb-bcm2708_usb-1.2.5.4/input1
[ 24.442569] mmc0: Timeout waiting for hardware interrupt - cmd25.
[ 24.462872] mmcblk0: error -110 transferring data, sector 1696160, nr 24, cmd response 0x900, card status 0xc00
[ 24.483431] end_request: I/O error, dev mmcblk0, sector 1696168
[ 24.500060] end_request: I/O error, dev mmcblk0, sector 1696176
[ 24.515670] Aborting journal on device mmcblk0p2-8.
[ 24.545903] EXT4-fs error (device mmcblk0p2) in ext4_reserve_inode_write:4550: Journal has aborted
[ 24.564696] EXT4-fs error (device mmcblk0p2) in ext4_reserve_inode_write:4550: Journal has aborted
[ 24.583165] EXT4-fs error (device mmcblk0p2) in ext4_reserve_inode_write:4550: Journal has aborted
[ 24.601522] EXT4-fs error (device mmcblk0p2) in ext4_reserve_inode_write:4550: Journal has aborted
[ 24.619843] EXT4-fs error (device mmcblk0p2) in ext4_reserve_inode_write:4550: Journal has aborted
[ 24.638103] EXT4-fs error (device mmcblk0p2) in ext4_reserve_inode_write:4550: Journal has aborted
[ 24.656346] EXT4-fs error (device mmcblk0p2) in ext4_reserve_inode_write:4550: Journal has aborted
[ 28.472876] EXT4-fs error (device mmcblk0p2): ext4_journal_start_sb:349: Detected aborted journal
[ 28.504677] EXT4-fs (mmcblk0p2): Remounting filesystem read-only
[ 28.521591] EXT4-fs error (device mmcblk0p2): ext4_journal_start_sb:349: Detected aborted journal
[ 28.550737] EXT4-fs error (device mmcblk0p2): ext4_journal_start_sb:349: Detected aborted journal
[ 28.563594] EXT4-fs error (device mmcblk0p2): ext4_journal_start_sb:349: EXT4-fs error (device mmcblk0p2): ext4_journal_start_sb:349:
[ 28.585499] EXT4-fs error (device mmcblk0p2): ext4_journal_start_sb:349: EXT4-fs error (device mmcblk0p2): ext4_journal_start_sb:349:
[ 28.607068] EXT4-fs error (device mmcblk0p2): ext4_journal_start_sb:349: EXT4-fs error (device mmcblk0p2): ext4_journal_start_sb:349: Detected aborted journal
[ 28.661198] EXT4-fs error (device mmcblk0p2): ext4_journal_start_sb:349:
[ 28.668143] EXT4-fs error (device mmcblk0p2): ext4_journal_start_sb:349:
[ 28.785181] Detected aborted journal
[ 28.797944] Detected aborted journal
[ 28.817712] Detected aborted journal
[ 28.904880] Detected aborted journal
[ 28.920482] Detected aborted journal
[ 28.928148] Detected aborted journalDetected aborted journal
[ 29.013488]
[ 29.077584] Registered led device: led0
[ 29.124671]
[ 32.207674] EXT4-fs error (device mmcblk0p2): ext4_remount:4584: Abort forced by user
[ 32.224515] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null)
[ 45.527322] EXT4-fs error (device mmcblk0p2): ext4_remount:4584: Abort forced by user
[ 45.563663] EXT4-fs error (device mmcblk0p2): ext4_remount:4584: Abort forced by user
[ 55.175706] smsc95xx 1-1.1:1.0: eth0: link up, 100Mbps, full-duplex, lpa 0xCDE1

—
Reply to this email directly or view it on GitHubhttps://github.com/raspberrypi/linux/issues/280#issuecomment-18213443.
popcornmix commented 11 years ago

https://github.com/Hexxeh/rpi-firmware/commits/next is vanilla 3.8, so sudo BRANCH=next rpi-update will get it.

hvenzke commented 11 years ago

@ottuzzi all devices , include the sd card , are connected via the USB(2) Path - as far as i understood the dmesg kernel output. someone is allowed to correct me..

ghollingworth commented 11 years ago

No not the SDCard that is connected through SDIO

Gordon

From: "", notifications@github.com<mailto:notifications@github.com> Reply-To: raspberrypi/linux reply@reply.github.com<mailto:reply@reply.github.com> Date: Tuesday, 21 May 2013 16:28 To: raspberrypi/linux linux@noreply.github.com<mailto:linux@noreply.github.com> Cc: Gordon Hollingworth gordon@raspberrypi.org<mailto:gordon@raspberrypi.org> Subject: Re: [linux] mmcblk0 error -110 errors (regression?) (#280)

@ottuzzihttps://github.com/ottuzzi all devices , include the sd card , are connected via the USB(2) Path - as far as i understood the dmesg kernel output. someone is allowed to correct me..

Reply to this email directly or view it on GitHubhttps://github.com/raspberrypi/linux/issues/280#issuecomment-18216538.

Ferroin commented 11 years ago

@remsnet The SD card is connected through an independent SDHCI interface on the CPU itself.

P33M commented 11 years ago

Edit: My somewhat artificial issue is diverging a bit but bear with me - I think this one is still lurking.

I can reproduce this on 2 different types of SD card (Sandisk 4GB class 4, Kingston 8GB class 10) and with both 3.6 and 3.8 branches. fiq_split doesn't give me the same behaviour, only the benign "missed completion" messages.

If I boot with keyboard attached but eliminate the probing with hid quirks, the timeout errors disappear (presumably because I just told Linux not to probe the boot subclass interface for the keyboard)

I note for the previously recorded errors that other people have seen, the preceding line:

[ 1104.143688] mmc0: final write to SD card still running
[ 1114.157600] mmc0: Timeout waiting for hardware interrupt - cmd12.

is quite telling. The sdhci-bcm2708 driver has already waited 150ms for the stop command to go through as per the default sync_after_dma=1 so this timeout may be genuine - bug is noted in the code comments.

By default the timeout that sdhci specifies is 10 seconds. In my case I can reliably break the same timeout because - guess what - the HID probing timeout is also 10 seconds.

This behaviour in itself seems broken in the sdhci layer - basically the sdhci irq top layer will schedule host->finish_tasklet on completion of a SD card command (whatever it may be, DMA or otherwise). If however the timeout timer expires before this tasklet is scheduled AND preemption or interrupts are disabled then when re-enabled the timer softirq will run before the finish tasklet (softirq pecking order). The sdhci timeout forces the finish_tasklet to discard all data and return -ETIMEDOUT, even though the command may have completed successfully. This is fatal if we have a high latency on either the command completing or preemption (10 seconds is a bit ridiculous, but what about a slow SD card having to do a lot of read-erase-write for multiple sectors coupled with a badly timed preempt_disable()?).

I will poke around a bit more - see what happens if I bump finish_tasklet to the high priority schedule.

hvenzke commented 11 years ago

well maybe thats still the same as #58 issue

sdhci-bcm2708.missing_status=0 sdhci-bcm2708.sync_after_dma=0 at cmdline.txt required per default for class 6 and 10 cards ?!?...

EDIT1:
popcornmix ´s patch realated to that was made at 3.1 kernel patches-3.1/0940-add_sync_after_dma_module_parameter.patch

EDIT2: while googling for "sync_after_dma=" near 99,99% of the results recoment set explicit it to 0 as it caused usb timeouts when set to 1 Thus - sorry - i thing thats an bad coded patch. seems Not fit to class 6 and 10 cards.

licaon-kter commented 11 years ago

I get some of these too on my model B w/ 256Mb, I'm up to BRANCH=next, only ethernet connection, accessing via ssh, usually updating some git ( like dwb browser or i3wm ), Sandisk Ultra 8Gb Class10 30mb/s (SDSDU-008G-U46):

May 18 23:55:42 raspberrypi vmunix: [ 270.869777] mmc0: final write to SD card still running May 18 23:55:52 raspberrypi vmunix: [ 280.819080] mmc0: Timeout waiting for hardware interrupt - cmd12. May 18 23:55:53 raspberrypi vmunix: [ 280.820254] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900 May 19 00:03:35 raspberrypi vmunix: [ 744.194226] mmc0: final write to SD card still running May 19 00:03:48 raspberrypi vmunix: [ 754.216707] mmc0: Timeout waiting for hardware interrupt - cmd12. May 19 00:03:48 raspberrypi vmunix: [ 754.217878] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900 May 19 02:13:54 raspberrypi vmunix: [ 6903.350661] mmc0: final write to SD card still running May 19 02:14:05 raspberrypi vmunix: [ 6913.317326] mmc0: Timeout waiting for hardware interrupt - cmd12. May 19 02:14:05 raspberrypi vmunix: [ 6913.317500] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900 May 22 12:37:52 raspberrypi vmunix: [295812.804925] mmc0: final write to SD card still running May 22 12:38:02 raspberrypi vmunix: [295822.771242] mmc0: Timeout waiting for hardware interrupt - cmd12. May 22 12:38:02 raspberrypi vmunix: [295822.772422] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900

hvenzke commented 11 years ago

@licaon-kter, with that 0x900 msgs indicates an unfinished & corrupped ext4 jornal data. the final jornal write blocked due to sdcard timeout , thus the ext4 goes in bad mode.with fs flag recover_required

save asap somewhere your data ...!!

popcornmix commented 11 years ago

Anyone's problem's fixed with "sdhci-bcm2708.sync_after_dma=0" added to command line?

hvenzke commented 11 years ago

rasberian default changed the cmdline.txt at my own base Image

from

dwc_otg.lpm_enable=0 console=ttyAMA0,115200 kgdboc=ttyAMA0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline rootwait

to

EDIT1 ( thanks to @licaon-kter )

 console=ttyAMA0,115200 kgdboc=ttyAMA0,115200  elevator=noop sdhci-bcm2708.missing_status=0 sdhci-bcm2708.sync_after_dma=0 profile=2 loglevel=7  console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 rootflags=commit=120,data=writeback rootwait ro

will report soon what happen wiile stress the SD card with compile 3.9.y kernel

licaon-kter commented 11 years ago

@remsnet: you can clean up cmdline.txt a bit since they're AFAIK default anyway:

dwc_otg.microframe_schedule=1 sdhci-bcm2708.enable_llm=1 dwc_otg.fiq_fix_enable=1 dwc_otg.lpm_enable=0

and listed twice each:

sdhci-bcm2708.missing_status=0 sdhci-bcm2708.sync_after_dma=0

Ferroin commented 11 years ago

I'm kind of curious, has anyone had this problem with the rpi-3.9.y branch?

hvenzke commented 11 years ago

@Ferroin since the bmc2708 sdhc code not changed mutch , the ERRORS shuold be the same. and thats why i thinking the reported been an serious issue for ALL woldwide deployed RPI

licaon-kter commented 11 years ago

my logs are from 3.9.y

Ferroin commented 11 years ago

@remsnet This isn't just caused by the SDHCI code, but by the interaction of the SDHCI driver with the other drivers, thus changes outside of the driver might affect it as well. Also, as for the ALL RPI part, that is patently false as not all RPi's are running Linux (Although almost every other OS for it has it's own unique problems...).

hvenzke commented 11 years ago

@Ferroin , ok thanks to make that more clear to me as the BMC2708 design PIC i seen mentioned it all via USB somehow. And well its mabee only 98% running linux the rest other , witch is still +1million RPI run linux... Sorry for my ignorance somehow . But the reported looks like an serios issue .

I want run my PIs for years without any interactions , same i did with my old i386/40 till the cpu died 2003 after run 7/24 since 1991.. Stabilty are here my focus and absolut required

While the stress tests - Fresh installed 2013-02 rasberian running plain 3.6.11 rasberian kernel - with stressing past 14h at RPI model B with 64GB class 10 shxc card with kernel 3.9.y compile and samba4.06 compile at same time .. with the folowing set of RPI RUN options at fstab , config.txt , cmdline.txt

  • not used the raspi-config for resize.
  • Modifys the fstab, config.txt , cmdline.txt was done BEFOR the first reboot ( login as pi , sudo bash ... vi .. )
  • fsck was forced befor modify the partions with fdisk , then delete primary 2 , add 2 with same start block but only 4G , then fdisk write , reboot , resize , add 3 & 4 reboot
~ $ cat /etc/fstab
proc            /proc           proc    defaults          0       0
/dev/mmcblk0p1  /boot           vfat    defaults          0       2
/dev/mmcblk0p2  /               ext4    defaults,noatime,nodiratime  0       1
/dev/mmcblk0p3  swap           swap    defaults          0       0
/dev/mapper/System0-LOG_lv  /LOG               ext4    defaults,noatime,nodiratime  0       1
/dev/mapper/System0-usr_src_lv  /usr/src               ext4    defaults,noatime,nodiratime  0       1
/dev/mapper/System0-backup_lv  /backup               ext4    defaults,noatime,nodiratime  0       1
~ $ cat /boot/cmdline.txt
console=ttyAMA0,115200 kgdboc=ttyAMA0,115200  elevator=noop sdhci-bcm2708.missing_status=0 sdhci-bcm2708.sync_after_dma=0 profile=2 loglevel=7  console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 rootflags=commit=120,data=writeback rootwait ro
~ $ cat /boot/config.txt
# uncomment this if your display has a black border of unused pixels visible
# and your display can output without overscan
disable_overscan=1
#uncomment to overclock the arm. 700 MHz is the default.
arm_freq=650
# for more options see http://elinux.org/RPi_config.txt
core_freq=250
sdram_freq=400
over_voltage=0
gpu_mem=32
 ~ $ cat /etc/sysctl.conf
# rpi tweaks
vm.swappiness=1
vm.min_free_kbytes = 8192
# Increase defaults for IPC (bnc#146656)
kernel.msgmax = 65536
kernel.msgmnb = 65536
kernel.sem = 250 256000 32 1024
# By default this script does nothing.
# Increase TCP Buffers to 16 MB
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.rmem_max=16777216
net.core.wmem_max=16777216

  • my SD cards seems significant faster , boots in less then 12 seconds - befor that took up to 90sek
  • no corruption ( YET ) .
  • no more false IO optimszing , due the chosen settings . in fact i not get any errors seen bevor(YET ).

    will report later with 3.8.13 and 3.9.y kernels this weekend

hvenzke commented 11 years ago

@popcornmix due samba4 requirements , see https://wiki.samba.org/index.php/Samba_4/OS_Requirements ,

This kconfig´s are nessary to be Included at default rasberian kconfig please.

  CONFIG_EXT3_FS_XATTR=y
  CONFIG_EXT3_FS_SECURITY=y
  CONFIG_EXT3_FS_POSIX_ACL=y
  CONFIG_EXT4_FS_XATTR=y
  CONFIG_EXT4_FS_SECURITY=y
  CONFIG_EXT4_FS_POSIX_ACL=y
popcornmix commented 11 years ago

We use

CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y

CONFIG_EXT4_FS_XATTR does not exist. Have you tried samba4?

hvenzke commented 11 years ago

Updated today from samba 4.0.5 to 4.0.7 .

~ :/usr/src/GIT/samba4/samba-v4-0-test# ./bin/smbd --version Version 4.0.7-DEVELOPERBUILD

I RUN samba 4 AD enabled pdc/bdc with RPI model B here . This been why i.e need an stable kernel with no issues on stablity.

The rasberian RPI samba4.0.0 betta2 package been fully outdated . latest samba4 contains and Buildin sldapd and dns .

Ferroin commented 11 years ago

CONFIG_EXT4_FS_XATTR is always true in recent kernels, and is also certain to be true if CONFIG_EXT4_FS_POSIX_ACL is true, because xattrs are how ext4 stores acls.

P33M commented 11 years ago

I can reproduce this on fiq_split and 3.8.

May 25 16:27:02 raspberrypi kernel: [  248.220767] mmc0: final write to SD card still running
May 25 16:27:12 raspberrypi kernel: [  258.231703] mmc0: Timeout waiting for hardware interrupt - cmd12.
May 25 16:27:12 raspberrypi kernel: [  258.232864] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900
May 25 16:27:47 raspberrypi kernel: [  293.491357] mmc0: final write to SD card still running
May 25 16:27:57 raspberrypi kernel: [  303.491706] mmc0: Timeout waiting for hardware interrupt - cmd12.
May 25 16:27:57 raspberrypi kernel: [  303.492884] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900
May 25 16:30:50 raspberrypi kernel: [  476.844081] mmc0: final write to SD card still running
May 25 16:31:00 raspberrypi kernel: [  486.791691] mmc0: Timeout waiting for hardware interrupt - cmd12.
May 25 16:31:00 raspberrypi kernel: [  486.792885] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900
May 25 16:31:01 raspberrypi kernel: [  487.738701] mmc0: final write to SD card still running
May 25 16:31:11 raspberrypi kernel: [  497.691691] mmc0: Timeout waiting for hardware interrupt - cmd12.
May 25 16:31:11 raspberrypi kernel: [  497.692855] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900
May 25 16:31:12 raspberrypi kernel: [  498.622312] mmc0: final write to SD card still running
May 25 16:31:22 raspberrypi kernel: [  508.571694] mmc0: Timeout waiting for hardware interrupt - cmd12.
May 25 16:31:22 raspberrypi kernel: [  508.572879] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900
May 25 16:31:23 raspberrypi kernel: [  509.301761] mmc0: final write to SD card still running
May 25 16:31:33 raspberrypi kernel: [  519.251692] mmc0: Timeout waiting for hardware interrupt - cmd12.
May 25 16:31:33 raspberrypi kernel: [  519.252860] mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900
May 25 16:32:12 raspberrypi kernel: [  558.843552] mmc0: final write to SD card still running
May 25 16:32:22 raspberrypi kernel: [  568.811691] mmc0: Timeout waiting for hardware interrupt - cmd12.

This is with a class 4 Sandisk 4GB SDHC card which has never given me trouble (at least until now). Heavy read/write of the card at the time - was compiling wayland. It appears to have caused data corruption but not filesystem corruption - several compile errors were for bogus characters in generated files which I subsequently found to be truncated with garbage at the end.

With sdhci_bcm2708.sync_after_dma=0, there's no such timeouts occurring and additionally the compile has produced a valid result (i.e. working weston).

oerik commented 11 years ago

I can confirm that with Linux 'raspberrypi 3.8.13+ #454 PREEMPT Fri May 17 15:58:19 BST 2013 armv6l GNU/Linux' and sdhci_bcm2708.sync_after_dma=0 the timeouts are gone! Everything seems ok after boot, I leave the system running for a night and report the status tomorrow.

popcornmix commented 11 years ago

Interesting. sdhci_bcm2708.sync_after_dma=0 is desirable as it avoids a non-standard sleepy/busy waiting in the driver. However if you search around for that, I think you'll find a number of people reporting corruption with sdhci_bcm2708.sync_after_dma=0.

I'm not too familar with the code, but there is a slightly suspicious commented out spinlock nearby.

I wonder if a workaround is required, but this one is buggy.

hvenzke commented 11 years ago

Feeedback on with rpi-update 3.8.13 and compile the 3.9.3 kernel ( 3.9.y ) : No timeouts seen, no corruption. at moment im install the 3.9.3 and will try hurt it ...

So the issue - forgive me - been the sdhci_bcm2708 timings & IO truning behavior.

@popcornmix find an Linux freak that knows the SD Card standards ... and fix it.

Anybody here that can help him with review the code ?! .. ( my own coding skills not fit ) .

Ferroin commented 11 years ago

The claims of corruption caused by sdhic-bcm2708.sync-after-dma=0 seem interesting to me, I haven't had such problems except when the system crashes (and corruption is expected then anyway). I'm planning on switching my rpi's to use nilfs2 instead of ext4 for the root fs on the SD card, it will be interesting to see if the different I/O load created by nilfs2 affects anything.

hvenzke commented 11 years ago

nilfs2 don´t matter mutch. While passt weeks ive tested xfs , ext2/3/4 well with jffs(2) and its NOT significat faster. The while tested , only XFS beats ext4. Cant ´t yet test ZFS as root fs , due the nativ port not compile on RPI ( yet ) .

licaon-kter commented 11 years ago

What about F2FS ?

hvenzke commented 11 years ago

hav´nt tested F2FS . if you have time enoth test it yourself. The most important performance settings are archived as follows:

mount options : noatime,nodiratime
kernel options : elevator=noop stable sdcard class 6/10 disks : sdhci-bcm2708.missing_status=0 sdhci-bcm2708.sync_after_dma=0

just as I allready said : someone shuold assist popcornmix with the sdcard code . The current not 100% fit for class 6/10 with size +16G without kernel the sdhci-bcm2708 options .

Ferroin commented 11 years ago

@remsnet I don't mean in terms of efficiency, I mean in terms of whether it is more corruption resistant.

oerik commented 11 years ago

My problem system has been up for 21 hours hours without any problems, my problem is fixed. However, I'm curious what can be learned from this.

hvenzke commented 11 years ago

@Ferroin , ah got you .. In terms of FS corruption resistant .. XFS beats all . Untill the "HARDWARE" realated timeout things wee seen with i.e 0x900 msgs.

Why XFS ? When an today ext2/3/4/jffs/other goes corrupt , the only way to restore the data been via lost+found with fsck -y -f With XFS you have the XFS tools , and with that you REBUILD the Jornal without data lost ( untill hardware ok , dead blocks marked as sutch , xfs ask you what to do with if one dead found.)

@oerk , well - looks like we learned not trust 100% the given code , and try find an workarround. if that way not work , an spec specialist are required... IO must be stable , thats the path it MUST go , without guessing that it does.

There may be possibly BUGS arround with SPINLOCK inside the sdhci-bcm2708 and the IO Optimiszing for class 6/10 cards.

Ferroin commented 11 years ago

@remsnet While that may be true recovery wise, the log structure in nilfs combined with the way recovery happens on mounting the fs means that it is almost as good as a transactional fs regarding corruption (ie, there is only a very small window in which a crash can occur that will cause a file operation to be only partially complete; most of the time, file operations are almost atomic.). I care less about recovering data than I do about minimizing downtime after a crash.

popcornmix commented 11 years ago

I don't think we can switch to sdhci-bcm2708.sync_after_dma=0 due to: http://www.raspberrypi.org/phpBB3/viewtopic.php?f=28&t=12097&start=425#p161101

where it caused:

[    7.328991] mmcblk0: r/w command failed, status = 0x80000900
[    7.354533] end_request: I/O error, dev mmcblk0, sector 15564536
[    7.368850] Buffer I/O error on device mmcblk0, logical block 1945567
[   10.846507] mmcblk0: r/w command failed, status = 0x80000900
[   10.870303] end_request: I/O error, dev mmcblk0, sector 122864
[   10.884504] Buffer I/O error on device mmcblk0p1, logical block 14334
hvenzke commented 11 years ago

@Ferroin snapshootig don´t help against a fully lost fdisk tab. What i mean ?

While stressting past Weeks , the whole sdcard Partion Table(fdisk tab ) where completely GONE. RPI has regulary NO second root disk, so the journal/ZIL gone , the FS goes either with nilfs/zfs corrupt.

You suffer on all FS types when Partion table been lost untill you save the journal(s) / ZIL on seperat disk and mirror the root disk.

You suffer as well when random sector writes due false IO optimiszing happen to the boot sector parts of an partion.

-> you still need to keep an the partion table details in an seperat secured place ->> i.e backup dvd-rw ->> i.e seperat OCS server that gather the system inverntory

When fstab table gone you can reenter them , but some don´t allow to partion access fully restored that way. As i don´t have tested nil2fs with that , i can not know if it will work again with an former deleted/vanished fdisk tab

Well nilfs maybe more transactional like zfs .. Im an Solaris/ZFS freak - i know how destaster recover the ZIL ... ( wrote my own Wiki for that) and thats hard when code goes wrong like the sdhci-bcm2708 does .

FYI : XFS has changed at kernel 3.3.0, its an transactional fs since then. And i know that it works with deleted/vanished fstab XFS exist since +20years and its OpenSource

hvenzke commented 11 years ago

@popcornmix he been using an outdated kernel with USB/audio.

The most users i spoken past 30 weeks had allmost weak powersupplies with allmost less then 700mA with also unstable / unfitting voltages... .. while with external devices on RPI connected...

min. ~ 700 mA Model B standlone with +16G card while BOOT( my 64G SD use 970 mA peak at boot ) plus 150mA with keyboard and mouse plus 250-600 mA for USB Audio depends on what device..

So, befor say thats not possible the POWER stablity must be checked. Most Users require an powered HUB for theire external extra devices, but use an unpowerd and suffer then. See the msg from i.e Jim JKla in that thread.

Take an Rigth funcional powersource to an PI been the Most important to RUN the PI in stable Conditions But that End users work - not at task for RPI linux kernel .

First of my own used powersources was the cheapest .. what happen ? they blow UP ... EXPLODE within 3 weeks, pi was malfunction during boot, usb keyb i.e not worked.

hvenzke commented 11 years ago
ewheelerinc commented 11 years ago

Hello all,

I'm having the same mmcblk0 with errors like these and stumbled across thread: mmc0: Timeout waiting for hardware interrupt - cmd25. mmc0: Timeout waiting for hardware interrupt - cmd12. mmc0: Timeout waiting for hardware interrupt - cmd13. mmcblk0: error -110 sending status command, retrying mmc0: Timeout waiting for hardware interrupt - cmd13. mmcblk0: error -110 sending status command, retrying mmcblk0: error -110 sending status command, retrying mmc0: Timeout waiting for hardware interrupt - cmd13. mmcblk0: error -110 sending status command, retrying mmc0: Timeout waiting for hardware interrupt - cmd13. mmcblk0: error -110 sending stop command, original cmd response 0x900, card status 0x900

I can reliably reproduce the error in either of 3.6.y and 3.8.y by copying a staged root filesystem from ext4 onto a ZFS filesystem. The RPi uses stock clocking and I have tried multiple power supplies (none of which I have ever had trouble with before).

Reading through the notes above, I tried the rpi-3.6.y branch, copied my kernel .config, rebuilt, rebuilt ZFS, and I am still having trouble. It takes much longer before I start getting the mmc errors.

Incidentally, I tried the sdhci-bcm2708.sync_after_dma=0 fix in 3.8.y, but then then the kernel wouldn't even mount the root filesystem, it would go strait to the mmc0...cmdXX errors---so that is certainly not an option for me.

I'm using a 16GB PNY class-10 (which boasts 20mb/sec on the label and achieves 19MB/s in 4MB write blocks.) The card is labeled SD16G10133MMC2HB in barely readable print on the back. I have two of these cards, both give the same problem (both tested in 3.8.13+, only one was test in 3.6.y).

While the specific configuration may not be relevant since others are having problems with different configurations, here we go: ZFS is configured in ZRAID-1 on 4 SD-card partitions so we can reliably scrub bitflips on cheap media.
mmcblk0: p1 p2 < p5 p6 p7 p8 > p4 ^p5-8: ZFS ^ ^^ p4: ext4 root@rpi:/sys# fdisk -lu /dev/mmcblk0

Disk /dev/mmcblk0: 16.0 GB, 16012804096 bytes 4 heads, 16 sectors/track, 488672 cylinders, total 31275008 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x0006a906

    Device Boot      Start         End      Blocks   Id  System

/dev/mmcblk0p1 16 129040 64512+ c W95 FAT32 (LBA) /dev/mmcblk0p2 129041 3906248 1888604 5 Extended /dev/mmcblk0p4 3906249 31275007 13684379+ 83 Linux /dev/mmcblk0p5 131072 1064959 466944 fd Linux raid autodetect /dev/mmcblk0p6 1073152 2007039 466944 fd Linux raid autodetect /dev/mmcblk0p7 2015232 2949119 466944 fd Linux raid autodetect /dev/mmcblk0p8 2957312 3891199 466944 fd Linux raid autodetect

hvenzke commented 11 years ago

@ewheelerinc
You use native zfs right ? fuse-zfs are not realy ready for rootfs up to my knowlege. if native , what compile options please. I don´t get yet the native compile with 3.8.13 / 3.9.3 on rpi yet. filed an BUG report some days ago..

and what powersuply you use ..?

ewheelerinc commented 11 years ago

@remsnet : I'm cross-compiling zfs-0.6.1 (native) for whatever kernel I build (3.6, 3.8, 3.9 is broken). It works great, just hangs under load with the mmc errors which I believe to be related more to whatever causes bug#280 than to ZFS itself. I suppose I could format a USB stick with ZFS to confirm that it is not a ZFS issue.
((Building ZFS on RPi is offtopic to this thread, so point me at your bug thread if you want more detail. I plan to blog an article on the subject at some point as well.))

I've used both my PC's USB port and a wall plug rated at 1A. No apparent difference.

Does anyone else have an idea what is causing the mmc0 error's we are getting?

What's the next step in troubleshooting?

I have a reliable way to test, takes about 20 minutes for a result. Send me patches or suggestions and I'll be glad to test them.

-Eric

ewheelerinc commented 11 years ago

Also of note, I'm using the "next" firmware branch, no change.

blueridge2 commented 11 years ago

All

For the Rasperry Pi, for the root file system I use ext3fs, because it rock solid stable. We should use ext2fs in my humble opinion or a find a flash card that has micro processor on it that guanenties writes occur or do not occur.

I do all of my development on an nfs mount file system, using cross compilers, which is really neat.

Chip

On 5/27/2013 6:55 PM, Remsnet LTD, wrote:

@ewheelerinc https://github.com/ewheelerinc

You use native zfs right ? fuse-zfs are not realy ready for rootfs up to my knowlege. if native , what compile options please. I don´t get yet the native compile with 3.8.13 / 3.9.3 on rpi yet. filed an BUG report some days ago..

— Reply to this email directly or view it on GitHub https://github.com/raspberrypi/linux/issues/280#issuecomment-18517430.

ewheelerinc commented 11 years ago

@blueridge2 : ext3/4 will burn out the journal blocks over time. Unless you go with an industrial-class SD card with really great wear-leveling the NAND's lifetime will be shortened by using ext3/4. F2FS is the most promising for block-based NAND cards like SD cards, but it doesn't include a block-by-block checksum to detect bit-flips and the like. ZFS can actually scrub bad blocks if its configured right.

ewheelerinc commented 11 years ago

I believe I've found the fix!

set sdhci-bcm2708.enable_llm=0 on your kernel command line.

It looks like the bug was introduced here:

https://github.com/raspberrypi/linux/pull/72

For the first time ever, after trying various different things for over a week, I have successfully copied a root filesystem onto my ZFS tree. Woot!

Looking at drivers/mmc/host/sdhci.c it appears that enable_llm is only in force if CONFIG_PREEMPT is enabled. Keeping CONFIG_PREEMPT enabled and setting enable_llm=0 fixed it for me. I'm going to try again using the 3.8.13+ kernel (this test worked on the 3.6.11+ kernel) and see if it still works.

-Eric

hvenzke commented 11 years ago

@ewheelerinc great work.

Looks like the sdhci-bcm2708 are NOT realy LLM capable, right ?

Espescaly NOT when use SD cards +8G with class 6 / 10

@popcornmix / @ewheelerinc is it possible to add an USER waring into the sdhci-bcm2708 for that ? We shuold let the PI users know abount possibly way of solution(s).

add an reverence document ... i.e under the rpi-3-8.y/Documentation/arm/bmc2708/sdhci-bcm2708.txt I wuold voluteer sutch if interests ..

ewheelerinc commented 11 years ago

On Tue, 2013-05-28 at 21:07 -0700, Remsnet LTD, wrote:

@ewheelerinc great work.

Looks like the sdhci-bcm2708 are NOT realy LLM capable, right ?

It could be a deadlock or race condition (less likely I think on one CPU). Someone who knows that code really well could scrutinize it and possibly provide further troubleshooting. There may be a way to have low latency and safe locking.

Who maintains that part of the code?

ewheelerinc commented 11 years ago

oops.

and now, with formatting:

On Tue, 2013-05-28 at 21:07 -0700, Remsnet LTD, wrote:

@ewheelerinc great work. Looks like the sdhci-bcm2708 are NOT realy LLM capable, right ?

It could be a deadlock or race condition (less likely I think on one CPU). Someone who knows that code really well could scrutinize it and possibly provide further troubleshooting. There may be a way to have low latency and safe locking. Who maintains that part of the code?