raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

SDCard corruption on RPI2 #397

Closed NitroG42 closed 7 years ago

NitroG42 commented 9 years ago

Lots of people seem to be affected by an issue with SD cards. I'm using a Samsung Evo 16 GB micro SD card and, using Raspbian, I encounter corruption on the SD card every time. It's easy to reproduce:

I need to check on my linux system (I'm at work) if I can fix the card at this step or not. I flashed the raspbian img multiple times and it doesn't work.

It can also be reproduced just by making the RPi reboot multiple times through the terminal (using sudo reboot).

I have 3 of them so I hope it's just a firmware bug (I'll try with another one to be sure it's not the SD card itself).

Here are two threads that gathered reports of this issue (without anyone creating a post here though :/ ): http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=101183&p=703772&hilit=error+110#p703772 http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=98935

One post is interesting :

I am using Transcend UHS-I 1U 16GB Class 10 i have tried 4 of this card and same error with all four, i have also tried with 3 different Rpi2 and i could reproduce this error on all of them.

If you want card info, I can provide it, but you need to tell me what to run on which system, because I didn't find a way to print SD card characteristics from Mac OS X.

popcornmix commented 9 years ago

@ghollingworth does have a Samsung EVO sdcard that he can provoke into corrupting data. He's built an fpga based sdcard analyser that can produce a log of all commands and responses and he's caught an error coming back from the sdcard. He's just got to work out what exactly it is the card is unhappy about and how to avoid it.

For now, Transcend and Samsung EVO cards are best avoided. Other cards don't seem to suffer in the same way. We're pretty sure that a future kernel update will make these cards reliable again and we'll post here when there is something to test.

NitroG42 commented 9 years ago

Holy **, that was a freaking fast answer. Thank you for the update, I'll watch the topic for future updates.

lurch commented 9 years ago

Duplicate of #372 ?

ghollingworth commented 9 years ago

If your failure is 100% reproducible then it would be interesting to see. Currently I'm having trouble reproducing the problem (I've only seen it twice in the last week), and that makes it very difficult to understand what's going wrong.

Gordon

NitroG42 commented 9 years ago

In my first post, I explain how to reproduce it on my card. Basically, after a fresh install of Raspbian, I create a file ( sudo touch /forcefsck ) to run fsck on next boot, I reboot and then lots of errors are found (and it crashes in a very beautiful way).

ghollingworth commented 9 years ago

My question is: is it 100% reproducible? Does it happen without fail every time you boot in this way?

NitroG42 commented 9 years ago

Well, I tried 3 or 4 times in a row (with a fresh install each time) at the time I created the issue. I'll try again tonight to see if it does the same, but I didn't see a "no-error" install.

CyrussM commented 9 years ago

Hi (and sorry for my bad English) ;) I have a Transcend UHS-I 1U 16GB Class 10 and a Sandisk 16GB. I installed the Raspbian image 2-3 weeks ago and it works fine 24/7. But after 1-2, sometimes 3, reboots the RPi2 produces lots of (filesystem) errors and can't boot up. If I remove the power, the RPi2 boots without errors. After this happened again and again, I cloned the system on the Transcend with dd to a Sandisk 16GB Class 10. All problems are solved: reboots cause no problems anymore (with the same system, only a clone/copy from one SD card to another).

johalareewi commented 9 years ago

I have an RPi2 with OpenELEC and got a similar problem with a Kingston 32GB class 10. After a successful install and setup, a reboot of the Pi2 failed with:

* Error in mount_storage: mount_common: could not mount /dev/mmcblk0p2 *

Starting debugging shell... type exit to quit

sh: can't access tty; job control turned off #

Now using a different card.

ghollingworth commented 9 years ago

Are you using NOOBS to install the software or an image?

Are you updating the image before rebooting?

Gordon

On 10/04/2015 15:08, "johalareewi" notifications@github.com wrote:

I have RPi2 and Openelec and got similar problem with Kingston 32GB class 10. After a successful install and setup, a reboot of the Pi2 failed with.

* Error in mount_storage: mount_common: could not mount /dev/mmcblk0p2 *

Starting debugging shell... type exit to quit

sh: can't access tty; job control turned off #

Now using a different card.

johalareewi commented 9 years ago

Using an image: the OpenELEC image (for RPi2) from http://openelec.tv/get-openelec, written to a Kingston 16GB class 10 U-1 card using win32diskimager.exe on Windows 7. After the initial Pi2 startup, OpenELEC goes through the setup. I installed a few Kodi options, then did a reboot, and that is when the error message appeared.

ghollingworth commented 9 years ago

What are the minimal steps required to guarantee it will fail? Include versions of software and links. Do you actually need to install stuff, or will it corrupt without this?

Thanks Gordon

moskichi commented 9 years ago

The minimal steps are: install the official OpenELEC or Raspbian image on a Samsung Evo 16 U-1 (it doesn't matter how you install it), and switch off the Pi while it is writing to the SD card.

I saw a possible solution in another forum, but I haven't tested it yet: http://openelec.tv/forum/124-raspberry-pi/75281-openelec-5-0-3-still-corrupts-sd-card-on-pi2?start=15#132032

popcornmix commented 9 years ago

@moskichi Switching the Pi off while writing to the sdcard is expected to cause corruption (with any memory device on any platform). Always shut down before removing power. We're interested here in repeatable causes of corruption that involve shutting down cleanly.

popcornmix commented 9 years ago

@NitroG42 I wonder if you could test this kernel: https://dl.dropboxusercontent.com/u/3669512/temp/kernel7.img By default it should be the same as the current "rpi-update" kernel, but supports some debug options that can be enabled through cmdline.txt. Can you add to cmdline.txt

bcm2835_mmc.mmc_debug=0x1fff

You should see: "mmc_debug:1fff" and "Forcing PIO mode" in dmesg log, and see a reduction in performance. I'd like to know if you still see corruption.
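A minimal sketch of the dmesg check described above, for anyone who wants to script it. The function name check_mmc_debug is made up for illustration; the two strings it looks for are the ones quoted in this comment, and it only covers the 0x1fff case.

```shell
# Hypothetical helper: reads a dmesg log on stdin and reports whether
# both messages expected with mmc_debug=0x1fff are present.
check_mmc_debug() {
  local log
  log="$(cat)"
  if printf '%s\n' "$log" | grep -q 'mmc_debug:1fff' &&
     printf '%s\n' "$log" | grep -q 'Forcing PIO mode'; then
    echo "debug option active"
  else
    echo "debug option not seen"
    return 1
  fi
}
```

Typical use on the Pi would be: dmesg | check_mmc_debug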

marks5459 commented 9 years ago

Integral Ultima Pro 8GB class 10 (up to 20MB/s) cards are brilliant; I've hardly had any issues in 2 years of using them. They were out of stock one time, so I bought 2 different batches of Kingston cards from different suppliers. One of the suppliers was Scan Computers in Bolton, so it's not like they were clone cards. I had almost every one back within 3 months, and 50% of them can't be recognised by any device I put them in.

Cy4n1d3 commented 9 years ago

I've bought a 16GB Samsung Evo MicSD when I first got my RPi2 (was still running my good old RPi1 /w normal SD) and experienced the same issues as NitroG42 and johalareewi - after a certain (few, 1 to 3 were sufficient) amount of reboots, the system wouldn't boot up any longer due to mounting errors.

I was able to reproduce the issue pretty much reliably any time back then: install an image (doesn't matter if I used a fresh OE image or my 'old' backed-up RPi1 image with RPi2 kernel replacing the Pi1 kernel), do the usual setup stuff like config, addon installations, reboot. I even tried manual 'sync'ing and rebooting over SSH to be safe but after one to three reboots I encountered corruption anyway. As long as the Pi stayed powered on I was able to watch movies, tv shows, youtube and amazon prime without a hitch though... problems arose after the nightly power down or the aforementioned reboots. Didn't matter if I just did a fresh install from image and simply rebooted a few times afterwards or if I did a fresh install and updated to Milhouse's testbuilds using the tar-file - sooner or later I ended up with a corrupted SD card. I was in fact even backing up the whole system to a USB hard drive before rebooting due to the sheer reliability of system corruption occurring ;) After the dreaded card corruption I would then restore my backup which brought the system back to work until the next (or the one following that..) reboot.

I fixed the problem for now by buying a fresh Sandisk card which works without a sign of any flaws until today... got nearly mad at my RPi2 until I tried a 64GB Sandisk MicSD which worked flawlessly right from the first image install. At first I thought it might be power related but after testing 4 different power adapters (Nexus 10, Galaxy S5, generic 5V 2A adapter and a known brand adapter from a local electronics store) I kinda ruled that one out.

If there's anything I can do to maybe help debugging this one please don't hesitate to ask. I'd gladly use the Samsung Evo for my Pi, as it shows nearly doubled 4k writes in comparison to the Sandisk while retaining good 4k reads - I'd really like to run some real world usage scenario performance comparisons on the RPi2 using those cards :)

ernstblaauw commented 9 years ago

Hi Cy4n1d3 ,

If I understand correctly, you can help by testing the kernel that popcornmix posted in this thread or that is posted on OpenELEC's forum: http://openelec.tv/forum/124-raspberry-pi/75281-openelec-5-0-3-still-corrupts-sd-card-on-pi2?start=210#137875

I hope to test this kernel this weekend.

popcornmix commented 9 years ago

Yes, if anyone who is suffering corruption issues can test the kernel linked earlier, or try the OpenELEC test build, that would be very helpful.

Cy4n1d3 commented 9 years ago

I'll try and see if I can still reproduce the error tomorrow.

AlecEdworthy commented 9 years ago

For what it's worth I've installed the patched version of OpenElec 5.0.8 with bcm2835_mmc.mmc_debug=0x1fff set and have had the RasPi 2 go through 17 reboots (remotely triggered using SSH and a loop) without any issues (after 17 I stopped it because I thought that wasn't bad at all and wanted to get on with something else). I have a Transcend 16GB card (described as "Transcend Ultimate 16GB Micro SD (SDHC) Card - Class 10" at purchase, useful IDs from it are man:0x000074 oem:0x4a45 name:USD hwrev:0x0 fwrev:0x2) which previously would corrupt and refuse to boot after 3 or so boot ups (sometimes shutting it down overnight, others just shutting it down long enough to move the PSU to another socket).

OpenELEC:~ # cat /flash/cmdline.txt
boot=/dev/mmcblk0p1 disk=/dev/mmcblk0p2 quiet bcm2835_mmc.mmc_debug=0x1fff
OpenELEC:~ # dmesg | grep mmc-bcm
[    1.273284] mmc-bcm2835 3f300000.mmc: mmc_debug:1fff
[    1.273295] mmc-bcm2835 3f300000.mmc: Forcing PIO mode
OpenELEC:~ # 

Tomorrow or Sunday I'll disable the debug option and send it back through some reboots and see how many I get before corruption occurs...

Thank you to everyone who is working to fix this! :)

Alec

popcornmix commented 9 years ago

If you are happy with bcm2835_mmc.mmc_debug=0x1fff then I'd be interested if bcm2835_mmc.mmc_debug=0xfff is also good.

AlecEdworthy commented 9 years ago

Sadly I am not happy with bcm2835_mmc.mmc_debug=0xfff as I got four reboots before the RasPi froze at the initial OpenElec screen. Another power cycle and I get dropped to the debugging shell and an fsck is needed to fix it (lots of things it had to fix).

Don't get me wrong, it could be pure luck that 17 reboots passed without issue with 0x1fff and the 18th might have killed it just the same, but 17 compared to 4 is a big difference!

Since switching back to 0x1fff I have just survived a further 10 reboots without issue. Now bed beckons (early start in the morning). If there's any further testing you would like doing then let me know, not sure I'll be able to do any until Sunday now but fire away nevertheless and I'll do my best to oblige.

ernstblaauw commented 9 years ago

Hi AlecEdworthy, could you share your script for remotely rebooting the RPi2? That would save me a lot of time. Thanks!

popcornmix commented 9 years ago

Okay, if bcm2835_mmc.mmc_debug=0xfff doesn't work, can you confirm whether bcm2835_mmc.mmc_debug=0x1000 is okay?

Cy4n1d3 commented 9 years ago

I can still confirm the corruption issues I encountered when I first tried the Samsung EVO card.

Results so far:

OpenELEC:/var/log # dmesg | grep -i mmc-bcm2835
[ 1.421993] mmc-bcm2835 3f300000.mmc: mmc_debug:1fff
[ 1.425168] mmc-bcm2835 3f300000.mmc: Forcing PIO mode

OpenELEC:~ # dmesg | grep -i mmc-bcm2835
[ 1.301944] mmc-bcm2835 3f300000.mmc: mmc_debug:fff
[ 1.301956] mmc-bcm2835 3f300000.mmc: DMA channels allocated

OpenELEC:~ # dmesg | grep -i mmc-bcm2835
[ 1.301936] mmc-bcm2835 3f300000.mmc: mmc_debug:1000
[ 1.301947] mmc-bcm2835 3f300000.mmc: Forcing PIO mode

So 1fff seems best so far. If someone is willing to share a reboot loop script I will put those modes to further testing.

If you need further information or have more precise instructions please don't hesitate to ask @popcornmix !

popcornmix commented 9 years ago

For automatic rebooting (or running another command) with Raspbian, run sudo nano /etc/rc.local and add reboot just before the exit.

For openelec I would suggest looking here: http://wiki.openelec.tv/index.php/Autostart.sh
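For the Raspbian rc.local approach, a boot counter makes it easier to see how many reboots were survived. The following is only a sketch: the function name bump_boot_count and the counter path are assumptions, not anything shipped with Raspbian, and the counter lives on the same card that is under test.

```shell
# Hypothetical counter file; any writable path that survives a reboot would do.
COUNT_FILE="${COUNT_FILE:-/var/local/reboot_count}"

# Increment the stored boot count and print the new value.
bump_boot_count() {
  local n=0
  [ -f "$COUNT_FILE" ] && n="$(cat "$COUNT_FILE")"
  n=$((n + 1))
  echo "$n" > "$COUNT_FILE"
  echo "$n"
}

# In /etc/rc.local, after pasting the function above and just before the exit:
#   bump_boot_count
#   reboot
```

Removing the reboot line again (for example by editing the card on another machine) stops the cycle.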

ernstblaauw commented 9 years ago

Could someone provide a script that reboots the RPi2 over SSH every two minutes? The script could also count the number of times it reboots successfully. That would make the testing really easy.

popcornmix commented 9 years ago

@ernstblaauw Can you run the script from a linux machine (which could be another pi)? That's a little easier than from windows.

popcornmix commented 9 years ago

I've updated the kernel links above, and the links in OpenELEC thread. New build also has another debug option. Can you try:

bcm2708-dmaengine.dma_debug=0x1f

instead of the bcm2835_mmc.mmc_debug option and report the results.

ernstblaauw commented 9 years ago

@popcornmix, I'm running Linux Mint on my main desktop. I already tried sshpass -p "openelec" ssh root@192.168.0.60 'reboot' but it seems the ssh connection is not closed then. Therefore, it is difficult to count the number of (successful) reboots. Do you have an idea how to do this?

hertzg commented 9 years ago

@ernstblaauw I believe you could add && exit to the command to make sure it closes the ssh session after successfully executing the reboot

sshpass -p "openelec" ssh root@192.168.0.60 'reboot && exit'

AlecEdworthy commented 9 years ago

Hi,

OK for those running Linux or Mac OS X and wanting to do remote reboots of their RasPi running OpenElec you can do this,

  1. In a terminal window run ssh-keygen -t rsa -b 1024 -f ~/.ssh/id_rsa_openelec and when prompted for a passphrase either enter something memorable but secure (there are many guides on the Internet for passphrase creation) or just hit enter (i.e. don't set a passphrase). The former is more secure, the latter is more convenient for this testing (but you should stop the key from being a valid login token on the RasPi once your testing is complete if your RasPi is accessible to untrusted hosts, e.g. the Internet, see the end of the posting). If you choose to set a passphrase then you will need to use an SSH key agent to simplify logging in to the RasPi.
  2. You now need to copy the public half of the key over to the RasPi and put it in the appropriate file to permit remote access, you can do this with this command, cat ~/.ssh/id_rsa_openelec.pub | ssh root@192.168.2.7 "cat - >> .ssh/authorized_keys" when prompted enter openelec as the password (you should substitute 192.168.2.7 in that command and subsequent ones for the IP address of the RasPi or its hostname if appropriate)
  3. If you decided to add a passphrase to the SSH key then you now need to load your key agent and add the key to the key agent,
    user@your-mac:~$ eval `/usr/bin/ssh-agent`
    Agent pid 45403
    user@your-mac:~$ ssh-add .ssh/id_rsa_openelec
    Enter passphrase for .ssh/id_rsa_openelec: 
    Identity added: .ssh/id_rsa_openelec (.ssh/id_rsa_openelec)
    user@your-mac:~$ 
  4. You should now be able to log in to the RasPi using the SSH key without entering a password using the command ssh root@192.168.2.7 -i ~/.ssh/id_rsa_openelec and you should be able to issue remote commands automatically too, e.g.
     user@your-mac:~$ ssh root@192.168.2.7 -i .ssh/id_rsa_openelec whoami
     root
     user@your-mac:~$ 

The command I actually used to do the reboot cycles was,

RUN=0; while $(true); do RUN=$[$RUN+1]; echo Reboot cycle $RUN; ssh -i ~/.ssh/id_rsa_openelec root@192.168.2.7 reboot; sleep 10; CHK=1; while [[ $CHK -eq 1 ]]; do echo Checking if back; sleep 1; (ping -c 1 -t 1 192.168.2.7 2>&1 > /dev/null) && CHK=0; done; echo Openelec is back; sleep 20; done

That is a very long line so be careful with cutting and pasting etc. Basically what it breaks down into is,

Establish a variable to count the runs RUN and start a loop which never ends while $(true); do, which,

To exit the loop and stop the reboot cycle you need to press CTRL-C a few times - depending on which stage it's at the first CTRL-C will only exit part of the loop so press it three or four times, it won't harm anything to press it more than is necessary.
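For readers who find the one-liner hard to follow, here is a multi-line sketch of the same idea. It is a variant, not the exact command used above: it stops after a fixed number of cycles instead of looping forever, and it drops the ping -t option because that flag means different things on Mac OS X and Linux. The names HOST, KEY, DOWN_WAIT, SETTLE_WAIT, do_reboot, is_up and reboot_cycles are all illustrative.

```shell
#!/usr/bin/env bash
# Illustrative defaults; substitute the Pi's address and your key file.
HOST="${HOST:-192.168.2.7}"
KEY="${KEY:-$HOME/.ssh/id_rsa_openelec}"

# Ask the Pi to reboot over SSH.
do_reboot() { ssh -i "$KEY" "root@$HOST" reboot; }

# Succeeds once the Pi answers a single ping.
is_up() { ping -c 1 "$HOST" > /dev/null 2>&1; }

# Run the given number of reboot cycles, printing progress as it goes.
reboot_cycles() {
  local max="$1" run=0
  while [ "$run" -lt "$max" ]; do
    run=$((run + 1))
    echo "Reboot cycle $run"
    do_reboot
    sleep "${DOWN_WAIT:-10}"      # give the Pi time to go down
    until is_up; do sleep 1; done # wait for it to answer pings again
    echo "Openelec is back"
    sleep "${SETTLE_WAIT:-20}"    # let it finish booting before the next cycle
  done
}
```

Running reboot_cycles 20 would attempt twenty cycles. A Pi that hangs at boot never answers the ping, so the loop simply stalls at that cycle number; a ping reply on its own does not prove the filesystem is intact.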

To stop the SSH key you created from being allowed to log into the RasPi you need to delete it from /storage/.ssh/authorized_keys or delete the file in its entirety.
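If other keys are also stored in that file, deleting only the test key is safer than deleting the whole file. A minimal sketch, assuming the public key file has been copied to the same machine as the authorized_keys file; the function name remove_key is made up for this example.

```shell
# Hypothetical helper: remove the key in PUB from the authorized_keys file AUTH,
# keeping every other line.
remove_key() {
  local pub="$1" auth="$2" tmp
  [ -f "$pub" ] && [ -f "$auth" ] || return 1
  tmp="$(mktemp)"
  # -F: fixed strings, -x: match whole lines, -v: keep the lines that do NOT match.
  # grep exits 1 when no lines remain, which is acceptable here.
  grep -vxF -f "$pub" "$auth" > "$tmp" || true
  cat "$tmp" > "$auth"   # overwrite in place so the file keeps its permissions
  rm -f "$tmp"
}
```

On the RasPi this might be called as remove_key id_rsa_openelec.pub /storage/.ssh/authorized_keys after copying the .pub file over.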

Hope that helps, Alec

AlecEdworthy commented 9 years ago

Using the latest update (which I imaged onto the card and then restored my settings from backup) and bcm2708-dmaengine.dma_debug=0x1f I just managed three reboots before the system froze at boot and needed fscking. Fsck'd it, booted successfully and restarted the reboot cycle testing and it failed on the first reboot.

popcornmix commented 9 years ago

@AlecEdworthy Does bcm2835_mmc.mmc_debug=0x1000 work for you?

AlecEdworthy commented 9 years ago

Just used the same image but bcm2835_mmc.mmc_debug=0x1fff and got 13 reboots without issue. Now trying 0x1000...

AlecEdworthy commented 9 years ago

Looks like 0x1000 is fine from a corruption point of view. Just survived 13 reboots without issue. Getting around 7.2MB/sec write and 16.6MB/sec read speeds with that option FWIW.

popcornmix commented 9 years ago

@AlecEdworthy any hangs with that setting? (@Cy4n1d3 reporting some hanging on boot, but no corruption). 0x1000 disables DMA in the sdcard driver (so uses PIO mode). I had hoped the DMA wait states might produce a workaround. Can you confirm you see: bcm2708-dmaengine soc:dma@7e007000: dma_debug:1f in dmesg log when using bcm2708-dmaengine.dma_debug=0x1f?

AlecEdworthy commented 9 years ago

When running with bcm2708-dmaengine.dma_debug=0x1f I was getting the line you asked me to check for,

[    0.829979] bcm2708-dmaengine soc:dma@7e007000: dma_debug:1f

To recap,

Each time I get corruption what happens is the box reboots fine, freezes at the first page of the reboot (where it lists the OpenElec version) and sits there. If I then power it off and on again it boots to the debug console and requires extensive repairs using fsck (leading to lost settings quite frequently, I have had to restore from backup a couple of times).

A

popcornmix commented 9 years ago

There was another test bit added in last update. Try: bcm2708-dmaengine.dma_debug=0x1f bcm2835_mmc.mmc_debug=0x2000 which disables the MMC_QUIRK_BLK_NO_CMD23 quirk.

AlecEdworthy commented 9 years ago

Looks good, 13 reboots without issue using bcm2708-dmaengine.dma_debug=0x1f bcm2835_mmc.mmc_debug=0x2000, write speeds of 6.9MB/sec and read speeds of 14.9MB/sec (off one test using dd, /dev/zero and a 500MB file).

popcornmix commented 9 years ago

@AlecEdworthy that is interesting. Can you reduce dma_debug? e.g. Try 0x10, then 0x8 then 0x4 then 0x2 then 0x1 and then with it removed?
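Trying 0x10, 0x8, 0x4, 0x2 and 0x1 in turn amounts to testing one bit of the 0x1f mask at a time. A small sketch that lists the single-bit values contained in any mask; the function name single_bits is made up, and nothing here says what the individual bits do.

```shell
# Hypothetical helper: print each set bit of a mask as its own hex value,
# lowest bit first.
single_bits() {
  local mask=$(( $1 )) bit=1
  while [ "$bit" -le "$mask" ]; do
    if [ $(( mask & bit )) -ne 0 ]; then
      printf '0x%x\n' "$bit"
    fi
    bit=$(( bit << 1 ))
  done
}
```

For example, single_bits 0x1f prints 0x1, 0x2, 0x4, 0x8 and 0x10, one per line, which are the values to try for dma_debug.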

AlecEdworthy commented 9 years ago

@popcornmix I'll give them a try. Probably won't be until later this evening but I'll try to get some testing in today. I assume I keep mmc_debug set to 0x2000 while altering dma_debug?

popcornmix commented 9 years ago

Yes

Cy4n1d3 commented 9 years ago

Using bcm2708-dmaengine.dma_debug=0x1f while running the latest .img does not produce good results for me. I made a fresh setup, it rebooted once to partition the storage-partition and then it hung on the initial version display screen, didn't even get the mounting error / debug shell. After a few seconds the screen goes dark, nothing being displayed anymore. Same result after power cycling.

I've then changed the cmdline to bcm2708-dmaengine.dma_debug=0x1f bcm2835_mmc.mmc_debug=0x2000 using the same install, which let the system boot up. I configured SSH, logged in and verified the following lines inside dmesg:

[ 0.857419] bcm2708-dmaengine soc:dma@7e007000: dma_debug:1f
[ 1.302699] mmc-bcm2835 3f300000.mmc: mmc_debug:2000

Updated the addons and started rebooting, which allowed 3 reboots, after which I got a freeze after CEC adapter detection. Power cycling then once again allowed booting and another series of 13 reboots without bootup-corruption or system hangs.

Quick and dirty speed testing (/dev/zero, dd /w 1024 block size and 500 mb file size) revealed the following numbers on this scenario (average of three samples): Write: 7.96 MB/s Read: 14.46 MB/s

Did another reboot afterwards which also succeeded.
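The quick dd speed test mentioned in this thread (/dev/zero, 1024-byte blocks, 500 MB file) could look roughly like the sketch below. The exact commands used by the posters are not given, so the function names write_test and read_test and the file path are assumptions.

```shell
# Write SIZE_MB mebibytes of zeros to FILE in 1024-byte blocks, then flush to the card.
# dd reports its own throughput figure on stderr.
write_test() {
  local file="$1" size_mb="$2"
  dd if=/dev/zero of="$file" bs=1024 count=$(( size_mb * 1024 ))
  sync
}

# Read FILE back and discard the data.
# Note: a read straight after the write may be served from RAM cache, not the card.
read_test() {
  local file="$1"
  dd if="$file" of=/dev/null bs=1024
}
```

A run might be: write_test /storage/ddtest.bin 500, then read_test /storage/ddtest.bin, then delete the file. Because sync runs after dd has printed its figure, the write speed shown can be somewhat optimistic.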

AlecEdworthy commented 9 years ago

OK,

Where to next @popcornmix? A

popcornmix commented 9 years ago

@AlecEdworthy Thanks. For completeness running with 0x0 (or dma_debug removed) would be useful.

Also, with dma_debug removed, I'd be interested in:

bcm2835_mmc.mmc_debug=0xffff0000

AlecEdworthy commented 9 years ago

So just to confirm you want tests running with,

The dma_debug can be removed completely in both cases.

Kind regards, Alec

EDIT: "and" replaced with "can" in last line above.

popcornmix commented 9 years ago

Yes

ernstblaauw commented 9 years ago

Hi popcornmix,

This weekend I tested the setting bcm2835_mmc.mmc_debug=0x1fff on a 16GB Transcend: I did 20 reboots without issue. This is much better than the default values (3 times I got corruption during resizing; one time it survived the resizing and after that the RPi was able to reboot 6 times before corruption struck). A huge improvement!

I could do some more tests, but I lost track of which ones are currently needed. Could you provide an overview of which settings you want us to test?