raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

SDCard corruption on RPI2 #397

Closed NitroG42 closed 7 years ago

NitroG42 commented 9 years ago

Lots of people seem to be affected by an issue with the sdcard. I'm using a Samsung Evo 16 GB micro SD card with Raspbian, and I encounter corruption on the sd card every time. It's easy to reproduce:

I need to check on my linux system (I'm at work) if I can fix the card at this step or not. I flashed the raspbian img multiple times and it doesn't work.

It can also be reproduced just by making the RPi reboot multiple times from the terminal (using sudo reboot).

I have 3 of them, so I hope it's just a firmware bug (I'll try with another one to be sure it's not the sdcard itself).

Here are two threads that gathered this issue (without anyone creating a post here, though :/ ): http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=101183&p=703772&hilit=error+110#p703772 http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=98935

One post is interesting :

I am using Transcend UHS-I 1U 16GB Class 10 i have tried 4 of this card and same error with all four, i have also tried with 3 different Rpi2 and i could reproduce this error on all of them.

If you want card info, I can provide it, but you need to tell me what to run on which system, because I didn't find a way to print sdcard characteristics from Mac OS X.

popcornmix commented 9 years ago

I think 0xfff doesn't help. dma_debug doesn't help. mmc_debug=0x1000 is known to help, but is undesirable as a solution as it disables DMA, and so increases CPU usage.

mmc_debug=0x2000 is the most promising one, which seems to help without performance issues. It does disable a fix that was added for a specific sdcard, so it can't just be enabled as a default, but it is a setting we'd like to gather as much information on as possible.

mmc_debug=0xffff0000 is currently unconfirmed, but we'd like to know if it helps.

So, for now, please test: mmc_debug=0x2000 and then mmc_debug=0xffff0000

AlecEdworthy commented 9 years ago

@ernstblaauw With bcm2835_mmc.mmc_debug=0x1fff did you have issues (corruption, freezing etc) on the 21st reboot or just stop there because it was all going OK?

AlecEdworthy commented 9 years ago

OK @popcornmix,

Is there a way to force the boot process to pause at the initial screen and give me the debug terminal instead of going through a normal boot? I ask because I would like to force an fsck of the SD card but can't from the normal OpenElec screen because I can't unmount /storage. I've tried break=load_modules and break=check_disks in /flash/cmdline.txt but neither worked (both booted as normal).

Alec

popcornmix commented 9 years ago

Okay, if bcm2835_mmc.mmc_debug=0xffff0000 is too slow, try bcm2835_mmc.mmc_debug=0x7f7f0000 or bcm2835_mmc.mmc_debug=0x3f3f0000 or bcm2835_mmc.mmc_debug=0x1f1f0000 until you get a usable sort of speed.
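For anyone swapping these values in and out repeatedly, a small helper along these lines might save a few typos. It is only a sketch: set_kernel_param is a made-up name, not part of any tool, and it assumes cmdline.txt is a single line of space-separated parameters (the sample line here is illustrative).

```shell
# Hypothetical helper: print a kernel command line with NAME=VALUE set,
# dropping any earlier NAME=... entry. Assumes space-separated parameters
# that contain no glob characters.
set_kernel_param() {
  cmdline=$1
  name=$2
  value=$3
  out=""
  for word in $cmdline; do
    case $word in
      "$name"=*) ;;                      # skip the old setting
      *) out="${out:+$out }$word" ;;     # keep every other parameter
    esac
  done
  printf '%s\n' "${out:+$out }$name=$value"
}

# Illustrative input line; replace with the contents of your own cmdline.txt.
set_kernel_param "boot=/dev/mmcblk0p1 disk=/dev/mmcblk0p2 bcm2835_mmc.mmc_debug=0xffff0000" \
  bcm2835_mmc.mmc_debug 0x7f7f0000
```

The result still has to be written back to cmdline.txt by hand (on OpenELEC the /flash partition may need remounting read-write first), so check the printed line before rebooting.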

AlecEdworthy commented 9 years ago

Following up on my earlier post, with bcm2835_mmc.mmc_debug=0xffff0000 after four or so reboots it looked like some of my settings had been damaged (sound effects were turned back on, overscan was disabled). The system was however still booting at that point. I aborted after 9 reboots (but it was still slowly cycling) and am doing a full reformat (using the SD Association's Formatter tool) before re-imaging, restoring from backup and then trying the other options. Might not get around to the other options tonight I'm afraid :-(

A

MilhouseVH commented 9 years ago

I've tried break=load_modules and break=check_disks in /flash/cmdline.txt but neither worked (both booted as normal).

Add debugging to cmdline.txt as well.

Also, textmode (add to cmdline.txt) is useful if you don't want to load Kodi and just want OpenELEC to boot into a console (although /storage will be mounted).

ernstblaauw commented 9 years ago

Hi, Below you'll find my test results, including my earlier reported findings.

Default cmdline: corrupted 3 times during resizing

bcm2835_mmc.mmc_debug=0x1fff

# dmesg | grep mmc-bcm
[    1.302113] mmc-bcm2835 3f300000.mmc: mmc_debug:1fff
[    1.302124] mmc-bcm2835 3f300000.mmc: Forcing PIO mode

20 reboots: no crash

bcm2835_mmc.mmc_debug=0x2000

# dmesg | grep mmc-bcm
[    1.302496] mmc-bcm2835 3f300000.mmc: mmc_debug:2000
[    1.302509] mmc-bcm2835 3f300000.mmc: DMA channels allocated

20 reboots: no crash

bcm2835_mmc.mmc_debug=0xffff0000 Boots really slowly, I aborted this one

bcm2835_mmc.mmc_debug=0x7f7f0000 Boots into Kodi, but not very fast. Wifi did not come up, so I couldn't test this via ssh

bcm2835_mmc.mmc_debug=0x3f3f0000

# dmesg | grep mmc-bcm
[    1.302822] mmc-bcm2835 3f300000.mmc: mmc_debug:3f3f0000
[    1.302834] mmc-bcm2835 3f300000.mmc: DMA channels allocated

It looks quite slow (for sure it boots much slower than 0x2000). 10 reboots: no crash

To be precise: no crash means I stopped the testing by hand and thus no corruption took place. I used the following test command:

RUN=0
while true; do
  RUN=$((RUN + 1))
  echo "Reboot cycle $RUN"
  # Schedule the reboot in the background so the ssh session can exit cleanly
  sshpass -p "openelec" ssh root@192.168.0.60 '({ sleep 2; reboot; } >/dev/null &) ; exit'
  sleep 10
  # Poll until the Pi answers a ping again
  CHK=1
  while [[ $CHK -eq 1 ]]; do
    echo "Checking if back"
    sleep 1
    ping -c 1 -t 1 192.168.0.60 >/dev/null 2>&1 && CHK=0
  done
  echo "Openelec is back"
  sleep 20
done

AlecEdworthy commented 9 years ago

Do you need/want repeated reboot tests with 0x3f3f0000 or 0x1f1f0000?

A

popcornmix commented 9 years ago

I'd like to find out the smallest delays that avoid the corruption. Assuming 0x1f1f0000 doesn't corrupt, then continue with 0x0f0f0000, 0x08080000, 0x04040000, 0x02020000, 0x01010000. The lower numbers will be better performance, but I'd imagine at some point you'll start seeing corruption. Hopefully it will be at a small enough number that performance isn't measurably affected.

AlecEdworthy commented 9 years ago

@popcornmix how would you suggest measuring the performance hit? Are you thinking in terms of data throughput or overall system responsiveness? I suspect there will be a big difference between the two, especially where one person is predominantly streaming (like me) compared to another who is playing media from the SD card itself.

A

popcornmix commented 9 years ago

As long as we get below the 0x0f0f0000 numbers I suspect it won't be significant (that is 15us per sdcard host control register access - the actual data goes over dma, so there will only be a few of those per sector). Just finding the smallest values that don't corrupt is the key piece of information.

We can then do performance tests under raspbian (e.g. sudo hdparm -t /dev/mmcblk0 or Bonnie++) to be sure, but I suspect it won't be an issue.
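To make the arithmetic behind "0x0f0f0000 is 15us" easier to follow, here is a rough decoder for the values being passed around. The field layout is an assumption inferred from the comments in this thread, not taken from the driver source: the top two bytes are treated as two separate delays in microseconds, and the low 16 bits as debug flags. decode_mmc_debug and the delay_a/delay_b labels are invented for illustration.

```shell
# Split an mmc_debug value into the fields discussed in this thread.
# Assumed layout (inferred, not confirmed against the driver):
#   bits 31-24: first delay in microseconds  (labelled delay_a here)
#   bits 23-16: second delay in microseconds (labelled delay_b here)
#   bits 15-0 : debug flags such as 0x1000 and 0x2000
decode_mmc_debug() {
  val=$(($1))
  delay_a=$(( (val >> 24) & 0xff ))
  delay_b=$(( (val >> 16) & 0xff ))
  flags=$(( val & 0xffff ))
  printf 'delay_a=%dus delay_b=%dus flags=0x%04x\n' "$delay_a" "$delay_b" "$flags"
}

decode_mmc_debug 0x0f0f0000
decode_mmc_debug 0x001f0000
decode_mmc_debug 0x2000
```

Under that assumption, 0x0f0f0000 decodes to 15us in both positions, while 0x001f0000 sets only the second delay (31us) and leaves the first at zero.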

AlecEdworthy commented 9 years ago

Thanks @popcornmix I will get back to you later tonight I hope with data about potential corruption.

AlecEdworthy commented 9 years ago

0x1f1f0000 and 0x0f0f0000 went fine, 20 reboots, no corruption, md5sum of the /flash and /storage folder structures showed no files had altered unexpectedly (i.e. beyond log files, and other files which are modified on boot etc).

0x08080000 appears to have caused problems following the fourth reboot in a row. The power LED is on solid (as you'd expect), the activity light was mostly solid with the occasional flicker and remained like that for around four minutes, now the activity light has gone off leaving just the power LED. No output on the HDMI at all. I'll investigate and report back further...

...manually power cycled and it came back without issue. Looking around the filesystem showed no damage. Restarted the reboot cycles and the next reboot froze at the initial OpenElec screen (OpenELEC (your) - Version: 5.0.8 [Build #0418]). Waiting to see if it moves on, remains as it is or goes dark...

...remained frozen at that stage. Power cycled and it came back, no harm apparent. Another three reboots and it froze again. Power cycled and it came back again with no harm apparent. After this it completed the remaining reboot cycles without freezes to bring it to 20 in total and at the end of them showed no altered files through the md5sum check. I've popped the SD card out to run an fsck using a separate RasPi to check for any unseen issues and none were found. On with 0x04040000 :)
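For anyone wanting to repeat the md5sum comparison, something like the following would do it. This is a sketch rather than the exact commands used in these tests: make_manifest and changed_files are invented names, and it assumes md5sum, find, diff and sed are available on the system doing the check.

```shell
# Write a sorted "checksum  path" list for every regular file under a directory.
make_manifest() {
  ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# Print each path whose checksum differs between two manifests,
# or that appears in only one of them.
changed_files() {
  diff "$1" "$2" | sed -n 's/^[<>] [0-9a-f]*  //p' | sort -u
}

# Intended use (paths are examples only):
#   make_manifest /storage > /tmp/before.md5
#   ... run the reboot cycles ...
#   make_manifest /storage > /tmp/after.md5
#   changed_files /tmp/before.md5 /tmp/after.md5
```

Log files and anything else rewritten on boot will always show up in the output, so the list still needs reading by eye.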

A

AlecEdworthy commented 9 years ago

0x04040000 isn't looking good. Two reboots and it's frozen on the initial OpenElec screen. Two subsequent cold boots and it still won't get beyond there (not even to a debug console for me to run fsck et al).

Set debugging and break=load_modules and got to a debug console. fsck showed no signs of trouble. Edited cmdline.txt to read,

boot=/dev/mmcblk0p1 disk=/dev/mmcblk0p2 bcm2835_mmc.mmc_debug=0x04040000 debugging nosplash progress

rebooted and got to OpenElec without issue. No signs of unexpected filesystem change (comparing md5sums). Restarted the reboot cycles with cmdline.txt as shown above (so I could monitor progress) and the next reboot froze at Starting Kodi sources Setup....

Alec

AlecEdworthy commented 9 years ago

@MilhouseVH - thanks for the tip about adding debugging to cmdline.txt, worked a treat (but you know it would already of course ;-)

A

ernstblaauw commented 9 years ago

Hi,

I was wondering what the best test method is. As far as I understand, we have not yet identified the root cause. Thus it is likely we'll find a setting that is fast and reliable in our tests, but will still corrupt in a week's or a month's time. Or am I too pessimistic?

AlecEdworthy commented 9 years ago

Four more reboots with 0x04040000 (total 8 so far) and a freeze, cold boot and got to OpenElec, next reboot froze, cold boot and it got to OpenElec as expected. From here the reboots to bring me to 19 passed without issue but the 20th froze at Starting Kodi hacks... and required a manual power cycle after which it booted to Kodi and showed no signs of corruption (according to fsck on another RasPi).

0x02020000 testing underway (with debugging, nosplash and progress enabled to make it easier to monitor progress).

A

popcornmix commented 9 years ago

When you decide which is the lowest setting that is reliable, perhaps 0x0f0f0000, can you try 0x0f000000 and 0x000f0000? That will determine which of the two places the delay is inserted is the critical one.

AlecEdworthy commented 9 years ago

0x02020000 froze (at Starting Kodi sources setup...) on the 11th and 17th reboots. A cold boot fixed it after the 11th reboot but the 17th was fatal and needed fairly extensive fsck work on another Pi (log kept if you're interested).

To be safe I've re-imaged and restored from backup before carrying out the 0x0f0f0000 and 0x0f000000 testing (underway now).

Is this issue only likely to manifest itself when using a vulnerable micro-SD card in the on-board micro-SD card slot or could we see the same sort of corruption when using a vulnerable micro-SD card in a USB based micro-SD card reader? I'm assuming it's also limited to the RasPi 2 model?

Alec

popcornmix commented 9 years ago

I wouldn't expect to see problems with that sdcard in a USB adapter. It's an issue with the bcm2835-mmc driver and certain sdcards. We suspect that the problem is worse on Pi2 due to the higher speed allowing sdcard accesses to occur closer together.

AlecEdworthy commented 9 years ago

OK, with 0x0f0f0000 I got a freeze after 4 reboots which was not fixable (e2fsck reported the journal version was not supported by this e2fsck). Re-imaged and soon to be trying 0x000f0000.

A

AlecEdworthy commented 9 years ago

0x000f0000 gave issues too so I moved back to 0x0f0f0000 to re-test it and that caused issues too after a few reboots. I've retracted even further to 0x1f1f0000 and that has been stable over 30 reboots or so now. I'll leave it rebooting overnight (with a five minute delay between them this time rather than the 30 seconds or so I've been using) and see if it remains stable over a longer period.

I wonder how many years I've in effect added to my micro-SD card and RasPi's lives with all this testing, rebooting etc. etc. (given they're probably not made for rebooting quite as frequently as I've been doing during this testing)...

A

lurch commented 9 years ago

I've been watching with interest the progress that's being made in this thread - very impressive amount of debugging effort you're putting in @AlecEdworthy :-)

@popcornmix When he finds the "optimal" setting, will that then fix it for all Pis and (micro)SD-cards, or is it possible that different cards from different manufacturers will fail / succeed with different debug (delay) values?

AlecEdworthy commented 9 years ago

OK, so overnight my RasPi carried out over 80 reboots without issue with bcm2835_mmc.mmc_debug=0x1f1f0000 so given the instability (no corruption but freezes on boot) I've seen with lower values this leads me to suggest that this is currently as low as we can go for stability.

EDIT: In for a penny, in for a pound, I've taught my other RasPi how to remotely reboot my OpenElec RasPi and set it on the 5 minute reboot cycle with the OpenElec RasPi running with mmc_debug:f0f0000 to see if a more conservative reboot cycle causes fewer issues while I'm out at work. It automatically stops the cycle if the RasPi takes more than three minutes to come back.

EDIT2: Well that was short lived. Two reboots and it froze part way through the start up. I stick by my original statement, bcm2835_mmc.mmc_debug=0x1f1f0000 is as far as we can go and maintain (perfect?) stability.

A

popcornmix commented 9 years ago

@lurch at the moment we're still gathering information. These debug delays won't necessarily be the final fix. The fact that delays help suggests it is not a logical bug, but is probably a timing bug where we are doing something some sdcards do not like (perhaps violating the minimum delay between a cmd X and cmd Y being sent to sdcard).

If we understand the problem fully, then we will likely know the exact delay required, so it should work for all sdcards. If we don't then we'll have to go with a fix that cures all the tested sdcards and if other sdcards have issues in the future we may need further tweaking. Obviously the more users that can help test now the better.

@AlecEdworthy your Pi won't suffer any ill effect from frequent rebooting. The sdcard does have a lifespan limited by its number of writes, but this is likely to be of the order of 100k. I'm hoping we'll resolve this issue long before that is approached...

AlecEdworthy commented 9 years ago

Thanks for the reassurance @popcornmix. Not too worried about the SD card but I was starting to wonder if the repeated warm restarts (and the less frequent cold power ons) might take their toll on the Pi from a sort of maximum actuations (power-up, power-down, power-up, power-down...) point of view (thinking of it like a switch). Again I'm guessing it's rated in the hundreds of thousands if not millions of cycles before any real chance of failure.

Am I right in thinking that I've got about as far as I can with the testing for now @popcornmix?

On a related note, does the boot option debugging have any effect on the Pi and its running beyond enabling the break= boot options? i.e. does putting debugging in cmdline.txt (without any break= option) have a potential to affect testing beyond just allowing you to review the progress of the boot sequence? I know progress and nosplash provide that access too but I wasn't sure if it was necessary to remove debugging when trying to match normal running conditions as closely as possible? I've tended to leave it in to avoid having to put it back in each time I needed to break the boot sequence out in order to do fsck'ing etc. but perhaps I should have removed it each time along with the break= option...

A

popcornmix commented 9 years ago

@AlecEdworthy I'd still like to know if 0x1f000000 or 0x001f0000 is reliable. I suspect only one of the delays is required.

MilhouseVH commented 9 years ago

On a related note, does the boot option debugging have any effect on the Pi and its running beyond enabling the break= boot options?

It will cause debug information to be logged in journalctl and kodi.log.

AlecEdworthy commented 9 years ago

@MilhouseVH Thanks for clarifying that, I'll drop it unless I need the break= options then.

@popcornmix I know what I'll have my Pi doing while revising my PRINCE2 learning tonight then... ;-)

AlecEdworthy commented 9 years ago

OK, 0x1f000000 survived 6 reboots before sufficient corruption to cause fsck to be unable to fix /dev/mmcblk0p2 (/storage).

A

AlecEdworthy commented 9 years ago

0x001f0000 has just survived 20 sequential reboots without issue. @popcornmix should I keep testing this value with more reboots or is there refinement to the value you'd prefer testing?

A

popcornmix commented 9 years ago

I think 20 good reboots sounds like enough. You could try whittling down 0x1f000000 a little (e.g. 0x18000000 and then 0x14000000 if the first one works, and 0x1c000000 if it fails), but that's not critical.

I've got some extra tests to narrow it further, but that needs a new OE build. I'll make the changes and kick that off...

AlecEdworthy commented 9 years ago

@popcornmix I assume you mean whittle down 0x00f10000 (which works OK after 44 reboots now) rather than 0x1f000000 (which corrupted after 6)?

popcornmix commented 9 years ago

Correct.

AlecEdworthy commented 9 years ago

@popcornmix Cool, though it would help if I could get my 1's and f's around the right way too, 0x001f0000 I meant ;-)

popcornmix commented 9 years ago

Links in OE forum have been updated to a new build. New debug option bcm2835_mmc.mmc_debug2 added to disable some of the delays.

There are 10 calls to the write register function that calls the delay function. You disable the delay by setting bits in mmc_debug2. I'm hoping only one delay is actually required.

So, if you could sanity check: bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x0 should behave well and bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x3ff should corrupt. (all delays disabled)

Confirm that is true. You should be able to binary chop the bits in mmc_debug2. E.g. try 0x1f. If that corrupts, clear some bits, e.g. 0x3. If it succeeds set some bits, e.g. 0x7f. Ideally in four iterations you will find a one-bit change that switches from corrupting to not-corrupting.

If you are unclear what I'm asking, then try 0x1f and let me know the outcome and I'll suggest the next value to try.
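In script form the binary chop might look something like this. It is only a sketch under a big assumption that nothing in this thread has confirmed: that exactly one of the 10 delays matters. corrupts is a hypothetical stand-in for the real test (a run of reboots with the given mmc_debug2 value), simulated here with a made-up REQUIRED_BIT so the search can be followed end to end.

```shell
# Simulated test result: "corrupts" succeeds when the mask disables the one
# delay that is actually needed. REQUIRED_BIT is invented for this demo;
# in reality each call would be a manual series of reboots.
REQUIRED_BIT=6
corrupts() {
  [ $(( ($1 >> REQUIRED_BIT) & 1 )) -eq 1 ]
}

# Binary search over bits 0..9 of mmc_debug2 for the single required delay.
find_required_bit() {
  lo=0
  hi=9
  steps=0
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    # Disable only the delays numbered lo..mid
    mask=$(( ((1 << (mid + 1)) - 1) & ~((1 << lo) - 1) ))
    steps=$((steps + 1))
    printf 'try mmc_debug2=0x%x\n' "$mask" >&2
    if corrupts "$mask"; then
      hi=$mid            # the needed delay is among the ones just disabled
    else
      lo=$((mid + 1))    # the needed delay is among the ones left enabled
    fi
  done
  echo "bit=$lo steps=$steps"
}

find_required_bit
```

With the simulated bit 6, the masks tried are 0x1f, 0xe0, 0x60 and 0x20. If more than one delay turns out to be needed, this search gives a misleading answer, and each candidate mask would have to be checked against its inverse instead.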

AlecEdworthy commented 9 years ago

@popcornmix been trying values of mmc_debug and have come to the conclusion that,

I then started making intermediate values and 0x001b0000, 0x001a0000 and 0x00190000 were good but 0x00180000 again caused issues.

I'll take a look at the new OE build.

A

AlecEdworthy commented 9 years ago

@popcornmix The .img link in the OE forum seems to point to the tar file update package not an image, is that deliberate and do you plan to make a .img available please (makes it easier to fix corruption if I can just re-image rather than re-image and then have to update too).

A

popcornmix commented 9 years ago

Try now.

AlecEdworthy commented 9 years ago

Thank you! Trying bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x0 now :)

AlecEdworthy commented 9 years ago

OK, 20 reboots with bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x0 has gone without a hitch. Moving to bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x3ff to sanity check that end of the scale then I'll start picking intermediate values. I think I understand your comment about binary chopping the bits and finding a solution in four iterations :)

AlecEdworthy commented 9 years ago

Sanity check confirmed,

I'll re-image (to ensure no damaged data remains) and start binary chopping :)

AlecEdworthy commented 9 years ago

OK, more info,

Sanity checking,

Binary chopping,

Given mmc_debug2=0x3 seems the highest value we can achieve without freezes I'll re-image the card again and then kick off an overnight reboot test with this value (unless you suggest other testing instead/first).

Alec

AlecEdworthy commented 9 years ago

Overnight testing with bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x3 resulted in corruption after 106 cycles. I added in an SD card performance test (write 500MB file, read 500MB file, delete 500MB file) which saw a consistent 10MB/sec write and 16MB/sec read and added an additional 30 second pause leaving the cycle time at 2m30s approximately. I've kicked off the same test using bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x0 to soak it instead.

A

Cy4n1d3 commented 9 years ago

bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x0 ran fine for a 4-hour reboot loop. I did not count the number of reboots, though; it should have been an awful lot.

bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x3ff froze during boot three times, didn't try any further then.

bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x1f did get a bit further than 0x3ff but freezes before being able to mount the file systems in all three attempted boots.

I must admit I'm not really up to par in chopping binaries, further instructions on what exactly I might be testing next would be helpful :) Especially as my card / Pi2 seems to be a bit more picky than Alec's..

popcornmix commented 9 years ago

We know that if all bits of mmc_debug2 are zero, then things are good, and if all bits of mmc_debug2 are ones then it corrupts. We want to find the value of mmc_debug2 that works well, and has the most bits set. I'm hoping only one of the delays is required, which would mean one of the values with 9 bits set. If 0x1f is a failure, then you could sanity check the inverse of that, which should pass. Try 0x3e0.

AlecEdworthy commented 9 years ago

I've had 153 successful reboots under bcm2835_mmc.mmc_debug2=0x0 so I am sure that is safe as houses.

Given I have had issues with bcm2835_mmc.mmc_debug2=0x3 (so in binary terms 0000000011) I have set bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x2 (so 0000000010) to see if I can determine which bit of these gave me problems. Do we then need to start slicing and dicing the higher bits (e.g. setting 0b1111000000 aka 0x3c0 etc)?

A

AlecEdworthy commented 9 years ago

@Cy4n1d3 What make and model SD card are you using? Mine is a Transcend Ultimate 16GB Micro SD (SDHC) Card, man:0x000074 oem:0x4a45 name:USD hwrev:0x0 fwrev:0x2

That list of details can be obtained using the commands,

cd /sys/class/mmc_host/mmc?/mmc?:*
echo "man:$(cat manfid) oem:$(cat oemid) name:$(cat name) hwrev:$(cat hwrev) fwrev:$(cat fwrev)"

Alec

popcornmix commented 9 years ago

@AlecEdworthy

Given I have had issues with bcm2835_mmc.mmc_debug2=0x3 (so in binary terms 0000000011) I have set bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x2 (so 0000000010) to see if I can determine which bit of these gave me problems. Do we then need to start slicing and dicing the higher bits (e.g. setting 0b1111000000 aka 0x3c0 etc)?

If you believe 0x3 is bad, then it's worth checking the inverse 0x3fc is good. If they are both bad then it suggests there is more than one place that needs the delay.
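The inverse values can be worked out with a couple of lines of shell arithmetic. This is just a convenience sketch: invert_mask and count_bits are invented names, and the 10-bit width comes from the 10 delay call sites mentioned for mmc_debug2.

```shell
# Flip all 10 mmc_debug2 bits: delays that were disabled become enabled
# and vice versa.
invert_mask() {
  printf '0x%x\n' $(( ~($1) & 0x3ff ))
}

# Count how many delays a mask disables (number of set bits).
count_bits() {
  n=$(($1))
  c=0
  while [ "$n" -ne 0 ]; do
    c=$(( c + (n & 1) ))
    n=$(( n >> 1 ))
  done
  echo "$c"
}

invert_mask 0x3     # inverse of the value reported as bad
invert_mask 0x1f
count_bits 0x3fc
```

The first two calls print 0x3fc and 0x3e0, matching the inverses suggested in this thread. count_bits is handy for checking how close a mask is to the hoped-for "9 bits set" case.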

Cy4n1d3 commented 9 years ago

bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x3e0 does at least boot up in most of my attempts so far but still had a freeze on boot for me. Couldn't reproduce the freeze in ~20 reboots though.

@AlecEdworthy - I'm running a Samsung 16GB Evo Cl10 UHS-1 for these tests. These are the details, thanks for the c&p commands :) man:0x00001b oem:0x534d name:00000 hwrev:0x1 fwrev:0x0