Closed NitroG42 closed 7 years ago
I think 0xfff doesn't help. dma_debug doesn't help. mmc_debug=0x1000 is known to help, but is undesirable as a solution as it disables DMA, and so increases CPU usage.
mmc_debug=0x2000 is the most promising one, which seems to help without performance issues. It does disable a fix that was added for a specific sdcard, so it can't just be enabled as a default, but it is a setting we'd like to gather as much information on as possible.
mmc_debug=0xffff0000 is currently unconfirmed, but we'd like to know if it helps.
So, for now, please test:
mmc_debug=0x2000
and then mmc_debug=0xffff0000
@ernstblaauw With bcm2835_mmc.mmc_debug=0x1fff did you have issues (corruption, freezing etc) on the 21st reboot or just stop there because it was all going OK?
OK @popcornmix,
bcm2835_mmc.mmc_debug=0x2000 on its own ran wonderfully, 14 reboots without freezes or corruption issues.
bcm2835_mmc.mmc_debug=0xffff0000 on its own is running very badly. From issuing the reboot command to getting the rainbow loading screen took 60 seconds, from the rainbow screen to the networking stack being up was another 132 seconds and the OpenElec main screen finally appeared a total of 296 seconds (almost five minutes) after the rainbow screen. During the boot the animated Kodi splash screen stuttered a number of times. I might try some reboot cycles later tonight but given the six minute cycle time it may be a little tricky...
Is there a way to force the boot process to pause at the initial screen and give me the debug terminal instead of going through a normal boot? I ask because I would like to force an fsck of the SD card but can't from the normal OpenElec screen because I can't unmount /storage. I've tried break=load_modules and break=check_disks in /flash/cmdline.txt but neither worked (both booted as normal).
Alec
Okay, if bcm2835_mmc.mmc_debug=0xffff0000 is too slow, try bcm2835_mmc.mmc_debug=0x7f7f0000 or bcm2835_mmc.mmc_debug=0x3f3f0000 or bcm2835_mmc.mmc_debug=0x1f1f0000 until you get a usable sort of speed.
Following up on my earlier post, with bcm2835_mmc.mmc_debug=0xffff0000 after four or so reboots it looked like some of my settings had been damaged (sound effects were turned back on, overscan was disabled). The system was however still booting at that point. I aborted after 9 reboots (but it was still slowly cycling) and am doing a full reformat (using the SD Association's Formatter tool) before re-imaging, restoring from backup and then trying the other options. Might not get around to the other options tonight I'm afraid :-(
A
I've tried break=load_modules and break=check_disks in /flash/cmdline.txt but neither worked (both booted as normal).
Add debugging to cmdline.txt as well.
Also, textmode (add to cmdline.txt) is useful if you don't want to load Kodi and just want OpenELEC to boot into a console (although /storage will be mounted).
Hi, Below you'll find my test results, including my earlier reported findings.
default cmdline 3 times corrupt during resizing
bcm2835_mmc.mmc_debug=0x1fff
# dmesg | grep mmc-bcm
[ 1.302113] mmc-bcm2835 3f300000.mmc: mmc_debug:1fff
[ 1.302124] mmc-bcm2835 3f300000.mmc: Forcing PIO mode
20 reboots: no crash
bcm2835_mmc.mmc_debug=0x2000
# dmesg | grep mmc-bcm
[ 1.302496] mmc-bcm2835 3f300000.mmc: mmc_debug:2000
[ 1.302509] mmc-bcm2835 3f300000.mmc: DMA channels allocated
20 reboots: no crash
bcm2835_mmc.mmc_debug=0xffff0000 Boots really slowly, I aborted this one
bcm2835_mmc.mmc_debug=0x7f7f0000 Boots into Kodi, but not very fast. Wifi did not come up, so I couldn't test this via ssh
bcm2835_mmc.mmc_debug=0x3f3f0000
# dmesg | grep mmc-bcm
[ 1.302822] mmc-bcm2835 3f300000.mmc: mmc_debug:3f3f0000
[ 1.302834] mmc-bcm2835 3f300000.mmc: DMA channels allocated
It looks quite slow (for sure it boots much slower than 0x2000). 10 reboots: no crash
To be precise: no crash means I stopped the testing by hand and thus no corruption took place. I used the following test command:
RUN=0
while true; do
  RUN=$((RUN+1)); echo "Reboot cycle $RUN"
  # Schedule the reboot in the background so the ssh session can exit cleanly
  sshpass -p "openelec" ssh root@192.168.0.60 '({ sleep 2; reboot; } >/dev/null &) ; exit '
  sleep 10
  # Poll once a second until the Pi answers ping again
  CHK=1
  while [[ $CHK -eq 1 ]]; do
    echo Checking if back; sleep 1
    (ping -c 1 -t 1 192.168.0.60 > /dev/null 2>&1) && CHK=0
  done
  echo Openelec is back; sleep 20
done
bcm2835_mmc.mmc_debug=0xffff0000
Painfully slow, no corruption or complete freezes (over 10 reboots) but unusable really (five minutes from rainbow to interface).
bcm2835_mmc.mmc_debug=0x2000
No freezes or corruption (14 reboots), 13 seconds between rainbow pixels and OpenElec main interface, used for quite a while yesterday and the interface felt fine.
bcm2835_mmc.mmc_debug=0x7f7f0000
No freezes or corruption (14 reboots), 51 seconds between rainbow pixels and OpenElec main interface (one timing but all reboots felt appreciably slower than 0x2000 and those below), interface was occasionally a little lumpy (scrolling between menu items would pause for a moment every now and again, perhaps one in 10 pings would take around 500ms instead of sub-5ms).
bcm2835_mmc.mmc_debug=0x3f3f0000
Not done in-depth reboots for corruption tests yet, 15 seconds between rainbow pixels and OpenElec main interface (one try), not tested interface really but felt OK.
bcm2835_mmc.mmc_debug=0x1f1f0000
Not done in-depth reboots for corruption tests yet, 13 seconds between rainbow pixels and OpenElec main interface (one try), interface feels responsive, all pings sub-20ms, most sub-5ms.
Do you need/want repeated reboot tests with 0x3f3f0000 or 0x1f1f0000?
A
I'd like to find out the smallest delays that avoid the corruption.
Assuming 0x1f1f0000 doesn't corrupt, then continue with 0x0f0f0000, 0x08080000, 0x04040000, 0x02020000, 0x01010000. The lower numbers will be better performance, but I'd imagine at some point you'll start seeing corruption. Hopefully it will be at a small enough number that performance isn't measurably affected.
@popcornmix how would you suggest measuring the performance hit? Are you thinking in terms of data throughput or overall system responsiveness? I suspect there will be a big difference between the two, especially where one person is predominantly streaming (like me) compared to another who is playing media from the SD card itself.
A
As long as we get below the 0x0f0f0000 numbers I suspect it won't be significant (that is 15us per sdcard host control register access - the actual data goes over dma, so there will only be a few of those per sector). Just finding the smallest values that don't corrupt is the key piece of information.
We can then do performance tests under raspbian (e.g. sudo hdparm -t /dev/mmcblk0 or Bonnie++) to be sure, but I suspect it won't be an issue.
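To make the "15us per register access" arithmetic concrete, here is a minimal Python sketch that splits an mmc_debug value into its two upper bytes. It assumes, based only on the 0x0f0f0000 = 15us remark above, that each byte is a delay in microseconds; the function name is hypothetical and the thread does not say which byte applies to which delay site.

```python
def decode_mmc_debug_delays(value):
    """Split the upper 16 bits of mmc_debug into two 8-bit fields.

    Assumption (not confirmed by the driver source here): each byte is a
    delay in microseconds for one of the two places a delay is inserted.
    """
    high_byte = (value >> 24) & 0xFF  # bits 31-24
    low_byte = (value >> 16) & 0xFF   # bits 23-16
    return high_byte, low_byte


for v in (0xFFFF0000, 0x1F1F0000, 0x0F0F0000, 0x001F0000):
    hi, lo = decode_mmc_debug_delays(v)
    print(f"mmc_debug=0x{v:08x}: bits 31-24 = {hi} us, bits 23-16 = {lo} us")
```

Under that assumption the values tested in this thread step the delay down roughly by halves, which is why each step is expected to cost less performance than the one before.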
Thanks @popcornmix I will get back to you later tonight I hope with data about potential corruption.
0x1f1f0000 and 0x0f0f0000 went fine, 20 reboots, no corruption, md5sum of the /flash and /storage folder structures showed no files had altered unexpectedly (i.e. beyond log files, and other files which are modified on boot etc).
0x08080000 appears to have caused problems following the fourth reboot in a row. The power LED is on solid (as you'd expect), the activity light was mostly solid with the occasional flicker and remained like that for around four minutes, now the activity light has gone off leaving just the power LED. No output on the HDMI at all. I'll investigate and report back further...
...manually power cycled and it came back without issue. Looking around the filesystem showed no damage. Restarted the reboot cycles and the next reboot froze at the initial OpenElec screen (OpenELEC (your) - Version: 5.0.8 [Build #0418]). Waiting to see if it moves on, remains as it is or goes dark...
...remained frozen at that stage. Power cycled and it came back, no harm apparent. Another three reboots and it froze again. Power cycled and it came back again with no harm apparent. After this it completed the remaining reboot cycles without freezes to bring it to 20 in total and at the end of them showed no altered files through the md5sum check. I've popped the SD card out to run an fsck using a separate RasPi to check for any unseen issues and none were found. On with 0x04040000 :)
A
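The md5sum comparison of /flash and /storage described above could be scripted along these lines. This is only a sketch of the idea in Python, not the commands actually used in the tests; the function names are made up for illustration, and expected changes such as log files would still need filtering out by hand.

```python
import hashlib
import os


def md5_snapshot(root):
    """Return {relative_path: md5 hex digest} for every regular file under root."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets and other non-regular entries
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def changed_files(before, after):
    """List paths that were added, removed, or whose digest differs."""
    return sorted(p for p in before.keys() | after.keys()
                  if before.get(p) != after.get(p))
```

Taking one snapshot before a reboot cycle and one after, then passing both to changed_files, gives the list of files to inspect.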
0x04040000 isn't looking good. Two reboots and it's frozen on the initial OpenElec screen. Two subsequent cold boots and it still won't get beyond there (not even to a debug console for me to run fsck et al).
Set debugging and break=load_modules and got to a debug console. fsck showed no signs of trouble. Edited cmdline.txt to read,
boot=/dev/mmcblk0p1 disk=/dev/mmcblk0p2 bcm2835_mmc.mmc_debug=0x04040000 debugging nosplash progress
rebooted and got to OpenElec without issue. No signs of unexpected filesystem change (comparing md5sums). Restarted the reboot cycles with cmdline.txt as shown above (so I could monitor progress) and the next reboot froze at "Starting Kodi sources Setup...".
Alec
@MilhouseVH - thanks for the tip about adding debugging to cmdline.txt, worked a treat (but you know it would already of course ;-)
A
Hi,
I was wondering what the best test method is. As far as I understand, we have not yet identified the root cause. Thus it is likely we'll find a setting that is fast and reliable in our tests, but will still corrupt in a week's or a month's time. Or am I too pessimistic?
Four more reboots with 0x04040000 (total 8 so far) and a freeze, cold boot and got to OpenElec, next reboot froze, cold boot and it got to OpenElec as expected. From here, reboots up to the 19th passed without issue but the 20th froze at "Starting Kodi hacks..." and required a manual power cycle after which it booted to Kodi and showed no signs of corruption (according to fsck on another RasPi).
0x02020000 testing underway (with debugging, nosplash and progress enabled to make it easier to monitor progress).
A
When you decide which is the lowest setting that is reliable, perhaps 0x0f0f0000, can you try 0x0f000000 and 0x000f0000? That will determine which of the two places the delay is inserted is the critical one.
0x02020000 froze (at "Starting Kodi sources setup...") on the 11th and 17th reboots. A cold boot fixed it after the 11th reboot but the 17th was fatal and needed fairly extensive fsck work on another Pi (log kept if you're interested).
To be safe I've re-imaged and restored from backup before carrying out the 0x0f0f0000 and 0x0f000000 testing (underway now).
Is this issue only likely to manifest itself when using a vulnerable mini-SD card in the on-board mini-SD card slot or could we see the same sort of corruption when using a vulnerable mini-SD card in a USB based mini-SD card reader? I'm assuming it's also limited to the RasPi 2 model?
Alec
I wouldn't expect to see problems with that sdcard in a USB adapter. It's an issue with the bcm2835-mmc driver and certain sdcards. We suspect that the problem is worse on Pi2 due to the higher speed allowing sdcard accesses to occur closer together.
OK, with 0x0f0f0000 I got a freeze after 4 reboots which was not fixable (e2fsck reported the journal version was not supported by this e2fsck). Re-imaged and soon to be trying 0x000f0000.
A
0x000f0000 gave issues too so I moved back to 0x0f0f0000 to re-test it and that caused issues too after a few reboots. I've retracted even further to 0x1f1f0000 and that has been stable over 30 reboots or so now. I'll leave it rebooting overnight (with a five minute delay between them this time rather than the 30 seconds or so I've been using) and see if it remains stable over a longer period.
I wonder how many years I've in effect added to my mini-SD card and RasPi's lives with all this testing, rebooting etc. etc. (given they're probably not made for rebooting quite as frequently as I've been doing during this testing)...
A
I've been watching with interest the progress that's being made in this thread - very impressive amount of debugging effort you're putting in @AlecEdworthy :-)
@popcornmix When he finds the "optimal" setting, will that then fix it for all Pis and (micro)SD-cards, or is it possible that different cards from different manufacturers will fail / succeed with different debug (delay) values?
OK, so overnight my RasPi carried out over 80 reboots without issue with bcm2835_mmc.mmc_debug=0x1f1f0000 so given the instability (no corruption but freezes on boot) I've seen with lower values this leads me to suggest that this is currently as low as we can go for stability.
EDIT: In for a penny, in for a pound, I've taught my other RasPi how to remotely reboot my OpenElec RasPi and set it on the 5 minute reboot cycle with the OpenElec RasPi running with mmc_debug:f0f0000 to see if a more conservative reboot cycle causes fewer issues while I'm out at work. It automatically stops the cycle if the RasPi takes more than three minutes to come back.
EDIT2: Well that was short lived. Two reboots and it froze part way through the start up. I stick by my original statement, bcm2835_mmc.mmc_debug=0x1f1f0000 is as far as we can go and maintain (perfect?) stability.
A
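The "stop the cycle if the RasPi takes more than three minutes to come back" rule can be sketched as a polling loop with a deadline. This is a guess at the shape of such a script in Python, not the one actually used; is_up stands in for whatever reachability check is available (for example a ping), and all names here are hypothetical.

```python
import time


def wait_until_back(is_up, timeout_s=180.0, poll_s=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll is_up() until it returns True or timeout_s seconds have passed.

    Returns the seconds waited on success, or None on timeout, in which
    case the caller should stop the reboot cycle and leave the Pi alone.
    """
    start = clock()
    while True:
        if is_up():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_s)
```

Injecting clock and sleep keeps the loop testable without real waiting; in normal use the defaults are enough.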
@lurch at the moment we're still gathering information. These debug delays won't necessarily be the final fix. The fact that delays help suggests it is not a logical bug, but is probably a timing bug where we are doing something some sdcards do not like (perhaps violating the minimum delay between a cmd X and cmd Y being sent to sdcard).
If we understand the problem fully, then we will likely know the exact delay required, so it should work for all sdcards. If we don't then we'll have to go with a fix that cures all the tested sdcards and if other sdcards have issues in the future we may need further tweaking. Obviously the more users that can help test now the better.
@AlecEdworthy your Pi won't suffer any ill effect from frequent rebooting. The sdcard does have a limited number of writes lifespan, but this is likely to be of the order of 100k. I'm hoping we'll resolve this issue long before that is approached...
Thanks for the reassurance @popcornmix. Not too worried about the SD card but I was starting to wonder if the repeated warm restarts (and the less frequent cold power ons) might take their toll on the Pi from a sort of maximum actuations (power-up, power-down, power-up, power-down...) point of view (thinking of it like a switch). Again I'm guessing it's rated in the hundreds of thousands if not millions of cycles before any real chance of failure.
Am I right in thinking that I've got about as far as I can with the testing for now @popcornmix?
On a related note, does the boot option debugging have any effect on the Pi and its running beyond enabling the break= boot options? i.e. does putting debugging in cmdline.txt (without any break= option) have a potential to affect testing beyond just allowing you to review the progress of the boot sequence? I know progress and nosplash provide that access too but I wasn't sure if it was necessary to remove debugging when trying to match normal running conditions as closely as possible? I've tended to leave it in to avoid having to put it back in each time I needed to break the boot sequence out in order to do fsck'ing etc. but perhaps I should have removed it each time along with the break= option...
A
@AlecEdworthy I'd still like to know if 0x1f000000 or 0x001f0000 is reliable. I suspect only one of the delays is required.
On a related note, does the boot option debugging have any effect on the Pi and its running beyond enabling the break= boot options?
It will cause debug information to be logged in journalctl and kodi.log.
@MilhouseVH Thanks for clarifying that, I'll drop it unless I need the break= options then.
@popcornmix I know what I'll have my Pi doing while revising my PRINCE2 learning tonight then... ;-)
OK, 0x1f000000 survived 6 reboots before sufficient corruption to cause fsck to be unable to fix /dev/mmcblk0p2 (/storage).
A
0x001f0000 has just survived 20 sequential reboots without issue. @popcornmix should I keep testing this value with more reboots or is there a refinement to the value you'd prefer testing?
A
I think 20 good reboots sounds like enough.
You could try whittling down 0x1f000000 a little (e.g. 0x18000000 and then 0x14000000 if the first one works, and 0x1c000000 if it fails), but that's not critical.
I've got some extra tests to narrow it further, but that needs a new OE build. I'll make the changes and kick that off...
@popcornmix I assume you mean whittle down 0x00f10000 (which works OK after 44 reboots now) rather than 0x1f000000 (which corrupted after 6)?
Correct.
@popcornmix Cool, though it would help if I could get my 1's and f's around the right way too, 0x001f0000 I meant ;-)
Links in OE forum have been updated to a new build. New debug option bcm2835_mmc.mmc_debug2 added to disable some of the delays.
There are 10 calls to the write register function that calls the delay function. You disable the delay by setting bits in mmc_debug2. I'm hoping only one delay is actually required.
So, if you could sanity check:
bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x0 should behave well and
bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x3ff should corrupt (all delays disabled).
Confirm that is true. You should be able to binary chop the bits in mmc_debug2.
E.g. try 0x1f. If that corrupts, clear some bits, e.g. 0x3. If it succeeds set some bits, e.g. 0x7f. Ideally in four iterations you will find a one-bit change that switches from corrupting to not-corrupting.
If you are unclear what I'm asking, then try 0x1f and let me know the outcome and I'll suggest the next value to try.
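For anyone unsure what "binary chop" means here, the procedure can be sketched in Python as follows. It is an illustration only and rests on two assumptions that may not hold on real hardware: exactly one of the ten delays is required, and a given mmc_debug2 value either always corrupts or never does. The corrupts callback is hypothetical and stands for "boot with this mask for N reboot cycles and report whether anything went wrong".

```python
def find_required_bit(corrupts, nbits=10):
    """Binary-chop the mmc_debug2 bits to find the one delay that must stay on.

    corrupts(mask) -> True if running with mmc_debug2=mask causes problems.
    Returns (bit_index, list_of_masks_tried).
    """
    lo, hi = 0, nbits          # the required bit lies somewhere in lo..hi-1
    tried = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        mask = (1 << mid) - 1  # disable delays 0..mid-1 (bits below lo are known safe)
        tried.append(mask)
        if corrupts(mask):
            hi = mid           # a delay we just disabled is the required one
        else:
            lo = mid           # all of those delays are unnecessary
    return lo, tried


# Simulated example: pretend the delay controlled by bit 4 is the required one.
bit, masks = find_required_bit(lambda m: bool(m & (1 << 4)))
print(bit, [hex(m) for m in masks])
```

With ten bits the first mask tried is 0x1f, followed by 0x3 if that corrupts or 0x7f if it does not, which matches the sequence suggested above, and no bit takes more than four tests to isolate.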
@popcornmix been trying values of mmc_debug and have come to the conclusion that,
0x001f0000 is good
0x00180000 caused issues (lock ups without corruption after some reboots, very intermittent though, first and third locked up, fourth onwards to 20th were fine)
0x001c0000 is good
I then started making intermediate values and 0x001b0000, 0x001a0000 and 0x00190000 were good but 0x00180000 again caused issues.
I'll take a look at the new OE build.
A
@popcornmix The .img link in the OE forum seems to point to the tar file update package not an image, is that deliberate and do you plan to make a .img available please (makes it easier to fix corruption if I can just re-image rather than re-image and then have to update too).
A
Try now.
Thank you! Trying bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x0 now :)
OK, 20 reboots with bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x0 has gone without a hitch. Moving to bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x3ff to sanity check that end of the scale then I'll start picking intermediate values. I think I understand your comment about binary chopping the bits and finding a solution in four iterations :)
Sanity check confirmed,
bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x0 worked with no issues
bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x3ff caused extensive corruption (which fsck was able to correct it says) on first reboot.
I'll re-image (to ensure no damaged data remains) and start binary chopping :)
OK, more info,
Sanity checking,
mmc_debug2=0x0 was fine after 20 reboots
mmc_debug2=0x3ff caused correctable corruption on first reboot, aborted cycle and re-imaged card at this point
Binary chopping,
mmc_debug2=0x1f caused freezes on reboots 13 and 19 but a cold start got it going again after which it reached 20 reboots and there was no corruption detected during the test
mmc_debug2=0x3 was fine after 20 reboots
mmc_debug2=0x7 (one step up from 0x3 by my calculations) froze on reboot 15 but a cold start got it going again after which it reached 20 reboots and there was no corruption detected during the test. On a new cycle of 20 reboots mmc_debug2=0x7 has frozen on reboots 4 and 19 but a cold start has got it going again. However after reboot 19 there was sufficient corruption to stop the Pi booting normally, but fsck was able to fix it.
Given mmc_debug2=0x3 seems the highest value we can achieve without freezes I'll re-image the card again and then kick off an overnight reboot test with this value (unless you suggest other testing instead/first).
Alec
Overnight testing with bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x3 resulted in corruption after 106 cycles. I added in an SD card performance test (write 500MB file, read 500MB file, delete 500MB file) which saw a consistent 10MB/sec write and 16MB/sec read and added an additional 30 second pause leaving the cycle time at 2m30s approximately. I've kicked off the same test using bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x0 to soak it instead.
A
bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x0 did run fine for a 4 hour reboot loop. I did not count the number of reboots though, should have been an awful lot however.
bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x3ff froze during boot three times, didn't try any further then.
bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x1f did get a bit further than 0x3ff but freezes before being able to mount the file systems in all three attempted boots.
I must admit I'm not really up to par in chopping binaries, further instructions on what exactly I might be testing next would be helpful :) Especially as my card / Pi2 seems to be a bit more picky than Alec's..
We know if all bits of mmc_debug2 are zero, then things are good and if all bits of mmc_debug2 are ones then it corrupts.
We want to find the value of mmc_debug2 that works well, and has the most bits set. I'm hoping only one of the delays is required, which would mean one of the values with 9 bits set.
If 0x1f is a failure, then you could sanity check the inverse of that which should pass. Try 0x3e0
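The "inverse" of a mask and the "values with 9 bits set" are easy to get wrong by hand, so here is a small Python sketch of the arithmetic. It assumes mmc_debug2 is a ten-bit field, one bit per delay call site, as described earlier in the thread; the helper names are made up.

```python
NUM_DELAYS = 10                    # ten calls to the write register function
ALL_BITS = (1 << NUM_DELAYS) - 1   # 0x3ff: every delay disabled


def inverse_mask(mask):
    """Flip all ten mmc_debug2 bits: enabled delays become disabled and vice versa."""
    return mask ^ ALL_BITS


def single_delay_masks():
    """All masks with nine bits set, i.e. exactly one delay left enabled."""
    return [ALL_BITS & ~(1 << bit) for bit in range(NUM_DELAYS)]


print(hex(inverse_mask(0x1F)))                 # 0x3e0
print([hex(m) for m in single_delay_masks()])
```

If a mask and its inverse both fail, that points to more than one delay being needed, so the single-delay candidates would not be enough on their own.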
I've had 153 successful reboots under bcm2835_mmc.mmc_debug2=0x0 so I am sure that is safe as houses.
Given I have had issues with bcm2835_mmc.mmc_debug2=0x3 (so in binary terms 0000000011) I have set bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x2 (so 0000000010) to see if I can determine which bit of these gave me problems. Do we then need to start slicing and dicing the higher bits (e.g. setting binary 1111000000 aka 0x3c0 etc)?
A
@Cy4n1d3 What make and model SD card are you using? Mine is a Transcend Ultimate 16GB Micro SD (SDHC) Card, man:0x000074 oem:0x4a45 name:USD hwrev:0x0 fwrev:0x2
That list of details can be obtained using the commands,
cd /sys/class/mmc_host/mmc?/mmc?:*
echo "man:$(cat manfid) oem:$(cat oemid) name:$(cat name) hwrev:$(cat hwrev) fwrev:$(cat fwrev)"
Alec
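For collecting the same card details from several Pis, the two shell commands above can be wrapped in a short script. The Python sketch below reads the same sysfs attributes (manfid, oemid, name, hwrev, fwrev) from the same path pattern; the function name and the root parameter are illustrative additions so the code can be pointed at a copy of the tree.

```python
import glob
import os

# sysfs attribute name -> label used in the one-line summary
FIELDS = (("manfid", "man"), ("oemid", "oem"), ("name", "name"),
          ("hwrev", "hwrev"), ("fwrev", "fwrev"))


def describe_cards(root="/sys/class/mmc_host"):
    """Return one 'man:... oem:... name:... hwrev:... fwrev:...' line per card found."""
    lines = []
    for card_dir in sorted(glob.glob(os.path.join(root, "mmc?", "mmc?:*"))):
        parts = []
        for attr, label in FIELDS:
            try:
                with open(os.path.join(card_dir, attr)) as f:
                    value = f.read().strip()
            except OSError:
                value = "?"  # attribute missing or unreadable
            parts.append(f"{label}:{value}")
        lines.append(" ".join(parts))
    return lines


if __name__ == "__main__":
    for line in describe_cards():
        print(line)
```

Run on the Pi itself, this prints a line in the same format as the one quoted above for each card the kernel has enumerated.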
@AlecEdworthy
Given I have had issues with bcm2835_mmc.mmc_debug2=0x3 (so in binary terms 0000000011) I have set bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x2 (so 0000000010) to see if I can determine which bit of these gave me problems. Do we then need to start slicing and dicing the higher bits (e.g. setting binary 1111000000 aka 0x3c0 etc)?
If you believe 0x3 is bad, then it's worth checking the inverse 0x3fc is good. If they are both bad then it suggests there is more than one place that needs the delay.
bcm2835_mmc.mmc_debug=0x001f0000 bcm2835_mmc.mmc_debug2=0x3e0 does at least boot up in most of my attempts so far but still had a freeze on boot for me.
Couldn't reproduce the freeze in ~20 reboots though.
@AlecEdworthy - I'm running a Samsung 16GB Evo Cl10 UHS-1 for these tests.
These are the details, thanks for the c&p commands :)
man:0x00001b oem:0x534d name:00000 hwrev:0x1 fwrev:0x0
Lots of people seem to be affected by an issue with sdcards. I'm using a Samsung Evo 16 GB micro SD card with raspbian, and I encounter corruption on the SD card every time. It's easy to reproduce :
I need to check on my linux system (I'm at work) if I can fix the card at this step or not. I flashed the raspbian img multiple times and it doesn't work.
It can also be reproduced just by making the RPI reboot multiple times through the terminal (using sudo reboot)
I have 3 of them so I hope it's just a firmware bug (I'll try with another one to be sure it's not the sdcard itself)
Here are two threads that gathered this issue (without creating a post in here though :/ ) : http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=101183&p=703772&hilit=error+110#p703772 http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=98935
One post is interesting :
If you want card info, I can provide it, but you need to tell me what to run on which system, because I didn't find a way to print sdcard characteristics from Mac OS X.