Closed andersthomson closed 5 years ago
My usual rule of thumb says that seemingly random crashes in MM code are often due to a power supply fault. Is your Pi2 adequately powered? Are the USB devices externally powered?
Start your use case, then after a while run vcgencmd get_throttled to check for undervoltage issues.
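For anyone reading the result by hand, here is a minimal sketch that decodes the value printed by vcgencmd get_throttled. The bit meanings follow the Raspberry Pi firmware documentation; the function name decode_throttled is just an illustrative helper, not part of any tool.

```python
# Bit positions reported by `vcgencmd get_throttled`.
# Low bits describe the current state, bits 16+ are sticky since boot.
THROTTLE_FLAGS = {
    0: "under-voltage detected (now)",
    1: "ARM frequency capped (now)",
    2: "currently throttled",
    3: "soft temperature limit active (now)",
    16: "under-voltage has occurred since boot",
    17: "ARM frequency capping has occurred since boot",
    18: "throttling has occurred since boot",
    19: "soft temperature limit has occurred since boot",
}


def decode_throttled(output):
    """Decode 'throttled=0x50005' (or a bare '0x50005') into flag descriptions."""
    value = int(output.strip().split("=")[-1], 16)
    return [text for bit, text in sorted(THROTTLE_FLAGS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    for sample in ("throttled=0x0", "throttled=0x50005"):
        print(sample, "->", decode_throttled(sample) or ["no flags set"])
```

A result of 0x0 means none of these conditions has been seen since boot. Any bit in the 16-19 range means the condition occurred at some point, even if it is not active right now.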
If the crashes are more systematic then I would begin to suspect one of the drivers.
The PSU has a nice Raspberry Pi logo on it, and identifies itself as STONTRONICS Model DSA-13PFC-05 FCA, 5.1 V, 2.5 A.
USB-wise, I have two self-powered hubs connected to it. I've run vcgencmd get_throttled a few times and always get 0x0 back. It's running at 600 MHz, using the powersave governor.
So it's not power then - thanks, I had to ask.
Looking more closely at the log, you can see that it appears to be corruption of a contiguous set of struct pages:
[93836.025194] page:ba409070 count:0 mapcount:0 mapping: (null) index:0x1
[93836.480394] page:ba409094 count:-1370488635 mapcount:0 mapping:2a50b5b2 index:0x1 compound_mapcount: -1514626482
[93836.963111] page:ba4090b8 count:1387088643 mapcount:0 mapping:a5b8a24d index:0x1
[93837.460506] page:ba4090dc count:1864203433 mapcount:0 mapping:e6b80c29 index:0x1
[93837.984584] page:ba409100 count:2044786721 mapcount:0 mapping:a32773b5 index:0x1
[93838.472386] page:ba409124 count:-1079530538 mapcount:0 mapping:d009fdbd index:0x1 compound_mapcount: -456512508
[93838.969906] page:ba409148 count:941024227 mapcount:0 mapping:e4ca2c03 index:0x1
[93839.450101] page:ba40916c count:1460397259 mapcount:0 mapping:78b8b8b8 index:0x1
[93839.932911] page:ba409190 count:-117901064 mapcount:0 mapping:f8f8f8f8 index:0x1 compound_mapcount: 67372037
[93840.460110] page:ba4091b4 count:-2071690236 mapcount:0 mapping:04040404 index:0x1 compound_mapcount: 0
[93840.826750] page:ba4091d8 count:-1 mapcount:0 mapping:ffffffff index:0x1 compound_mapcount: 0
[93841.222381] page:ba4091fc count:0 mapcount:0 mapping:ffffffff index:0x1 compound_mapcount: 1
This is interesting because struct pages don't move - they are stored in a single large array allocated at start of day - so something has been writing in the wrong place. A hex dump of the corruption would have been useful, but looking at the values above it is a mix of repeated byte patterns and seemingly random values - perhaps video or audio data?
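To make the "contiguous array" and "repeated byte patterns" observations easier to check, here is a rough sketch that parses a few of the log lines above, computes the spacing between successive page addresses, and re-prints the signed count values as 32-bit hex. The helper names (parse_pages, strides, repeated_byte) are made up for illustration, and treating count as a 32-bit field is an assumption based on this being a 32-bit ARM kernel.

```python
import re

# Five consecutive lines copied from the log above.
LOG = """\
[93839.450101] page:ba40916c count:1460397259 mapcount:0 mapping:78b8b8b8 index:0x1
[93839.932911] page:ba409190 count:-117901064 mapcount:0 mapping:f8f8f8f8 index:0x1 compound_mapcount: 67372037
[93840.460110] page:ba4091b4 count:-2071690236 mapcount:0 mapping:04040404 index:0x1 compound_mapcount: 0
[93840.826750] page:ba4091d8 count:-1 mapcount:0 mapping:ffffffff index:0x1 compound_mapcount: 0
[93841.222381] page:ba4091fc count:0 mapcount:0 mapping:ffffffff index:0x1 compound_mapcount: 1
"""

PAGE_RE = re.compile(r"page:([0-9a-f]+) count:(-?\d+) mapcount:\S+ mapping:\s*(\S+)")


def parse_pages(log):
    """Return (address, count as unsigned 32-bit hex, mapping) per 'page:' line."""
    rows = []
    for match in PAGE_RE.finditer(log):
        address = int(match.group(1), 16)
        # Assumption: count is a 32-bit field, so mask the signed decimal value.
        count_hex = format(int(match.group(2)) & 0xFFFFFFFF, "08x")
        rows.append((address, count_hex, match.group(3)))
    return rows


def strides(rows):
    """Byte distance between each reported struct page and the next one."""
    return [b[0] - a[0] for a, b in zip(rows, rows[1:])]


def repeated_byte(word):
    """True if a 32-bit hex word is a single byte repeated, e.g. f8f8f8f8."""
    return len(word) == 8 and len({word[i:i + 2] for i in range(0, 8, 2)}) == 1


if __name__ == "__main__":
    rows = parse_pages(LOG)
    print("strides:", strides(rows))
    for address, count_hex, mapping in rows:
        note = "repeated byte" if repeated_byte(mapping) else ""
        print(f"{address:08x} count={count_hex} mapping={mapping} {note}")
```

On these lines the addresses are evenly spaced, which fits adjacent entries of one array, and a count such as -117901064 turns out to be f8f8f8f8 when viewed as hex, matching the neighbouring mapping value.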
If it happens again then report back, but until then we don't have enough to go on.
Will do. I've run this box for 2+ years, and for all but the initial few months on my own compiled kernels (Gentoo). I've never seen this exact error, but there have been other kernel crashes.
As a tidbit, when I moved to gcc5 (userspace and kernel) I had the impression that I got more segfaults (userspace) and more kernel memory problems, such as unaligned accesses in kernel space. That got me thinking that gcc5 on Gentoo was not stable for ARM. Now I'm back to gcc4 for everything, and those errors are gone (user and kernel space). I'm inclined to move userspace to gcc5 in a few months to see what happens.
I see you use crosstool gcc 4.x. Any reason you're not on gcc5 yet?
Another question I should have asked earlier - are you overclocked? Toolchain changes can reveal different silicon-specific timing issues when running the chip at out-of-spec clock speeds.
I'll let @popcornmix answer the gcc5 question.
gcc 4.9 has been used to match the version debian jessie uses. I'm currently testing gcc 6.4 (which matches debian stretch version) and gcc 7.2 (latest version supported by crosstool-ng). I'll add one or both to tools repo when I'm happy everything works. The latest kernel update (4.9.46) was built with gcc 6.4 and appears to be fine. I'd be very surprised if a stable version of gcc causes panics when building the linux kernel. As @pelwell says, power supplies and overclocking are far more likely causes of crashes.
Hi,
I'm not overclocked. config.txt says:
dtdebug=on
kernel=kernel7.img
cpupower frequency-info:
analyzing CPU 0:
  driver: BCM2835 CPUFreq
  CPUs which run at the same hardware frequency: 0 1 2 3
  CPUs which need to have their frequency coordinated by software: 0 1 2 3
  maximum transition latency: 355 us.
  hardware limits: 600 MHz - 900 MHz
  available frequency steps: 600 MHz, 900 MHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance, schedutil
  current policy: frequency should be within 600 MHz and 900 MHz.
                  The governor "powersave" may decide which speed to use within this range.
  current CPU frequency is 600 MHz (asserted by call to hardware).
  cpufreq stats: 600 MHz:100.00%, 900 MHz:0.00%
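As a quick cross-check of the "not overclocked" answer, a small sketch like the following scans config.txt contents for the usual overclocking options. The key list is a subset of the documented config.txt overclocking settings and is not exhaustive; overclock_settings is a hypothetical helper name.

```python
# Common config.txt options that change clocks or voltage (not exhaustive).
OVERCLOCK_KEYS = {
    "arm_freq",
    "core_freq",
    "gpu_freq",
    "sdram_freq",
    "over_voltage",
    "force_turbo",
}


def overclock_settings(config_text):
    """Return {key: value} for any overclock-related lines in config.txt text."""
    found = {}
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in OVERCLOCK_KEYS:
            found[key] = value
    return found


if __name__ == "__main__":
    reported = "dtdebug=on\nkernel=kernel7.img\n"
    print(overclock_settings(reported) or "no overclock settings found")
```

With the two lines quoted above nothing is flagged, which agrees with the 600 MHz - 900 MHz hardware limits that cpupower reports.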
If I get more memory related errors I'll look into changing the psu. As for now, the kevent errors are the most common source of lockups.
> If I get more memory related errors I'll look into changing the psu.
I would advise against spending money on a new PSU - based on what you've said so far it now seems unlikely to be a power problem. However, transient power glitches may go unnoticed by the firmware, so we can't rule it out.
If you get a similar crash again, one option would be to patch the "Bad page state" error code to print a hex dump of a larger window around the offending struct page in the hope of being able to identify the source of the corruption.
Thanks. I'll keep that in mind.
Here is another weird kernel error which might be related. Because the kernel paging error is followed by the RCU stall, which causes an immediate reboot, I have no chance to check e.g. the throttling state before the reboot. Is there any chance some of those diagnostics could be added to the oops handling (or the reboot-on-oops handling)? That way they would make it into the serial log...
[58483.819723] Unable to handle kernel paging request at virtual address f5613015
[58483.831713] pgd = b8b54000
[58483.838913] [f5613015] *pgd=00000000
[58483.846863] Internal error: Oops: 805 [#1] SMP ARM
[58483.855897] Modules linked in: rc_pinnacle_pctv_hd em28xx_rc rc_core si2157 si2168 i2c_mux tda18271 cxd2820r em28xx_dvb dm_mod dvb_core em28xx tveeprom v4l2_common ftdi_sio videodev media usbserial evdev bcm2835_gpiomem uio_pdrv_genirq uio fixed bridge stp llc veth sch_fq_codel nfsd ip_tables x_tables ipv6
[58483.900462] CPU: 0 PID: 40 Comm: kswapd0 Not tainted 4.9.45-v7+ #1031
[58483.911123] Hardware name: BCM2835
[58483.918756] task: b9ef9d80 task.stack: b9f34000
[58483.927604] PC is at es_shrink+0xa8/0x368
[58483.936028] LR is at 0xf5613011
[58483.943236] pc : [<80349854>] lr : [
Closing due to lack of activity. Please request to be reopened if you feel this issue is still relevant.
While streaming TV from tvheadend, I was greeted with this bug. This is on the foundation-provided 4.9.45 kernel. I have never seen this before (on my self-compiled kernels).