jsk-ros-pkg / jsk_robot

https://github.com/jsk-ros-pkg/jsk_robot

[PR1012] C1 on PR1012 sometimes goes down suddenly #1776

Open knorth55 opened 1 year ago

knorth55 commented 1 year ago

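C1 on PR1012 sometimes goes down suddenly. The power lamp stays on, but ssh and other network communication stop working. It happened again today during catkin build, and /var/log/syslog looks like the following. Could this be a memory hardware error?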

cc. @nakane11 @iory

Feb 17 11:45:01 pr1012 CRON[16036]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 17 11:45:36 pr1012 kernel: [149396.111599] {48}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Feb 17 11:45:36 pr1012 kernel: [149396.111603] {48}[Hardware Error]: It has been corrected by h/w and requires no further action
Feb 17 11:45:36 pr1012 kernel: [149396.111618] {48}[Hardware Error]: event severity: corrected
Feb 17 11:45:36 pr1012 kernel: [149396.111620] {48}[Hardware Error]:  Error 0, type: corrected
Feb 17 11:45:36 pr1012 kernel: [149396.111621] {48}[Hardware Error]:  fru_text: CorrectedErr
Feb 17 11:45:36 pr1012 kernel: [149396.111622] {48}[Hardware Error]:   section_type: memory error
Feb 17 11:45:36 pr1012 kernel: [149396.111624] {48}[Hardware Error]:   node: 1 device: 0
Feb 17 11:45:36 pr1012 kernel: [149396.111625] {48}[Hardware Error]:   error_type: 2, single-bit ECC
[long run of NUL bytes (^@) in the syslog omitted]

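To make it easier to see how often these corrected errors occur and which node/device they point at, the [Hardware Error] lines can be grouped by their {N} record number. The following is a minimal sketch, assuming the syslog line format quoted above; the function name `parse_hardware_errors` is hypothetical and not part of any existing tool.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   "... kernel: [149396.111620] {48}[Hardware Error]:  Error 0, type: corrected"
HW_ERR = re.compile(r"\{(?P<seq>\d+)\}\[Hardware Error\]:\s*(?P<body>.*\S)")
# The location line carries two fields on one line: "node: 1 device: 0"
NODE_DEV = re.compile(r"node: (\d+) device: (\d+)")


def parse_hardware_errors(lines):
    """Group [Hardware Error] lines by their {N} record number.

    Returns a dict mapping record number -> dict of fields.
    Lines without a "key: value" shape are skipped.
    """
    records = defaultdict(dict)
    for line in lines:
        m = HW_ERR.search(line)
        if not m:
            continue
        rec = records[int(m.group("seq"))]
        body = m.group("body")
        loc = NODE_DEV.fullmatch(body)
        if loc:
            rec["node"] = int(loc.group(1))
            rec["device"] = int(loc.group(2))
            continue
        key, sep, value = body.partition(": ")
        if sep:
            rec[key.strip()] = value.strip()
    return dict(records)


# Possible usage (path taken from the comment above):
#   with open("/var/log/syslog", errors="replace") as f:
#       for seq, rec in parse_hardware_errors(f).items():
#           print(seq, rec.get("section_type"), rec.get("node"), rec.get("error_type"))
```

If every record reports the same node and device, that would be consistent with a single faulty module or slot, although the log alone does not prove it.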
nakane11 commented 1 year ago

The same error occurred again today.

First occurrence

Feb 28 21:33:51 pr1012 kernel: [107328.679342] IPMI message handler: BMC returned incorrect response, expected netfn 2d cmd 0, got netfn 2d cmd 1
Feb 28 21:34:47 pr1012 kernel: [107385.111954] {20}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Feb 28 21:34:47 pr1012 kernel: [107385.111957] {20}[Hardware Error]: It has been corrected by h/w and requires no further action
Feb 28 21:34:47 pr1012 kernel: [107385.111959] {20}[Hardware Error]: event severity: corrected
Feb 28 21:34:47 pr1012 kernel: [107385.111961] {20}[Hardware Error]:  Error 0, type: corrected
Feb 28 21:34:47 pr1012 kernel: [107385.111962] {20}[Hardware Error]:  fru_text: CorrectedErr
Feb 28 21:34:47 pr1012 kernel: [107385.111964] {20}[Hardware Error]:   section_type: memory error
Feb 28 21:34:47 pr1012 kernel: [107385.111965] {20}[Hardware Error]:   node: 1 device: 0 
Feb 28 21:34:47 pr1012 kernel: [107385.111966] {20}[Hardware Error]:   error_type: 2, single-bit ECC
Feb 28 21:35:01 pr1012 CRON[21138]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 28 21:35:05 pr1012 kernel: [107402.577163] IPMI message handler: BMC returned incorrect response, expected netfn 5 cmd 2d, got netfn b cmd 23
Feb 28 21:35:13 pr1012 kernel: [107410.839751] IPMI message handler: BMC returned incorrect response, expected netfn b cmd 23, got netfn 20 cmd 7
Feb 28 21:35:13 pr1012 kernel: [107410.954524] IPMI message handler: BMC returned incorrect response, expected netfn b cmd 23, got netfn b cmd 0
Feb 28 21:35:19 pr1012 kernel: [107416.788211] IPMI message handler: BMC returned incorrect response, expected netfn b cmd 23, got netfn 23 cmd fc
Feb 28 21:35:34 pr1012 kernel: [107431.405958] IPMI message handler: BMC returned incorrect response, expected netfn 5 cmd 2d, got netfn b cmd 23
Feb 28 21:35:47 pr1012 kernel: [107445.388922] {21}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Feb 28 21:35:47 pr1012 kernel: [107445.388926] {21}[Hardware Error]: It has been corrected by h/w and requires no further action
Feb 28 21:35:47 pr1012 kernel: [107445.388927] {21}[Hardware Error]: event severity: corrected
Feb 28 21:35:47 pr1012 kernel: [107445.388928] {21}[Hardware Error]:  Error 0, type: corrected
Feb 28 21:35:47 pr1012 kernel: [107445.388929] {21}[Hardware Error]:  fru_text: CorrectedErr
Feb 28 21:35:47 pr1012 kernel: [107445.388930] {21}[Hardware Error]:   section_type: memory error
Feb 28 21:35:47 pr1012 kernel: [107445.388931] {21}[Hardware Error]:   node: 1 device: 0 
Feb 28 21:35:47 pr1012 kernel: [107445.388932] {21}[Hardware Error]:   error_type: 2, single-bit ECC
Feb 28 21:36:48 pr1012 kernel: [107506.435380] {22}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Feb 28 21:36:48 pr1012 kernel: [107506.435383] {22}[Hardware Error]: It has been corrected by h/w and requires no further action
Feb 28 21:36:48 pr1012 kernel: [107506.435385] {22}[Hardware Error]: event severity: corrected
Feb 28 21:36:48 pr1012 kernel: [107506.435386] {22}[Hardware Error]:  Error 0, type: corrected
Feb 28 21:36:48 pr1012 kernel: [107506.435388] {22}[Hardware Error]:  fru_text: CorrectedErr
Feb 28 21:36:48 pr1012 kernel: [107506.435389] {22}[Hardware Error]:   section_type: memory error
Feb 28 21:36:48 pr1012 kernel: [107506.435390] {22}[Hardware Error]:   node: 1 device: 0 
Feb 28 21:36:48 pr1012 kernel: [107506.435391] {22}[Hardware Error]:   error_type: 2, single-bit ECC
Feb 28 21:37:48 pr1012 kernel: [107566.455847] {23}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Feb 28 21:37:48 pr1012 kernel: [107566.455863] {23}[Hardware Error]: It has been corrected by h/w and requires no further action
Feb 28 21:37:48 pr1012 kernel: [107566.455865] {23}[Hardware Error]: event severity: corrected
Feb 28 21:37:48 pr1012 kernel: [107566.455866] {23}[Hardware Error]:  Error 0, type: corrected
Feb 28 21:37:48 pr1012 kernel: [107566.455867] {23}[Hardware Error]:  fru_text: CorrectedErr
Feb 28 21:37:48 pr1012 kernel: [107566.455869] {23}[Hardware Error]:   section_type: memory error
Feb 28 21:37:48 pr1012 kernel: [107566.455870] {23}[Hardware Error]:   node: 1 device: 0 
Feb 28 21:37:48 pr1012 kernel: [107566.455871] {23}[Hardware Error]:   error_type: 2, single-bit ECC
Feb 28 21:38:04 pr1012 kernel: [107581.710983] IPMI message handler: BMC returned incorrect response, expected netfn 5 cmd 2d, got netfn b cmd 23
Feb 28 21:38:43 pr1012 kernel: [107621.600387] IPMI message handler: BMC returned incorrect response, expected netfn b cmd 23, got netfn b cmd 2d
Feb 28 21:38:49 pr1012 kernel: [107626.878056] {24}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Feb 28 21:38:49 pr1012 kernel: [107626.878059] {24}[Hardware Error]: It has been corrected by h/w and requires no further action
Feb 28 21:38:49 pr1012 kernel: [107626.878060] {24}[Hardware Error]: event severity: corrected
Feb 28 21:38:49 pr1012 kernel: [107626.878062] {24}[Hardware Error]:  Error 0, type: corrected
Feb 28 21:38:49 pr1012 kernel: [107626.878063] {24}[Hardware Error]:  fru_text: CorrectedErr
Feb 28 21:38:49 pr1012 kernel: [107626.878064] {24}[Hardware Error]:   section_type: memory error
Feb 28 21:38:49 pr1012 kernel: [107626.878065] {24}[Hardware Error]:   node: 1 device: 0 
Feb 28 21:38:49 pr1012 kernel: [107626.878066] {24}[Hardware Error]:   error_type: 2, single-bit ECC
Feb 28 21:38:49 pr1012 kernel: [107627.116635] IPMI message handler: BMC returned incorrect response, expected netfn 2d cmd 0, got netfn 3a cmd 8
Feb 28 21:47:50 pr1012 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="635" x-info="http://www.rsyslog.com"] start
Feb 28 21:47:50 pr1012 rsyslogd-2307: warning: ~ action is deprecated, consider using the 'stop' statement instead [try http://www.rsyslog.com/e/2307 ]
Feb 28 21:47:50 pr1012 rsyslogd-2307: message repeated 2 times: [warning: ~ action is deprecated, consider using the 'stop' statement instead [try http://www.rsyslog.com/e/2307 ]]
Feb 28 21:47:50 pr1012 rsyslogd: rsyslogd's groupid changed to 104
Feb 28 21:47:50 pr1012 rsyslogd: rsyslogd's userid changed to 101
Feb 28 21:47:50 pr1012 kernel: [    0.000000] Initializing cgroup subsys cpuset
Feb 28 21:47:50 pr1012 kernel: [    0.000000] Initializing cgroup subsys cpu
Feb 28 21:47:50 pr1012 kernel: [    0.000000] Initializing cgroup subsys cpuacct
Feb 28 21:47:50 pr1012 kernel: [    0.000000] Linux version 3.19.0-49-lowlatency (buildd@lgw01-08) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #55~14.04.1-Ubuntu SMP PREEMPT Fri Jan 22 13:21:43 UTC 2016 (Ubuntu 3.19.0-49.55~14.04.1-lowlatency 3.19.8-ckt12)
Feb 28 21:47:50 pr1012 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.19.0-49-lowlatency root=UUID=76fe5b88-70bc-48ad-ae51-40b76b0412c4 ro biosdevname=0 net.ifnames=0 console=prolific,115200n8 console=tty1 security=selinux selinux=1
Feb 28 21:47:50 pr1012 kernel: [    0.000000] KERNEL supported cpus:
Feb 28 21:47:50 pr1012 kernel: [    0.000000]   Intel GenuineIntel
Feb 28 21:47:50 pr1012 kernel: [    0.000000]   AMD AuthenticAMD
Feb 28 21:47:50 pr1012 kernel: [    0.000000]   Centaur CentaurHauls
Feb 28 21:47:50 pr1012 kernel: [    0.000000] e820: BIOS-provided physical RAM map:
Feb 28 21:47:50 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000057fff] usable
Feb 28 21:47:50 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000058000-0x0000000000058fff] reserved
Feb 28 21:47:50 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000059000-0x000000000009ffff] usable
Feb 28 21:47:50 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000080a9ffff] usable
Feb 28 21:47:50 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000080aa0000-0x0000000080aa0fff] ACPI NVS

Second occurrence

Feb 28 21:52:07 pr1012 kernel: [  262.449653] ata1.00: status: { DRDY }
Feb 28 21:52:07 pr1012 kernel: [  262.449654] ata1.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:52:07 pr1012 kernel: [  262.449656] ata1.00: cmd 61/08:e0:70:0c:4c/00:00:47:00:00/40 tag 28 ncq 4096 out
Feb 28 21:52:07 pr1012 kernel: [  262.449656]          res 40/00:90:68:07:4c/00:00:47:00:00/40 Emask 0x10 (ATA bus error)
Feb 28 21:52:07 pr1012 kernel: [  262.449657] ata1.00: status: { DRDY }
Feb 28 21:52:07 pr1012 kernel: [  262.449657] ata1.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:52:07 pr1012 kernel: [  262.449659] ata1.00: cmd 61/08:e8:80:0c:4c/00:00:47:00:00/40 tag 29 ncq 4096 out
Feb 28 21:52:07 pr1012 kernel: [  262.449659]          res 40/00:90:68:07:4c/00:00:47:00:00/40 Emask 0x10 (ATA bus error)
Feb 28 21:52:07 pr1012 kernel: [  262.449660] ata1.00: status: { DRDY }
Feb 28 21:52:07 pr1012 kernel: [  262.449660] ata1.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:52:07 pr1012 kernel: [  262.449662] ata1.00: cmd 61/08:f0:a8:0c:4c/00:00:47:00:00/40 tag 30 ncq 4096 out
Feb 28 21:52:07 pr1012 kernel: [  262.449662]          res 40/00:90:68:07:4c/00:00:47:00:00/40 Emask 0x10 (ATA bus error)
Feb 28 21:52:07 pr1012 kernel: [  262.449663] ata1.00: status: { DRDY }
Feb 28 21:52:07 pr1012 kernel: [  262.449665] ata1: hard resetting link
Feb 28 21:52:08 pr1012 kernel: [  262.755161] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Feb 28 21:52:08 pr1012 kernel: [  262.755429] ata1.00: supports DRM functions and may not be fully accessible
Feb 28 21:52:08 pr1012 kernel: [  262.755657] ata1.00: disabling queued TRIM support
Feb 28 21:52:08 pr1012 kernel: [  262.756437] ata1.00: supports DRM functions and may not be fully accessible
Feb 28 21:52:08 pr1012 kernel: [  262.756612] ata1.00: disabling queued TRIM support
Feb 28 21:52:08 pr1012 kernel: [  262.756961] ata1.00: configured for UDMA/133
Feb 28 21:52:08 pr1012 kernel: [  262.757073] ahci 0000:00:17.0: port does not support device sleep
Feb 28 21:52:08 pr1012 kernel: [  262.757116] ata1: EH complete
Feb 28 21:52:35 pr1012 kernel: [  290.205410] usb 1-5: current rate 0 is different from the runtime rate 44100
Feb 28 21:52:35 pr1012 kernel: [  290.205755] usb 1-5: current rate 0 is different from the runtime rate 44100
Feb 28 21:52:35 pr1012 kernel: [  290.209136] usb 1-5: current rate 0 is different from the runtime rate 44100
Feb 28 21:52:35 pr1012 kernel: [  290.209648] usb 1-5: current rate 0 is different from the runtime rate 44100
Feb 28 21:52:35 pr1012 kernel: [  290.220037] usb 1-5: current rate 0 is different from the runtime rate 44100
Feb 28 21:52:35 pr1012 kernel: [  290.220802] usb 1-5: current rate 0 is different from the runtime rate 44100
Feb 28 21:52:35 pr1012 kernel: [  290.223870] usb 1-5: current rate 0 is different from the runtime rate 44100
Feb 28 21:52:35 pr1012 kernel: [  290.224200] usb 1-5: current rate 0 is different from the runtime rate 44100
Feb 28 21:52:35 pr1012 pulseaudio[13300]: [pulseaudio] server-lookup.c: Unable to contact D-Bus: org.freedesktop.DBus.Error.NotSupported: Unable to autolaunch a dbus-daemon without a $DISPLAY for X11
Feb 28 21:52:35 pr1012 pulseaudio[13300]: [pulseaudio] main.c: Unable to contact D-Bus: org.freedesktop.DBus.Error.NotSupported: Unable to autolaunch a dbus-daemon without a $DISPLAY for X11
Feb 28 21:52:35 pr1012 pulseaudio[13300]: [pulseaudio] bluetooth-util.c: org.bluez.Manager.GetProperties() failed: org.freedesktop.DBus.Error.AccessDenied: Rejected send message, 2 matched rules; type="method_call", sender=":1.74" (uid=1001 pid=13300 comm="/usr/bin/pulseaudio --start --log-target=syslog ") interface="org.bluez.Manager" member="GetProperties" error name="(unset)" requested_reply="0" destination="org.bluez" (uid=0 pid=690 comm="/usr/sbin/bluetoothd ")
Feb 28 21:52:35 pr1012 kernel: [  290.256948] usb 1-5: current rate 0 is different from the runtime rate 48000
Feb 28 21:52:35 pr1012 kernel: [  290.257284] usb 1-5: current rate 0 is different from the runtime rate 48000
Feb 28 21:59:55 pr1012 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="640" x-info="http://www.rsyslog.com"] start
Feb 28 21:59:55 pr1012 rsyslogd-2307: warning: ~ action is deprecated, consider using the 'stop' statement instead [try http://www.rsyslog.com/e/2307 ]
Feb 28 21:59:55 pr1012 rsyslogd-2307: message repeated 2 times: [warning: ~ action is deprecated, consider using the 'stop' statement instead [try http://www.rsyslog.com/e/2307 ]]
Feb 28 21:59:55 pr1012 rsyslogd: rsyslogd's groupid changed to 104
Feb 28 21:59:55 pr1012 rsyslogd: rsyslogd's userid changed to 101
Feb 28 21:59:55 pr1012 kernel: [    0.000000] Initializing cgroup subsys cpuset
Feb 28 21:59:55 pr1012 kernel: [    0.000000] Initializing cgroup subsys cpu
Feb 28 21:59:55 pr1012 kernel: [    0.000000] Initializing cgroup subsys cpuacct
Feb 28 21:59:55 pr1012 kernel: [    0.000000] Linux version 3.19.0-49-lowlatency (buildd@lgw01-08) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #55~14.04.1-Ubuntu SMP PREEMPT Fri Jan 22 13:21:43 UTC 2016 (Ubuntu 3.19.0-49.55~14.04.1-lowlatency 3.19.8-ckt12)
Feb 28 21:59:55 pr1012 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.19.0-49-lowlatency root=UUID=76fe5b88-70bc-48ad-ae51-40b76b0412c4 ro biosdevname=0 net.ifnames=0 console=prolific,115200n8 console=tty1 security=selinux selinux=1
Feb 28 21:59:55 pr1012 kernel: [    0.000000] KERNEL supported cpus:
Feb 28 21:59:55 pr1012 kernel: [    0.000000]   Intel GenuineIntel
Feb 28 21:59:55 pr1012 kernel: [    0.000000]   AMD AuthenticAMD
Feb 28 21:59:55 pr1012 kernel: [    0.000000]   Centaur CentaurHauls
Feb 28 21:59:55 pr1012 kernel: [    0.000000] e820: BIOS-provided physical RAM map:
Feb 28 21:59:55 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000057fff] usable
Feb 28 21:59:55 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000058000-0x0000000000058fff] reserved
Feb 28 21:59:55 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000059000-0x000000000009ffff] usable
Feb 28 21:59:55 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000080a9ffff] usable
Feb 28 21:59:55 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000080aa0000-0x0000000080aa0fff] ACPI NVS
Feb 28 21:59:55 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000080aa1000-0x0000000080aeafff] reserved
Feb 28 21:59:55 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000080aeb000-0x00000000855fffff] usable
Feb 28 21:59:55 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x0000000085600000-0x000000008593dfff] reserved
Feb 28 21:59:55 pr1012 kernel: [    0.000000] BIOS-e820: [mem 0x000000008593e000-0x0000000085afdfff] usable

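The first time, the connection dropped while I was running sudo initctl stop robot, so I rebooted. The second time was right after the calibration that follows the reboot had finished.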

knorth55 commented 1 year ago

The first occurrence shows the same memory error, but the second one looks a bit different. Still, replacing the memory with new modules and testing again may be worth trying.
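One way to compare the two occurrences is to count which kinds of kernel messages appear before each reboot. This is a rough sketch only: the category names are hypothetical, and the patterns are copied from the messages quoted in this thread rather than from any kernel documentation.

```python
import re
from collections import Counter

# Hypothetical category names; patterns taken from the log excerpts in this thread.
PATTERNS = {
    "memory_ecc": re.compile(r"\[Hardware Error\]:\s+section_type: memory error"),
    "ipmi_bmc": re.compile(r"IPMI message handler: BMC returned incorrect response"),
    "ata_bus": re.compile(r"\(ATA bus error\)"),
}


def classify(lines):
    """Count log lines per category; lines matching no pattern are ignored."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
    return counts
```

Run on the two excerpts above, the first would be counted under memory_ecc and ipmi_bmc and the second under ata_bus, which matches the impression that the second failure is of a different kind.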

knorth55 commented 1 year ago

Memory currently used in the PR2: CENTURY MICRO CK16GX2-D4UE2133

Spec of the memory currently used in the PR2: DDR4-2133 ECC Unbuffered DIMM 16GB×2

Motherboard model and supported memory: ASRock Rack E3C236D2I Supports DDR4 2400*/2133/1866/1600 ECC/non-ECC** UDIMM memory https://www.asrockrack.com/general/productdetail.jp.asp?Model=E3C236D2I#Specifications

@nakane11 @iory I think the best next step is to buy memory that meets this spec and try it. https://docs.google.com/presentation/d/1b-UQShSEY_pswifYKupBrZt8wmQEFET3_E3nsp-V9x8/edit#slide=id.ge9022d6b43_0_10
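As a quick sanity check when choosing a replacement, a candidate module can be compared against the spec line quoted above. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, the supported values are transcribed from the quoted line only, and the conditions behind the * and ** footnotes on the vendor page, as well as capacity limits, are not modelled.

```python
from dataclasses import dataclass

# Transcribed from the quoted board spec line; footnote conditions are not modelled.
SUPPORTED_GENERATION = "DDR4"
SUPPORTED_SPEEDS = {2400, 2133, 1866, 1600}
SUPPORTED_MODULE_TYPE = "UDIMM"  # unbuffered DIMM; ECC and non-ECC are both listed


@dataclass(frozen=True)
class MemoryModule:
    generation: str   # e.g. "DDR4"
    speed: int        # e.g. 2133 for DDR4-2133
    module_type: str  # e.g. "UDIMM", "RDIMM"
    ecc: bool


def compatibility_issues(module):
    """Return a list of human-readable mismatches (empty if none were found)."""
    issues = []
    if module.generation != SUPPORTED_GENERATION:
        issues.append(f"generation {module.generation} is not {SUPPORTED_GENERATION}")
    if module.speed not in SUPPORTED_SPEEDS:
        issues.append(f"speed {module.speed} is not in {sorted(SUPPORTED_SPEEDS)}")
    if module.module_type != SUPPORTED_MODULE_TYPE:
        issues.append(f"module type {module.module_type} is not {SUPPORTED_MODULE_TYPE}")
    return issues


# The module currently installed, as described above: DDR4-2133 ECC Unbuffered DIMM
current = MemoryModule("DDR4", 2133, "UDIMM", ecc=True)
print(compatibility_issues(current))
```

An empty list only means no mismatch was found against this one line of the spec; the vendor's qualified memory list remains the authoritative source.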

nakane11 commented 1 year ago

@knorth55 Thank you.

Work log since yesterday

knorth55 commented 1 year ago

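I suggest connecting a display, entering the BIOS, and checking that the boot priority is correct. On the PR2 motherboard, once the SSD has been removed, its boot priority drops to the bottom, and the machine may then hang while trying to boot from the UEFI shell or the HDD.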
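If the machine does come up, the boot order can also be inspected from Linux instead of from the BIOS screen. The sketch below parses the text printed by efibootmgr; it only applies if the system boots in UEFI mode and efibootmgr is installed, neither of which is confirmed in this thread. The function name and the sample entry labels are hypothetical.

```python
import re

ORDER = re.compile(r"^BootOrder:[ \t]*(.+)$", re.MULTILINE)
# Entry lines look like "Boot0001* ubuntu"; the "*" marks an active entry.
ENTRY = re.compile(r"^Boot([0-9A-Fa-f]{4})\*?[ \t]+(.*)$", re.MULTILINE)


def boot_labels_in_order(text):
    """Return boot entry labels in the order given by the BootOrder line."""
    m = ORDER.search(text)
    if not m:
        return []
    labels = {num.upper(): label.strip() for num, label in ENTRY.findall(text)}
    return [labels.get(n.strip().upper(), "<unknown>") for n in m.group(1).split(",")]


# Hypothetical efibootmgr output, for illustration only
sample = """BootCurrent: 0001
Timeout: 1 seconds
BootOrder: 0000,0001
Boot0000* UEFI: Built-in EFI Shell
Boot0001* ubuntu
"""
print(boot_labels_in_order(sample))
```

In this made-up sample the EFI shell comes before the OS entry, which is the situation described above where the SSD has fallen to the bottom of the list.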