mchehab / rasdaemon

Rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors using the EDAC tracing events. EDAC is a Linux kernel subsystem which handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures, such as arm, also exist.
GNU General Public License v2.0

rasdaemon: ras-mc-ctl --error-count random sorts output #145

Open walterav1984 opened 8 months ago

walterav1984 commented 8 months ago

I noticed a while ago that ras-mc-ctl --error-count lists the DIMM numbers and channel/riser locations in a random order, and I have verified that git master (f9cb13b, 2024-02-05) shows the same behavior. I'm not sure whether it's a bug or a feature, but it happens on all tested machines (DQ57TM and MacPro 1,1, 2,1 and 3,1) running the repository version 0.68 from Debian 12 / Ubuntu 23.10 as well as f9cb13b.

It's independent of whether labels are used and/or registered.

$ sudo ras-mc-ctl --error-count #1
Label       CE  UE
DIMM2_RA    0   0
DIMM2_RB    0   0
DIMM1_RB    0   0
DIMM1_RA    0   0
DIMM4_RA    0   0
DIMM4_RB    0   0
DIMM3_RB    0   0
DIMM3_RA    0   0
$ sudo ras-mc-ctl --error-count #2
Label       CE  UE
DIMM1_RB    0   0
DIMM4_RA    0   0
DIMM1_RA    0   0
DIMM4_RB    0   0
DIMM3_RA    0   0
DIMM2_RB    0   0
DIMM3_RB    0   0
DIMM2_RA    0   0
$ sudo ras-mc-ctl --error-count #3
Label       CE  UE
DIMM3_RA    0   0
DIMM1_RB    0   0
DIMM2_RB    0   0
DIMM4_RA    0   0
DIMM1_RA    0   0
DIMM3_RB    0   0
DIMM4_RB    0   0
DIMM2_RA    0   0
$ sudo ras-mc-ctl --error-count #4
Label       CE  UE
DIMM2_RA    0   0
DIMM4_RB    0   0
DIMM3_RA    0   0
DIMM1_RB    0   0
DIMM3_RB    0   0
DIMM1_RA    0   0
DIMM2_RB    0   0
DIMM4_RA    0   0

$ sudo ras-mc-ctl --error-count | sort
DIMM1_RA    0   0
DIMM1_RB    0   0
DIMM2_RA    0   0
DIMM2_RB    0   0
DIMM3_RA    0   0
DIMM3_RB    0   0
DIMM4_RA    0   0
DIMM4_RB    0   0
Label       CE  UE
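
Piping through plain sort gives a stable order, but the "Label CE UE" header line sinks to the bottom because it is sorted along with the data. As a possible workaround, here is a minimal sketch of a wrapper that prints the first line unchanged and sorts only the rest. The function name sort_keep_header is made up for this example and is not part of rasdaemon.

```shell
# Hypothetical helper (not part of rasdaemon): keep the first input line
# (the header) in place and sort every following line.
sort_keep_header() {
    # Read exactly one line from stdin; on empty input there is nothing to do.
    IFS= read -r header || return 0
    printf '%s\n' "$header"
    # sort consumes whatever is left on stdin after the header.
    sort
}

# Intended use (assumes ras-mc-ctl prints a single header line first):
#   sudo ras-mc-ctl --error-count | sort_keep_header
```

This only tidies up the output on the caller's side; it does not change the order in which ras-mc-ctl itself emits the rows.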

$ sudo ras-mc-ctl --error-count #1
Label                       CE  UE
mc#0branch#0channel#1slot#0 0   0
mc#0branch#1channel#0slot#0 0   0
mc#0branch#0channel#0slot#1 0   0
mc#0branch#1channel#1slot#1 0   0
mc#0branch#1channel#1slot#0 0   0
mc#0branch#0channel#0slot#0 0   0
mc#0branch#1channel#0slot#1 0   0
mc#0branch#0channel#1slot#1 0   0

$ sudo ras-mc-ctl --error-count #2
Label                       CE  UE
mc#0branch#1channel#0slot#1 0   0
mc#0branch#0channel#0slot#0 0   0
mc#0branch#0channel#0slot#1 0   0
mc#0branch#1channel#0slot#0 0   0
mc#0branch#1channel#1slot#1 0   0
mc#0branch#0channel#1slot#0 0   0
mc#0branch#0channel#1slot#1 0   0
mc#0branch#1channel#1slot#0 0   0

$ sudo ras-mc-ctl --error-count #3
Label                       CE  UE
mc#0branch#0channel#1slot#1 0   0
mc#0branch#1channel#0slot#0 0   0
mc#0branch#0channel#0slot#1 0   0
mc#0branch#1channel#1slot#0 0   0
mc#0branch#1channel#1slot#1 0   0
mc#0branch#0channel#0slot#0 0   0
mc#0branch#1channel#0slot#1 0   0
mc#0branch#0channel#1slot#0 0   0

$ sudo ras-mc-ctl --error-count #4
Label                       CE  UE
mc#0branch#0channel#0slot#0 0   0
mc#0branch#1channel#0slot#0 0   0
mc#0branch#1channel#1slot#0 0   0
mc#0branch#0channel#1slot#0 0   0
mc#0branch#1channel#0slot#1 0   0
mc#0branch#0channel#0slot#1 0   0
mc#0branch#0channel#1slot#1 0   0
mc#0branch#1channel#1slot#1 0   0

By comparison, --guess-labels and --print-labels each use their own distinct but fixed ordering.

$ sudo ras-mc-ctl --guess-labels
memory stick 'DIMM 1' is located at 'DIMM Riser A'
memory stick 'DIMM 2' is located at 'DIMM Riser A'
memory stick 'DIMM 1' is located at 'DIMM Riser B'
memory stick 'DIMM 2' is located at 'DIMM Riser B'
memory stick 'DIMM 3' is located at 'DIMM Riser A'
memory stick 'DIMM 4' is located at 'DIMM Riser A'
memory stick 'DIMM 3' is located at 'DIMM Riser B'
memory stick 'DIMM 4' is located at 'DIMM Riser B'

$ sudo ras-mc-ctl --print-labels #edited labels but not registered yet
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS      
mc0 branch 0 channel 0 slot 0       DIMM1_RA             mc#0branch#0channel#0slot#0
mc0 branch 0 channel 0 slot 1       DIMM3_RA             mc#0branch#0channel#0slot#1
mc0 branch 0 channel 1 slot 0       DIMM2_RA             mc#0branch#0channel#1slot#0
mc0 branch 0 channel 1 slot 1       DIMM4_RA             mc#0branch#0channel#1slot#1
mc0 branch 1 channel 0 slot 0       DIMM1_RB             mc#0branch#1channel#0slot#0
mc0 branch 1 channel 0 slot 1       DIMM3_RB             mc#0branch#1channel#0slot#1
mc0 branch 1 channel 1 slot 0       DIMM2_RB             mc#0branch#1channel#1slot#0
mc0 branch 1 channel 1 slot 1       DIMM4_RB             mc#0branch#1channel#1slot#1

$ sudo ras-mc-ctl --register-labels
$ sudo ras-mc-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS      
mc0 branch 0 channel 0 slot 0       DIMM1_RA             DIMM1_RA            
mc0 branch 0 channel 0 slot 1       DIMM3_RA             DIMM3_RA            
mc0 branch 0 channel 1 slot 0       DIMM2_RA             DIMM2_RA            
mc0 branch 0 channel 1 slot 1       DIMM4_RA             DIMM4_RA            
mc0 branch 1 channel 0 slot 0       DIMM1_RB             DIMM1_RB            
mc0 branch 1 channel 0 slot 1       DIMM3_RB             DIMM3_RB            
mc0 branch 1 channel 1 slot 0       DIMM2_RB             DIMM2_RB            
mc0 branch 1 channel 1 slot 1       DIMM4_RB             DIMM4_RB
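
To check whether a given build still shows the varying order without comparing listings by eye, something like the following sketch could be used. It runs a command several times and reports whether the output is identical on every run. The helper name output_is_stable is made up for this example, and the check cannot tell a reordering apart from a real change in the counters, so it is only meaningful while the CE/UE values stay constant.

```shell
# Hypothetical helper (not part of rasdaemon).
# Usage: output_is_stable RUNS COMMAND [ARGS...]
# Returns 0 if every run printed the same text, 1 if any run differed,
# and 2 if the command itself failed.
output_is_stable() {
    runs=$1
    shift
    first=$("$@") || return 2
    i=1
    while [ "$i" -lt "$runs" ]; do
        current=$("$@") || return 2
        # Any difference counts: a changed row order or changed values.
        [ "$current" = "$first" ] || return 1
        i=$((i + 1))
    done
    return 0
}

# Intended use:
#   if output_is_stable 4 sudo ras-mc-ctl --error-count; then
#       echo "order is stable"
#   else
#       echo "order varies between runs (or the command failed)"
#   fi
```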