mchehab / rasdaemon

Rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors, using the EDAC tracing events. EDAC is a Linux kernel subsystem which handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures like arm also exist.
GNU General Public License v2.0

Rasdaemon 0.6.6 version not logging the trace events from the kernel tracepoints #159

Open prithivi17 opened 7 months ago

prithivi17 commented 7 months ago

I have been working with rasdaemon lately and researching the whole flow of how it works, from kernel space to user space. Since I'm using Debian 11.7, the rasdaemon version available in the OS repo is 0.6.6, which seems to be broken: it doesn't capture the trace events for hardware errors, even though those events are available at the kernel tracepoints. So I removed the repo version of rasdaemon, downloaded the 0.8.0 source, and compiled it on my server. Now rasdaemon works fine without any issue, and I found that 0.8.0 uses libtraceevent to read the traces from the tracepoints, whereas 0.6.6 uses its own bundled libtrace headers.

The part I can't understand is this: when I uninstall 0.8.0 and reinstall the old 0.6.6 repo version, it works! But when I reboot the server, it goes back to the state where it doesn't work. Can someone please explain this behaviour? Does it cache the 0.8.0 functionality in memory or something like that? And is there any fix for rasdaemon 0.6.6 not working as expected?

Sinzunza commented 6 months ago

Hi, I'm having a similar issue. A few questions if you don't mind. I'm on Debian 12 and can't get Rasdaemon to report MCE errors. Trying the Debian version 0.8.0-1 still doesn't work.

Did you do any additional configuration to get Rasdaemon working? ... Tracing configuration? ... Linux kernel configuration? You mention that all Debian versions don't work; does that include 0.8.0-1?

tai271828 commented 5 months ago

There is a newer Debian package available. You may want to give it a try and see whether you can still reproduce the issue.

prithivi17 commented 5 months ago

Hi, actually the point here is that the issue is not related to the Debian repo version of rasdaemon. The point is that rasdaemon 0.6.6 itself is not capturing the tracepoint events. Let me describe all the tests I have performed below.

md5sum: 38404619a748b581529095a5a586e289 rasdaemon-0.6.6.zip (this is the rasdaemon-0.6.6 source that I downloaded from the GitHub repository).

After I compiled the 0.6.6 version, I started rasdaemon in the foreground with recording enabled, as below: [image]

After that, I injected an error with the EDAC fake inject: [image]

But no error was captured by the 0.6.6 rasdaemon that I had started in the foreground.

Then I removed the compiled 0.6.6 version, installed the latest 0.8.0 version, and tried the same: [image]

Now you can see that the MC error events are being captured.

As per my analysis, I confirmed that the errors reach the tracepoints in kernel space, but rasdaemon 0.6.6 didn't capture the events from the tracepoints. As you can clearly see in the screenshot below, no error is recorded in the ras-mc-ctl table: [image]
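For anyone who wants to repeat the kernel-side part of this check, here is a minimal sketch rather than a definitive procedure. It assumes tracefs is reachable at /sys/kernel/debug/tracing (some systems mount it at /sys/kernel/tracing) and that the tracepoint of interest is ras:mc_event. The function name `mc_event_state` is made up for illustration.

```shell
#!/bin/sh
# Report whether the ras:mc_event tracepoint is enabled under a tracefs root.
# ASSUMPTION: the default root below; pass another root as the first argument
# if tracefs is mounted elsewhere on your system.
mc_event_state() {
    root="${1:-/sys/kernel/debug/tracing}"
    enable_file="$root/events/ras/mc_event/enable"

    # The file is absent if the tracepoint is not built in, or unreadable
    # without root privileges.
    if [ ! -r "$enable_file" ]; then
        echo "missing"
        return 1
    fi

    # The kernel exposes the per-event state as 0 or 1.
    case "$(cat "$enable_file")" in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "unknown" ;;
    esac
}

# Example (run as root on a real system):
#   mc_event_state /sys/kernel/tracing
```

A result of "disabled" or "missing" while the daemon is running would suggest looking at the tracing setup before looking at the daemon's database.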

From my analysis of the commits, it seems that libtraceevent is responsible for reading the tracepoint events and lets rasdaemon capture them. In the rasdaemon 0.6.6 source it is bundled into the rasdaemon binary, but in rasdaemon 0.8.0 the code has been changed to use the shared library libtraceevent.so to read the trace events from the tracepoints.

As a workaround, I upgraded from 0.6.6 to the latest version available in the GitHub repository, i.e. rasdaemon 0.8.0. Now rasdaemon is working fine. I need to know whether there is a way to fix rasdaemon 0.6.6 so that it captures the tracepoint events. If you can point me to a fix, it would be very helpful.
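One way to see which of the two variants a given build is, sketched here under assumptions: feed the output of `ldd` for the rasdaemon binary into a small filter that looks for the shared library. The helper name `uses_shared_libtraceevent` and the binary path in the usage comment are hypothetical; adjust them to your installation.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the ldd-style listing on stdin mentions the
# shared libtraceevent library, and fails otherwise.
uses_shared_libtraceevent() {
    grep -q 'libtraceevent\.so'
}

# Usage on a real system (the binary path is an assumption):
#   ldd /usr/local/sbin/rasdaemon | uses_shared_libtraceevent \
#       && echo "links against shared libtraceevent" \
#       || echo "no shared libtraceevent (possibly bundled)"
```

This only shows how the binary was linked; it does not by itself explain why one version captures events and the other does not.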

Thanks in advance.

prithivi17 commented 5 months ago

> Hi, I'm having a similar issue. A few questions if you don't mind. I'm on Debian 12 and can't get Rasdaemon to report MCE errors. Trying the Debian version 0.8.0-1 still doesn't work.
>
> Did you do any additional configuration to get Rasdaemon working? ... Tracing configuration? ... Linux kernel configuration? You mention that all Debian versions don't work; does that include 0.8.0-1?

Get the latest rasdaemon source from GitHub (i.e. version 0.8.0) and follow the compilation steps below; it will work:

git clone https://github.com/mchehab/rasdaemon.git
rm -r /var/lib/rasdaemon/ras-mc_event.db
# necessary packages for compilation
apt-get install make gcc autoconf automake libtool libevent-dev tar libsqlite3-dev libdbd-sqlite3-perl libtraceevent-dev pkg-config

cd rasdaemon
autoreconf -vfi
./configure --enable-all --localstatedir=/var
make
make install