splunk / collectd-plugins

SAI x Collectd: https://docs.splunk.com/Documentation/InfraApp/latest/Admin/ManageAgents
13 stars 5 forks source link

processmon fails in case of short-living process disappears and is suspended for 2x interval then #6

Open pbiering opened 2 years ago

pbiering commented 2 years ago

Looks like shortliving processes are not proper catched by processmon

See log:

Dec 07 12:40:06 *** collectd[1844373]: Initialization complete, entering read-loop.
Dec 07 12:40:06 *** collectd[1844373]: processmon plugin: Error reading /proc/1844371/stat
Dec 07 12:40:06 *** collectd[1844373]: processmon plugin: Error reading /proc/1844372/stat
Dec 07 12:40:06 *** collectd[1844373]: read-function of plugin `processmon' failed. Will suspend it for 120.000 seconds.

A patch which catch this case is like following:

diff --git a/processmon.c.orig b/processmon.c
index 754d732..2092b88 100644
--- a/processmon.c.orig
+++ b/processmon.c
@@ -233,6 +233,12 @@ static int pm_read_stat(process_data_t *pd, char *process_state) {
   // read the proc stat file for pid.
   snprintf(fname, sizeof(fname), "/proc/%s/stat", pd->pid);

+  // check in advance whether it is still existing
+  if (access(fname, F_OK) != 0) {
+    WARNING("processmon plugin: Disappeared before reading %s", fname);
+    return 0;
+  };
+
   ssize_t status = read_file_contents(fname, buf, sizeof(buf) - 1);
   if (status <= 0) {
     ERROR("processmon plugin: Error reading %s", fname);

afterwards following appears in log, processmon is no longer suspended:

Dec 07 13:57:36 *** collectd[1847744]: processmon plugin: File disappeared before reading reading /proc/1847741/stat

neuhaus commented 1 year ago

Excellent, i came here looking for a fix for this issue. This patch has an issue with an uninitialized value for process_state but i hope this can be resolved. Perhaps like this?

diff --git a/src/processmon.c b/src/processmon.c
index 754d732..1727bdd 100644
--- a/src/processmon.c
+++ b/src/processmon.c
@@ -233,6 +233,12 @@ static int pm_read_stat(process_data_t *pd, char *process_state) {
   // read the proc stat file for pid.
   snprintf(fname, sizeof(fname), "/proc/%s/stat", pd->pid);

+  // check in advance whether or not it still exists
+  if (access(fname, F_OK) != 0) {
+    WARNING("processmon plugin: Disappeared before reading %s", fname);
+    return 0;
+  }
+
   ssize_t status = read_file_contents(fname, buf, sizeof(buf) - 1);
   if (status <= 0) {
     ERROR("processmon plugin: Error reading %s", fname);
@@ -510,7 +516,7 @@ static int pm_read(void) {
   int total_count = 0;

   // variables to keep counts of process states
-  char process_state;
+  char process_state = '-';
   int state_R = 0;
   int state_S = 0;
   int state_D = 0;
splunk-pwu commented 1 year ago

Sorry for the delay. We're looking into it.

neuhaus commented 1 year ago

Should processmon still be used? Your own documentation no longer mentions it.