glideinWMS / glideinwms

The glideinWMS Project
http://tinyurl.com/glideinwms
Apache License 2.0

Missing monitoring and glidefactoryclient ClassAd on Alma9 #338

Closed mambelli closed 1 year ago

mambelli commented 1 year ago

Describe the bug When running on EL9 (Alma Linux 9) there is no monitoring information (RRD files) on the Factory and there are no glidefactoryclient ClassAds. The other effect is that the Factory keeps submitting pilots, since it is not aware of the ones already submitted.

To Reproduce Install a Factory on Alma9 and run 3.10.3 RC1 (earlier versions are probably affected as well).

Expected behavior RRD files on the Factory and glidefactoryclient ClassAds should be there.

Screenshots NA

Info (please complete the following information): Stakeholders and components can be a comma-separated list or on multiple lines. If you add a new stakeholder or component not on the sample list, add it on its own line.

Additional context The Factory group log file contains information about a possible Python interpreter crash (memory "double free").
Bruno and Namratha have been troubleshooting this and have isolated it to a problem in the RRD binding invocation. rrdSupport.py can work either with the Python binding or by invoking the rrdtool command-line tools. Commenting out the binding import fixes the issue, but from previous experience that approach does not scale.
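
For context, here is a minimal sketch of the dual-mode pattern mentioned above: use the rrdtool Python binding when it imports, otherwise shell out to the rrdtool executable. The function name and structure are illustrative only, not the actual rrdSupport.py code:

# Illustrative sketch; the real rrdSupport.py is structured differently.
import shutil
import subprocess

try:
    import rrdtool  # Python binding to librrd; this is the path that crashes on Alma9
except ImportError:
    rrdtool = None  # commenting out the import above forces the command-line fallback


def fetch_rrd(path, cf="AVERAGE", resolution=300, start=None, end=None):
    """Fetch RRD data via the binding if available, else via the rrdtool executable."""
    cli_args = [path, cf, "-r", str(resolution)]
    if start is not None:
        cli_args += ["-s", str(start)]
    if end is not None:
        cli_args += ["-e", str(end)]
    if rrdtool is not None:
        return rrdtool.fetch(*cli_args)  # binding invocation (self.rrd_obj.fetch in rrdSupport.py)
    # Fallback: invoke the command-line tool and return its raw text output
    exe = shutil.which("rrdtool") or "rrdtool"
    return subprocess.run([exe, "fetch"] + cli_args,
                          capture_output=True, text=True, check=True).stdout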

UPDATE: Detailed notes have been added as a comment to this issue. See below.

namrathaurs commented 1 year ago

Adding detailed notes from the investigation:

As described in the issue, in addition to the missing glidefactoryclient ClassAd and the absence of monitoring information (RRD files) in /var/lib/gwms-factory/work-dir/monitor/entry_ITB_CE_EL9_SciToken, the following lines were found in /var/log/gwms-factory/server/factory/group_0.err.log:

[2023-09-12 16:44:32,714] DEBUG: glideFactoryEntryGroup:357: All children forked for glideFactoryEntry.check_and_perform_work terminated - took 0.07165884971618652 seconds. Loading post work state for the entry.
[2023-09-12 16:44:32,714] DEBUG: glideFactoryEntryGroup:368: No work found for entry ITB_CE_EL9_SciToken from any frontends
[2023-09-12 16:44:32,714] DEBUG: glideFactoryEntryGroup:378: Unable to process response from one or more children for check_and_perform_work. One or more forked processes may have failed and may not have client_stats updated
[2023-09-12 16:44:32,715] WARNING: glideFactoryEntryGroup:381: Unable to process response from one or more children for check_and_perform_work. One or more forked processes may have failed and may not have client_stats updated
[2023-09-12 16:44:32,715] DEBUG: glideFactoryEntryGroup:425: Group Work done: {}
[2023-09-12 16:44:32,826] DEBUG: cleanupSupport:45: Collected cleanup PID 449675
[2023-09-12 16:44:32,826] DEBUG: cleanupSupport:45: Collected cleanup PID 449677
[2023-09-12 16:45:32,596] WARNING: cleanupSupport:23: Earlier cleanup PIDs [449676, 449678] still exist; skipping this cycle
[2023-09-12 16:45:32,645] DEBUG: glideFactoryEntryGroup:309: Setting parallel_workers limit dynamically based on the available free memory
[2023-09-12 16:45:32,645] DEBUG: glideFactoryEntryGroup:315: Setting parallel_workers limit of 1
[2023-09-12 16:45:32,711] ERROR: fork:131: Re-raising exception during read: Ran out of input
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/glideinwms/lib/fork.py", line 127, in fetch_fork_result
    out = pickle.loads(rin)
EOFError: Ran out of input
[2023-09-12 16:45:32,712] WARNING: fork:290: Failed to extract info from child 'ITB_CE_EL9_SciToken': Exception during read probably due to worker failure, original exception and trace <class 'EOFError'>: Ran out of input
[2023-09-12 16:45:32,712] ERROR: fork:291: Failed to extract info from child 'ITB_CE_EL9_SciToken': Exception during read probably due to worker failure, original exception and trace <class 'EOFError'>: Ran out of input
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/glideinwms/lib/fork.py", line 127, in fetch_fork_result
    out = pickle.loads(rin)
EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/glideinwms/lib/fork.py", line 265, in fetch_ready_fork_result_list
    out = fetch_fork_result(fd, pid)
  File "/usr/lib/python3.9/site-packages/glideinwms/lib/fork.py", line 132, in fetch_fork_result
    raise FetchError(
  File "/usr/lib/python3.9/site-packages/glideinwms/lib/fork.py", line 127, in fetch_fork_result
    out = pickle.loads(rin)
glideinwms.lib.fork.FetchError: Exception during read probably due to worker failure, original exception and trace <class 'EOFError'>: Ran out of input
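
The "Ran out of input" EOFError above is what pickle.loads() raises on an empty byte string: the forked worker died (here, taken down by the interpreter crash in the rrdtool binding) before writing its pickled result back to the parent, so the parent read zero bytes from the pipe. A minimal, self-contained sketch of that failure mode (not the actual fork.py code):

import os
import pickle

# Parent/child pipe: the child is supposed to pickle its result into the pipe.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: simulate a hard crash (e.g. the interpreter dying on a double free)
    # before any result has been written to the pipe.
    os._exit(1)

os.close(w)
os.waitpid(pid, 0)
with os.fdopen(r, "rb") as fin:
    data = fin.read()        # empty: the child never wrote anything
try:
    out = pickle.loads(data)
except EOFError as e:
    print(f"EOFError: {e}")  # "Ran out of input", as in the Factory group log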

Additionally, /var/log/gwms-factory/server/factory/factory.err.log contained the following line:

...
[2023-09-12 17:21:18,714] WARNING: glideFactory:682: EntryGroup 0 STDERR: b'double free or corruption (out)\n'
...

These messages were being encountered for the first time and were unfamiliar to us, as this had not happened with the GWMS Factory and Frontend on SL7 machines. A quick search on the error messages turned up a Stack Overflow discussion explaining that they indicate the Python interpreter itself is crashing. We also compared the versions of the rrdtool package installed on SL7 and Alma9: SL7 had rrdtool-1.8.x from the OSG repo, while Alma9 had rrdtool-1.7.2 from the AppStream repo, so our initial suspect was the rrdtool version itself. A few in-depth debugging sessions were then conducted to drill down on the root cause of the overarching problem. We were able to isolate the problem to the RRD binding invocation, specifically the following line in rrdSupport.py:

return self.rrd_obj.fetch(*args)

Breakpoint stack (to reproduce): Starting the debugger in glideFactory.py, we added breakpoints as follows:

in glideFactoryEntryGroup.py, line
    work_done = glideFactoryEntry.check_and_perform_work(factory_in_downtime, entry, work)
in glideFactoryEntry.py, line
    entry.gflFactoryConfig.rrd_stats.getData("%s_%s" % sec_el, monitoringConfig=entry.monitoringConfig)
in glideFactoryMonitoring.py, line
    fetched_data = self.fetchData(rrd_file=rrd, pathway=self.base_dir + "/" + client, start=start, end=end, res=rrd_res) ...
in glideFactoryMonitoring.py, line
    fetched = baseRRDSupport.fetch_rrd(pathway + rrd_file, "AVERAGE", resolution=res, start=start, end=end)
in rrdSupport.py, line
    return self.rrd_obj.fetch(*args)
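
Collapsed, the chain above amounts to the monitoring code composing a path under the monitor directory and handing it, with CLI-style arguments, to the rrdtool binding. A hypothetical condensation (the client directory and RRD file name below are illustrative, not the actual GlideinWMS identifiers):

# Hypothetical condensation of the call chain above; names are illustrative.
import rrdtool

# glideFactoryMonitoring.py composes the path from the monitor base directory
# and a client subdirectory; the composed path can point to an .rrd file that
# does not exist yet.
base_dir = "/var/lib/gwms-factory/work-dir/monitor/entry_ITB_CE_EL9_SciToken"
rrd_path = base_dir + "/some_client_dir/Status_Attributes.rrd"

# rrdSupport.py ends in  return self.rrd_obj.fetch(*args),  where rrd_obj is the
# rrdtool binding and args are CLI-style strings.
data = rrdtool.fetch(rrd_path, "AVERAGE", "-r", "300",
                     "-s", "1693338300", "-e", "1693345500")
# On Alma9 with rrdtool 1.7.2, if rrd_path does not exist this call kills the
# interpreter ("double free or corruption") instead of raising an exception.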

rrdSupport.py can work either with the Python binding or by invoking the command-line tools. As a simple test, we tried commenting out the rrdtool binding import in rrdSupport.py. Doing so fixed the issue, but based on previous experience this solution does not scale. Next, as a troubleshooting step, a simple 3-step Python script was created to try to crash the Python interpreter independently, outside of the GWMS environment, to understand whether the problem was at the GlideinWMS framework level or in the rrdtool library itself. Here is the main part:

import rrdtool

# Fetch AVERAGE data from a non-existent RRD file; on Alma9 with rrdtool 1.7.2
# this call crashes the interpreter instead of raising an exception.
static_args = ['/tmp/not_existing_file', 'AVERAGE', '-r', '600', '-e', '1693345500', '-s', '1693338300']
result = None  # so the prints below still work if fetch raised
try:
    result = rrdtool.fetch(*static_args)
except Exception as e:
    print(f"Exception caught: {e}")
print(type(result))
print(result)

The full test script is attached to this comment and includes the variations of the tests conducted, along with some observations: rrdtool_crash_script.txt

The test script was run in both the SL7 and Alma9 Factory environments, since different rrdtool versions had been observed. On Alma9, where the rrdtool version was 1.7.2 and the RRD file did not exist, executing the code in the interpreter resulted in a "double free or corruption (out): Aborted (core dumped)" error, and running it as a script from the command line resulted in a "Segmentation fault (core dumped)" message. On SL7 with version 1.8.x, however, when the test script was run from the command line and the file was not found, the rrdtool library handled the missing file by throwing an exception saying "No such file or directory"; when the same script was run in the interpreter, it threw rrdtool.OperationalError. Note that we had never seen a Python crash with previous versions of EL and rrdtool.

Finally, given these observations, we decided to add a fix that simply checks whether the RRD file exists and calls the native rrdtool fetch method only when the file is present, thus never triggering the condition that now causes a segmentation fault of the Python interpreter.
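
A minimal sketch of that guard, assuming a wrapper around the binding call like the ones sketched earlier (illustrative; the actual patch may handle the missing-file case differently):

import os
import rrdtool

def safe_fetch_rrd(path, cf, resolution, start, end):
    # Call the rrdtool binding only when the file exists, so the non-existent-file
    # code path that crashes the interpreter on Alma9 is never reached.
    if not os.path.exists(path):
        raise FileNotFoundError(f"RRD file not found: {path}")  # the real fix may skip or return empty instead
    return rrdtool.fetch(path, cf, "-r", str(resolution), "-s", str(start), "-e", str(end))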

Namratha observed an instance on Alma9 where there was no crash of the Python interpreter even with the old version of the Factory code. This will require further investigation using the test script and the GWMS code.