Describe the bug
The UCSD factory machine locked up for an unknown reason (possibly a cooling issue in the room). Once the machine recovered, the monitor was not available. It turned out that some monitor cache files were empty, and the factory was not expecting that.
To Reproduce
Run the factory for a while and then make one of the ftstpk files empty, for example:
/var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
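The failure mode can also be shown outside the factory. The following is a minimal sketch, assuming only the standard library: it writes a small pickle, empties the file, and tries to load it again. The file name `example.ftstpk` and the helper `load_after_truncation` are hypothetical stand-ins, not part of GlideinWMS.

```python
import pickle
import tempfile
from pathlib import Path


def load_after_truncation(path):
    """Empty the file at `path`, try to unpickle it, and return the EOFError raised (or None)."""
    Path(path).write_bytes(b"")  # simulate the empty cache file left after the lock-up
    try:
        with open(path, "rb") as fo:
            pickle.load(fo)
    except EOFError as err:
        return err
    return None


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "example.ftstpk"  # hypothetical stand-in for a real cache file
    cache.write_bytes(pickle.dumps({"jobs": {}}))  # a valid, non-empty pickle
    err = load_after_truncation(cache)
    print(type(err).__name__, err)  # EOFError Ran out of input
```

This is the same `EOFError: Ran out of input` that appears at the top of the traceback under Additional context.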
Expected behavior
The factory should handle this corner case correctly, and the monitor should remain available.
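One possible way to handle the corner case is to treat an empty or unreadable cache file as if no cache existed. The sketch below illustrates that idea only. It is not the actual GlideinWMS fix, and `load_cache_or_default` is a hypothetical helper that does not exist in `condorLogParser.py` or `util.py`.

```python
import os
import pickle
import tempfile


def load_cache_or_default(fname, default=None):
    """Return the unpickled cache, or `default` if the file is missing, empty, or unreadable.

    Hypothetical helper for illustration; not part of GlideinWMS.
    """
    try:
        if os.path.getsize(fname) == 0:
            return default  # empty cache file: behave as if there were no cache
        with open(fname, "rb") as fo:
            return pickle.load(fo)
    except (OSError, EOFError, pickle.UnpicklingError):
        # Missing file, unreadable file, or truncated pickle: fall back to the default
        return default


with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.ftstpk")  # hypothetical file names
    valid = os.path.join(tmp, "valid.ftstpk")
    open(empty, "wb").close()  # zero-byte file, as in this report
    with open(valid, "wb") as fo:
        pickle.dump({"jobs": {}}, fo)
    print(load_cache_or_default(empty, default={}))  # {}
    print(load_cache_or_default(valid, default={}))  # {'jobs': {}}
```

With a fallback like this, the caller can rebuild the cache from the original log instead of raising `RuntimeError`. Whether discarding the cache silently is acceptable, or whether it should also be logged, is a design decision for the factory code.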
Info (please complete the following information):
Priority: low
Stakeholders: FactoryOps
Components: factory monitoring
Additional context
...
[2023-07-29 23:37:07,142] DEBUG: glideFactoryEntry:1058: Checking security credentials for client UCSD-CMS-Frontend.main
[2023-07-29 23:37:07,218] ERROR: glideFactoryEntry:1819: Could not read /var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 1386, in loadCache
data = util.file_pickle_load(fname)
File "/usr/lib/python3.6/site-packages/glideinwms/lib/util.py", line 306, in file_pickle_load
conditional_raise(mask_exceptions)
File "/usr/lib/python3.6/site-packages/glideinwms/lib/util.py", line 295, in file_pickle_load
data = pickle.load(fo)
EOFError: Ran out of input
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/glideinwms/factory/glideFactoryEntry.py", line 1817, in perform_work_v3
log_stats[credential_username + ":" + client_int_name].load()
File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 671, in load
obj.load()
File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 82, in load
return self.loadCache()
File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 104, in loadCache
self.data = loadCache(self.cachename)
File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 1388, in loadCache
raise RuntimeError("Could not read %s" % fname)
RuntimeError: Could not read /var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
[2023-07-29 23:38:34,834] DEBUG: glideFactoryEntry:1058: Checking security credentials for client UCSD-CMS-Frontend.main
...
This is similar to Issue #338, fixed in PR #339. It was also visible in the upgrades tested under EL9.
A protection was added before invoking the RRD libraries.
@mmascher To be tested and closed under 3.10.5.