glideinWMS / glideinwms

The glideinWMS Project
http://tinyurl.com/glideinwms
Apache License 2.0

Factory monitor not updating after factory lockup #324

Open mmascher opened 1 year ago

mmascher commented 1 year ago

Describe the bug: The UCSD factory machine locked up for an unknown reason (possibly a cooling issue in the room). Once the machine recovered, the monitor was not available. It turned out that some monitor cache files were empty, which the factory did not expect.

To Reproduce: Run the factory for a while, then make one of the .ftstpk files empty, for example: /var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
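The underlying failure can be shown without a running factory. The sketch below uses only the Python standard library: it creates a zero-byte file and unpickles it, which raises the same `EOFError` seen in the traceback. The file name `example.ftstpk` and both function names are made up for illustration and are not part of glideinWMS.

```python
import os
import pickle
import tempfile


def load_pickle(fname):
    """Load a pickled object from fname, with no special handling of empty files."""
    with open(fname, "rb") as fo:
        return pickle.load(fo)


def reproduce_empty_cache_error():
    """Create a zero-byte stand-in cache file and return the error raised on load."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Hypothetical file name; only the zero-byte content matters here.
        fname = os.path.join(tmpdir, "example.ftstpk")
        open(fname, "wb").close()  # create the file with zero bytes
        try:
            load_pickle(fname)
        except EOFError as err:
            return err
    return None  # reached only if no EOFError was raised


if __name__ == "__main__":
    print(repr(reproduce_empty_cache_error()))
```

On a real factory the equivalent step is truncating one of the existing cache files while the factory is running.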

Expected behavior: The factory should handle this corner case (an empty cache file) gracefully, and the monitor should remain available.
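One possible way to get that behavior is to treat an empty or truncated cache file the same as a missing one, so the caller rebuilds the cache instead of failing. This is only a sketch of the idea, not the fix that went into glideinWMS; `load_cache_or_default` is a hypothetical helper and does not exist in `condorLogParser.py` or `util.py`.

```python
import os
import pickle


def load_cache_or_default(fname, default=None):
    """Return the unpickled content of fname, or default if the file is
    missing, empty, or cut short.

    Hypothetical helper for illustration only.
    """
    try:
        if os.path.getsize(fname) == 0:
            # Zero-byte file, e.g. left behind by a machine lockup.
            return default
        with open(fname, "rb") as fo:
            return pickle.load(fo)
    except (OSError, EOFError, pickle.UnpicklingError):
        # Missing or unreadable file, or pickle data that ends early.
        return default
```

A caller could pass an empty dictionary as `default` and let the next parsing cycle repopulate the cache. Returning a default silently can hide real corruption, so logging a warning in the `except` branch would be a reasonable addition.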

Info (please complete the following information):

Additional context

...
[2023-07-29 23:37:07,142] DEBUG: glideFactoryEntry:1058: Checking security credentials for client UCSD-CMS-Frontend.main
[2023-07-29 23:37:07,218] ERROR: glideFactoryEntry:1819: Could not read /var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 1386, in loadCache
    data = util.file_pickle_load(fname)
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/util.py", line 306, in file_pickle_load
    conditional_raise(mask_exceptions)
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/util.py", line 295, in file_pickle_load
    data = pickle.load(fo)
EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/glideinwms/factory/glideFactoryEntry.py", line 1817, in perform_work_v3
    log_stats[credential_username + ":" + client_int_name].load()
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 671, in load
    obj.load()
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 82, in load
    return self.loadCache()
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 104, in loadCache
    self.data = loadCache(self.cachename)
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 1388, in loadCache
    raise RuntimeError("Could not read %s" % fname)
RuntimeError: Could not read /var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
[2023-07-29 23:38:34,834] DEBUG: glideFactoryEntry:1058: Checking security credentials for client UCSD-CMS-Frontend.main
...
mambelli commented 1 year ago

This is similar to Issue #338, fixed in PR #339. It was also visible in the upgrades tested under EL9. A protection was added before invoking the RRD libraries. @mmascher: please test under 3.10.5 and close.