glideinWMS / glideinwms

The glideinWMS Project
http://tinyurl.com/glideinwms
Apache License 2.0

CERN production factory crashes #362

Closed mmascher closed 7 months ago

mmascher commented 1 year ago

Describe the bug

The CERN production factory crashed twice in the past week. It appears to have been rotating the entry log files at the time. No alarm fired because the Python processes keep running.
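Since the processes stay alive, a process-based alarm cannot see this failure. One possible complement, sketched below, is to alarm on log freshness instead. This is only an illustration: `is_stale` and the 15-minute threshold are hypothetical and not part of glideinWMS, and the path in the usage note is a placeholder.

```python
import os
import time


def is_stale(path, max_age_seconds, now=None):
    """Return True if `path` is missing or has not been modified recently.

    Hypothetical helper, not part of glideinWMS.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable log file is treated as stale.
        return True
    return (now - mtime) > max_age_seconds
```

A monitoring probe could call `is_stale("/path/to/entry.log", 15 * 60)` periodically and raise an alarm when it returns `True`, even though the factory processes are still running.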

To Reproduce

Hard. It just happens from time to time. Maybe increase the rotation frequency of the logs and observe?

Screenshots

[2023-09-27 14:49:59,951] DEBUG: cleanupSupport:37: Forked cleanup PIDS [123125, 123126, 123127, 123128]
[2023-09-27 14:56:55,733] DEBUG: glideFactoryEntryGroup:308: Setting parallel_workers limit of 8
[2023-09-27 15:00:56,094] WARNING: glideFactoryEntryGroup:415: Error occurred while trying to find and do work.
[2023-09-27 15:00:56,095] ERROR: glideFactoryEntryGroup:416: Exception: 
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/glideinwms/factory/glideFactoryEntryGroup.py", line 412, in iterate_one
    do_advertize, factory_in_downtime, glideinDescript, frontendDescript, group_name, my_entries
  File "/usr/lib/python3.6/site-packages/glideinwms/factory/glideFactoryEntryGroup.py", line 344, in find_and_perform_work
    logSupport.roll_all_logs()
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/logSupport.py", line 289, in roll_all_logs
    handler.check_and_perform_rollover()
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/logSupport.py", line 283, in check_and_perform_rollover
    if self.shouldRollover(None, empty_record=True):
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/logSupport.py", line 186, in shouldRollover
    self.stream.seek(0, 2)  # due to non-posix-compliant Windows feature
ValueError: I/O operation on closed file.
[2023-09-27 15:00:56,215] DEBUG: glideFactoryEntryGroup:418: Group Work done: {}
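
The `ValueError` in the traceback is what Python raises when `seek` is called on a file object that has already been closed. The sketch below reproduces that error in isolation and shows one defensive pattern: check `stream.closed` and reopen before seeking. It is not the glideinWMS code or the eventual fix. `size_at_end` and `demo` are hypothetical names, and binary mode is used only to keep the byte count unambiguous.

```python
import os
import tempfile


def size_at_end(stream, path):
    """Seek to the end of `stream`, reopening `path` first if the stream is closed.

    Hypothetical helper for illustration; returns (stream, size_in_bytes).
    """
    if stream is None or stream.closed:
        stream = open(path, "ab")
    stream.seek(0, 2)  # 2 == os.SEEK_END
    return stream, stream.tell()


def demo():
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    stream = open(path, "ab")
    stream.write(b"hello\n")
    stream.close()  # simulate the stream being closed behind the handler's back

    try:
        stream.seek(0, 2)  # same call as in the traceback
        raised = False
    except ValueError:
        raised = True  # "I/O operation on closed file."

    stream, size = size_at_end(stream, path)  # the guarded version recovers
    stream.close()
    os.remove(path)
    return raised, size
```

Calling `demo()` shows that the unguarded `seek` raises while the guarded helper reopens the file and reports its size. Whether reopening is the right recovery for the factory depends on why the stream was closed in the first place, which the log above does not show.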


mmascher commented 11 months ago

This happened again on Tuesday November 2nd. It had been a while since the last occurrence.

mambelli commented 7 months ago

#389 makes logging more robust and provides more troubleshooting information. This issue can be closed; a new one will be opened if this happens again.