Closed by mseitzer 7 months ago
For me it seems to be working. Here is the log from one of my test runs, which looks good to me (showing only entries for job 0, for readability):
2024-04-24 15:56:35,733 - cluster_utils - INFO - Job with id 0 submitted with cluster id 232808
2024-04-24 15:56:44,337 - cluster_utils - INFO - Job 0 started on hostname galvani-cn001.sdn
2024-04-24 16:00:14,869 - cluster_utils - INFO - Job 0 exited to be resumed.
2024-04-24 16:00:14,921 - cluster_utils - INFO - Job with id 0 re-submitted with cluster id 232834
2024-04-24 16:00:44,459 - cluster_utils - INFO - Job 0 started on hostname galvani-cn001.sdn
2024-04-24 16:04:14,846 - cluster_utils - INFO - Job 0 exited to be resumed.
2024-04-24 16:04:14,915 - cluster_utils - INFO - Job with id 0 re-submitted with cluster id 232862
2024-04-24 16:04:45,600 - cluster_utils - INFO - Job 0 started on hostname galvani-cn001.sdn
2024-04-24 16:07:46,264 - cluster_utils - INFO - Job 0 sent results.
2024-04-24 16:07:46,265 - cluster_utils - INFO - Job 0 finished successfully.
Is the behaviour always the same for you? If yes, are you aware of any settings that affect the logging?
I will need to monitor this and see if it comes up more often. My hunch is that it only occurs with longer runs (>12h) and depends on external events, e.g. issues with the storage system. I am not aware of any relevant settings.
From looking at Python's logging implementation, this seems plausible: the file stream is opened on initialization and kept open for the lifetime of the handler. If the underlying file becomes unavailable, we silently stop logging to it, because such exceptions are caught by the logging framework, and the handler never attempts to reopen the file.
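For illustration, a self-contained sketch of that standard-library behaviour (not cluster_utils code; the file path and logger name are made up, and closing the stream just simulates the file becoming unavailable):

```python
import logging

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("/tmp/demo.log")  # stream is opened here and kept open
logger.addHandler(handler)

logger.info("this record reaches the file")

# Simulate the file becoming unavailable (stand-in for a filesystem hiccup)
# by closing the underlying stream.
handler.stream.close()

# The write error inside StreamHandler.emit() is caught and routed to
# Handler.handleError(), which at most prints a notice to stderr and never
# reopens the file, so this and all later records are lost.
logging.raiseExceptions = False  # suppress even the stderr notice
logger.info("this record is silently dropped")
```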
I think using WatchedFileHandler instead of the basic FileHandler would resolve this for us: it stats the file before every write and reopens the stream if the file has been removed or replaced.
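A minimal sketch of what that could look like (the logger name, level, and format string are guesses based on the log excerpt above, not cluster_utils' actual configuration; only the cluster_run.log file name comes from this issue):

```python
import logging
from logging.handlers import WatchedFileHandler

logger = logging.getLogger("cluster_utils")
logger.setLevel(logging.INFO)

# WatchedFileHandler stats the file before every emit and reopens it if it
# has been removed or replaced (different device/inode), so logging recovers
# once the file is reachable again.
handler = WatchedFileHandler("cluster_run.log")
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
)
logger.addHandler(handler)
```

The main cost compared to the current handler is one extra os.stat() call per log record, which should be negligible at cluster_utils' logging volume.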
I have some jobs running on Slurm with automatic restarting. Restarting works, indicating that the server correctly re-enqueued the jobs.
Observed Behavior: on Galvani, no messages appear in the cluster_run.log after the initial "Job started" messages.
Expected Behavior: log messages like "sent intermediate results", "exited for resume", "job started" appear in the log.
Possible problems: some issue with the logger not flushing? A problem with the log file being closed, maybe because of intermittent filesystem issues? Need to look into how exactly the Python logger works.