Closed · albertz closed this issue 9 months ago
This seems really strange to me. You see in the log:
```
Submitted with job_id: ['3767058'] ...
Submitted with job_id: ['3767268'] ...
...
More then one matching SLURM task ... matches: [('3767058', 'PENDING'), ('3767268', 'PENDING')]
```
The "Submitted with job_id" message is correct, it is indeed submitted.
Maybe this is some multi-threading bug? I see that `submit_call` is called from multiple threads.
Very strange, especially since the two submissions are nearly an hour apart, which basically rules out caching effects. It looks more like sisyphus did not find the job when checking whether it was already submitted, but for some reason found it when checking the current status. Which is strange...
I got this again:
```
...
[2024-01-09 16:52:19,402] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/espnet/v6-24gb-bs30k-wd1e_6-lrlin1e_5_587k-EBranchformer/train work/i6_core/returnn/training/ReturnnTrainingJob.AituPEamOqTq>
[2024-01-09 16:52:19,402] INFO: error(52) interrupted_resumable(1) queue(4) running(6) waiting(838)
[2024-01-09 16:52:22,662] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.AituPEamOqTq run [1]
[2024-01-09 16:52:22,678] INFO: Submitted with job_id: ['4057250'] i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run
...
[2024-01-09 19:58:31,189] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_20k/train work/i6_core/returnn/training/ReturnnTrainingJob.1Bar8z0wWtjq>
[2024-01-09 19:58:31,189] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_50k/train work/i6_core/returnn/training/ReturnnTrainingJob.por8G6TNlO9E>
[2024-01-09 19:58:31,189] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_600k/train work/i6_core/returnn/training/ReturnnTrainingJob.VAf79gjGIxyn>
[2024-01-09 19:58:31,189] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-warmup100k/train work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ>
[2024-01-09 19:58:31,189] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/espnet/v6-24gb-bs30k-wd1e_6-lrlin1e_5_587k-EBranchformer/train work/i6_core/returnn/training/ReturnnTrainingJob.AituPEamOqTq>
[2024-01-09 19:58:31,189] INFO: error(52) interrupted_resumable(5) running(6) waiting(838)
[2024-01-09 19:59:03,386] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.AituPEamOqTq run [1]
[2024-01-09 19:59:03,402] INFO: Submitted with job_id: ['4059231'] i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run
[2024-01-09 19:59:03,608] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run', 1) > matches: [('4057250', 'PENDING'), ('4059231', 'PENDING')]
[2024-01-09 19:59:04,184] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31 Call: ./sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py
...
[2024-01-09 19:59:04,186] INFO: queue: Job<alias/exp2023_04_25_rf/espnet/v6-24gb-bs30k-wd1e_6-lrlin1e_5_587k-EBranchformer/train work/i6_core/returnn/training/ReturnnTrainingJob.AituPEamOqTq>
[2024-01-09 19:59:04,211] INFO: running: Job<alias/exp2023_04_25_rf/aed/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_2-lrlin1e_5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.4Y1gqDiY2DHe> {ep 365/500}
...
[2024-01-09 19:59:04,306] INFO: error(52) queue(5) running(6) waiting(838)
[2024-01-09 19:59:14,683] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run', 1) > matches: [('4057250', 'PENDING'), ('4059231', 'PENDING')]
[2024-01-09 19:59:25,633] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run', 1) > matches: [('4057250', 'PENDING'), ('4059231', 'PENDING')]
[2024-01-09 19:59:38,399] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run', 1) > matches: [('4057250', 'PENDING'), ('4059231', 'PENDING')]
[2024-01-09 19:59:49,348] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run', 1) > matches: [('4057250', 'PENDING'), ('4059231', 'PENDING')]
...
```
So it looks a bit like it incorrectly identified the job as `interrupted_resumable`?
I wondered how it can be in the state `interrupted_resumable`. In `Task._get_state_helper`, `self.finished(task_id)` returns False and `self.error(task_id)` returns False, so it gets to `engine_state = engine.task_state(self, task_id)`, and then:
```python
if engine_state == gs.STATE_UNKNOWN:
    if self.started(task_id):
        # check again if it finished or crashed while retrieving the state
        if self.finished(task_id):
            return gs.STATE_FINISHED
        elif self.error(task_id):
            return gs.STATE_ERROR
        # job logging file got updated recently, assume job is still running.
        # used to avoid wrongly marking jobs as interrupted do to slow filesystem updates
        elif self.running(task_id):
            return gs.STATE_RUNNING
        history = [] if engine is None else engine.get_submit_history(self)
        if history and len(history[task_id]) > gs.MAX_SUBMIT_RETRIES:
            # More then three tries to run this task, something is wrong
            return gs.STATE_RETRY_ERROR
        else:
            # Task was started, but isn't running anymore => interrupted
            if self._resume is None:
                return gs.STATE_INTERRUPTED_NOT_RESUMABLE
            else:
                return gs.STATE_INTERRUPTED_RESUMABLE  # <-- going here
```
`self.running(task_id)` returns False if the job logging file has not been updated recently, which can happen at startup, where it sometimes hangs a bit.
So the main problem here is that `engine.task_state(self, task_id)` presumably returned `gs.STATE_UNKNOWN`.
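Putting the pieces together, the misclassification chain can be sketched as a toy function (all names here are stand-ins for illustration, not the sisyphus API): once the engine reports an unknown state for a started task whose log file has gone stale, a resumable task ends up as `interrupted_resumable`, and the manager resubmits it.

```python
# Toy walk-through of the branch quoted above, with plain strings instead
# of the gs.STATE_* constants.
def get_state(engine_state, started, finished, error, recently_logged, resumable):
    if engine_state != "unknown":
        return engine_state
    if started:
        if finished:
            return "finished"
        if error:
            return "error"
        if recently_logged:
            return "running"  # log file fresh, assume still running
        # started, no result, stale log -> counted as interrupted
        return "interrupted_resumable" if resumable else "interrupted_not_resumable"
    return "runnable"  # assumed default for this sketch

state = get_state("unknown", started=True, finished=False, error=False,
                  recently_logged=False, resumable=True)
assert state == "interrupted_resumable"  # -> manager submits a second job
```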
Ah, I think I found the problem. In `submit_helper`:

```python
logging.info("Submitted with job_id: %s %s" % (job_id, name))
for task_id in range(start_id, end_id, step_size):
    self._task_info_cache[(name, task_id)].append((job_id, "PD"))
```
But then in `task_state`, when it uses the `_task_info_cache`:

```python
state = qs[0][1]
if state in ["RUNNING", "COMPLETING"]:
    return STATE_RUNNING
elif state in ["PENDING", "CONFIGURING"]:
    return STATE_QUEUE
else:
    return STATE_UNKNOWN
```
I.e. `"PD"` is simply an unknown state. I guess it should be `"PENDING"`.
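The mismatch can be reproduced in isolation. This sketch mirrors the quoted `task_state` mapping (the `STATE_*` values are stand-ins for the real constants):

```python
# Stand-ins for the sisyphus state constants used in task_state.
STATE_RUNNING, STATE_QUEUE, STATE_UNKNOWN = "running", "queue", "unknown"

def map_slurm_state(state):
    """Mirror of the quoted task_state mapping for a SLURM state string."""
    if state in ["RUNNING", "COMPLETING"]:
        return STATE_RUNNING
    elif state in ["PENDING", "CONFIGURING"]:
        return STATE_QUEUE
    else:
        return STATE_UNKNOWN

# squeue reporting uses the long state names, but submit_helper cached the
# short code "PD", which falls through to STATE_UNKNOWN:
assert map_slurm_state("PENDING") == STATE_QUEUE
assert map_slurm_state("PD") == STATE_UNKNOWN
```

So caching `"PENDING"` instead of `"PD"` would make the freshly submitted task show up as queued, and the `interrupted_resumable` misclassification (and hence the duplicate submission) should not occur.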
This is the same message as in #156, but I think the bug here is different: from the log, there is no "Error to submit job", and the problem in #156 was already fixed by #157, which I have seen working as intended multiple times.
Maybe it is relevant that when I interrupted the manager here, I got the crash from #164.