sgopalan98 opened 2 months ago
Thanks for reporting this. I've noticed you are using version 22.10.6; any chance you could try the latest version?
Thanks for responding!
I am trying it with the latest version now and will report the findings. However, my understanding of the code snippet (which isn't much, to be honest) is that for any SLURM status other than PENDING, the job would be considered as started. I will confirm this with my run on the latest version and update you with the findings.
This seems to be the case for version 24.04.4 as well. I am attaching the .nextflow.log with TRACE enabled from the newer version.
For Job ID 4151, lines 14407, 14432, and 14498 in the nextflow.log say that it has NODE FAILURE in SLURM, but Nextflow picks it up as running.
The NF (NODE_FAIL) status is managed in the same way as F (FAILED). Both are mapped to QueueStatus.ERROR, which is considered started but not active. For all submitted and started tasks that are not active, Nextflow tries to read the exit code; if it doesn't exist, it waits until exitReadTimeout is reached. According to some comments in the code, this behavior is intended to avoid NFS issues.
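To make the behavior described above concrete, here is a simplified Python model of the decision logic; this is an illustrative sketch, not the actual Groovy implementation in AbstractGridExecutor, and the names (SLURM_STATE_MAP, has_started, is_active) are invented for this example.

```python
# Simplified sketch of the grid-executor status handling discussed above.
# NOT the real Nextflow code; names here are hypothetical.
from enum import Enum

class QueueStatus(Enum):
    PENDING = 1
    RUNNING = 2
    HOLD = 3
    ERROR = 4   # both F (FAILED) and NF (NODE_FAIL) end up here
    DONE = 5

# Hypothetical mapping of squeue state codes to a QueueStatus
SLURM_STATE_MAP = {
    'PD': QueueStatus.PENDING,
    'R':  QueueStatus.RUNNING,
    'F':  QueueStatus.ERROR,
    'NF': QueueStatus.ERROR,   # NODE_FAIL treated the same as FAILED
    'CD': QueueStatus.DONE,
}

def has_started(status: QueueStatus) -> bool:
    # Any state other than PENDING counts as "started"
    return status != QueueStatus.PENDING

def is_active(status: QueueStatus) -> bool:
    # Only running/held states count as "active"
    return status in (QueueStatus.RUNNING, QueueStatus.HOLD)

# A NODE_FAIL job is therefore "started" but not "active": the executor
# goes looking for the .exitcode file and, since the job never ran, waits
# until exitReadTimeout expires before failing the task.
status = SLURM_STATE_MAP['NF']
print(has_started(status), is_active(status))  # → True False
```

This is why the report never mentions the node failure: by the time the task errors out, the only observable symptom is the missing exit file.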
Thank you for looking into this! I understand that the job fails eventually, but the main problem is that there is no log message saying that the node failed and that this is why the process failed.
I see what is happening. The final error message includes a dump of the queue status, but the failing job doesn't appear in it because the Slurm squeue command stops printing old jobs after a certain time. I will check how this could be improved.
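For context on why the job is missing from the dump: squeue only reports jobs still tracked in the scheduler's queue, so a job that hit NODE_FAIL some time ago is simply absent. A small sketch of the resulting lookup (hypothetical variable names, mirroring the queue-status dump described above):

```python
# Sketch of the queue-status dump problem described above (hypothetical
# names). squeue has stopped listing the old NODE_FAIL job, so the dump
# used in the final error message cannot show why it failed.
queue_status = {
    '4150': 'RUNNING',
    '4152': 'PENDING',
    # '4151' (NODE_FAIL) is absent: squeue no longer reports it
}

job_id = '4151'
state = queue_status.get(job_id)
if state is None:
    # All the dump can say is that the job vanished from the queue;
    # the NODE_FAIL reason never reaches the error message.
    print(f"job {job_id} not found in queue status")
```

One possible improvement, assuming Slurm accounting is enabled on the cluster, would be to fall back to `sacct -j <jobid> -o State`, which can still report NODE_FAIL for finished jobs.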
Bug report
Expected behavior and actual behavior
Situation: When SLURM jobs are submitted by Nextflow to a SLURM node, sometimes the node fails to start up. This causes the SLURM job to take the NF (NODE_FAIL) status.
Expected behaviour: Nextflow should report that the job failed because of a node failure.
Actual behaviour: Nextflow interprets the job as started (because the code does not handle this case) but then fails to find the .exitcode file and never marks the job as active (it will never reach the RUNNING state). So the job eventually throws an error because of exitReadTimeout.
Steps to reproduce the problem
Nextflow process configured to run on a node that fails to startup.
Program output
I don't have the output. But I have the nextflow.log file with TRACE enabled. Please look for Job ID 4078, lines 13076, 13101, 13126.
Environment
Additional context
I think the problem is in https://github.com/nextflow-io/nextflow/blob/a0f69025854d843e0e12bac651c86bc552642e76/modules/nextflow/src/main/groovy/nextflow/executor/AbstractGridExecutor.groovy#L371-L385.
This might be related to https://github.com/nextflow-io/nextflow/issues/4962, but I am not sure; I didn't read through the full logs.