nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

SLURM jobs being set as RUNNING when the actual state of the job is failed #5298

Open sgopalan98 opened 2 months ago

sgopalan98 commented 2 months ago

Bug report

Expected behavior and actual behavior

Situation: When SLURM jobs are submitted by Nextflow to a SLURM node, sometimes the node fails to start up. This causes the SLURM job to take the NF (Node Fail) status.

Expected behaviour: Nextflow should report that the job failed because of the node failure.

Actual behaviour: Because the code does not handle this case, Nextflow interprets the job as having started, but it never finds the .exitcode file and never marks the job as active (the job never reaches the RUNNING state). Eventually the task errors out with an exitReadTimeout.
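(For reference, the timeout I mean is the executor's exitReadTimeout setting. A minimal config sketch; the value shown is only the documented default as I understand it, not a recommendation:)

```groovy
// nextflow.config — minimal sketch only.
// exitReadTimeout is how long Nextflow keeps waiting for the .exitcode file
// of a job it believes has started before failing the task.
executor {
    exitReadTimeout = '270 sec'   // assumed default value, adjust as needed
}
```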

Steps to reproduce the problem

A Nextflow process configured to run on a node that fails to start up.

Program output

I don't have the console output, but I have the nextflow.log file with TRACE enabled. Please look for Job ID: 4078 (lines 13076, 13101, 13126).

Environment

Additional context

I think the problem is in https://github.com/nextflow-io/nextflow/blob/a0f69025854d843e0e12bac651c86bc552642e76/modules/nextflow/src/main/groovy/nextflow/executor/AbstractGridExecutor.groovy#L371-L385.

This might be related to https://github.com/nextflow-io/nextflow/issues/4962, but I am not sure; I didn't read through the full logs.

pditommaso commented 2 months ago

Thanks for reporting this. I've noticed you are using version 22.10.6, any chance you could try the latest version?

sgopalan98 commented 2 months ago

Thanks for responding!

I am trying it with the latest version now and will report the findings. However, my understanding of the code snippet (which isn't much, to be honest) is that any SLURM status other than PENDING causes the job to be considered started, as sketched below. I will confirm this with my run on the latest version and update you with the findings.
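To illustrate my reading (a rough, self-contained paraphrase, not the actual Nextflow source; the enum below is only a stand-in for the real QueueStatus):

```groovy
// Stand-in enum mirroring the queue states I understand Nextflow to use.
enum Status { PENDING, RUNNING, HOLD, ERROR, DONE }

// My reading of the linked logic: anything other than PENDING counts as
// "started", while only RUNNING counts as "active".
boolean started(Status s) { s != null && s != Status.PENDING }
boolean active(Status s)  { s == Status.RUNNING }

// A node-failed job (mapped to ERROR) is therefore "started" but never
// "active", so Nextflow keeps polling for an .exitcode that never appears.
assert started(Status.ERROR) && !active(Status.ERROR)
```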

sgopalan98 commented 2 months ago

This seems to be the case for version 24.04.4 as well. I am attaching the .nextflow.log with TRACE enabled from the newer version.

For Job Id: 4151, lines 14407, 14432, and 14498 in the nextflow.log show that the job has NODE FAILURE status in SLURM, but Nextflow still treats it as running.

jorgee commented 1 month ago

NF (NODE FAIL) is managed in the same way as F (FAILED). Both are considered a QueueStatus.ERROR that has started and is not active. For all submitted and started tasks that are not active, Nextflow tries to read the exit code; if it doesn't exist, it waits until exitReadTimeout is reached. According to some comments in the code, this behaviour is intended to avoid NFS issues.
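Roughly, the decoding works like this (an illustrative, self-contained sketch, not the real SlurmExecutor code): F and NF fall into the same ERROR bucket, so the node-failure information is no longer visible once the exit-code polling starts.

```groovy
// Illustrative sketch only — a stand-in for the real status decoding.
enum QStatus { PENDING, RUNNING, HOLD, ERROR, DONE }

def decode = [
    'PD': QStatus.PENDING,   // pending
    'R' : QStatus.RUNNING,   // running
    'CD': QStatus.DONE,      // completed
    'F' : QStatus.ERROR,     // failed
    'NF': QStatus.ERROR,     // node fail — indistinguishable from FAILED downstream
]

assert decode['NF'] == decode['F']
```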

sgopalan98 commented 1 month ago

Thank you for looking into this! I understand that the job fails eventually, but the main problem is that there is no log message saying that the node failed and that this is why the process failed.

jorgee commented 1 month ago

I see what is happening. The final error message includes a dump of the queue status, but the failing job doesn't appear in it because the SLURM squeue command stops listing old jobs after a certain time. I will check how it could be improved.