mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0
169 stars 51 forks source link

Bugfix in waitForFile where fn was full file path #274

Closed stuvet closed 2 years ago

stuvet commented 3 years ago

95f27ca

The bug corrected by 95f27ca was identified when waiting for a worker to be provisioned after submitting a jobs on a slurm cluster. It occurs when called from readLog when fs.latency > 0. While waiting for the log file to appear, fn is a full file path, so isn't found in the call to list.files without the full.names argument.

This solution handles fn as either a full file path or a file name, and provides the full file path in the timeout error message.

548129f

The first bugfix revealed that batchtools::getStatus was returning an incorrect 'expired' status for jobs during machine provisioning, which triggered future.batchtools::await to handle the 'expired' job & terminate early, described here. This was addressed in an inelegant 548129f. It would perhaps be better to update the log.file directly in the registry once the value is known, rather than adding it on the fly, but I don't know the full implications of doing this. I'm also not convinced that the specified timed.out is right - it won't terminate at the same time as the readLog -> waitForFile loop & it may not take into account queuing, but it seems to work well in my environment.

The new 'provisioning' status is strictly unnecessary, as preventing the 'expired' status would have the same effect on future.batchtools::await, but it feels more explicit.

I don't know about compatibility with other environments, but it seemed solid when tested across 500 jobs on a Slurm cluster starting at 0 nodes & limited at 20 nodes with a queue size of 50. Previously the same series of jobs simply wouldn't run unless nodes were persistent & pre-provisioned.

stuvet commented 2 years ago

PR closed as superceded by #276 & #277