The bug corrected by 95f27ca was identified while waiting for a worker to be provisioned after submitting jobs on a Slurm cluster. It occurs when waitForFile is called from readLog with fs.latency > 0. While waiting for the log file to appear, fn is a full file path, so it is never matched against the output of list.files, which returns bare file names when the full.names argument is not set.
This solution handles fn as either a full file path or a file name, and provides the full file path in the timeout error message.
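To illustrate the mismatch and one way of handling both forms of fn, here is a minimal sketch. It is not the code from 95f27ca or from batchtools: file_listed, wait_for_file_sketch and their arguments are hypothetical names chosen for this example, and the timeout and sleep values are arbitrary.

```r
# Hypothetical helpers for illustration only; not the batchtools implementation.

# TRUE if `fn` shows up in a directory listing, whether `fn` is a full path
# or a bare file name (in which case `path` should be supplied by the caller).
file_listed <- function(fn, path = dirname(fn)) {
  # list.files() returns bare names unless full.names = TRUE,
  # so compare against basename(fn) rather than fn itself.
  basename(fn) %in% list.files(path, all.files = TRUE)
}

# Poll until the file is listed; the error reports the full path.
wait_for_file_sketch <- function(fn, path = dirname(fn),
                                 timeout = 1, sleep = 0.1) {
  full <- file.path(path, basename(fn))
  deadline <- Sys.time() + timeout
  repeat {
    if (file_listed(fn, path)) return(TRUE)
    if (Sys.time() >= deadline) {
      stop(sprintf("Timeout while waiting for file '%s'", full))
    }
    Sys.sleep(sleep)
  }
}

# Demonstration in a temporary directory.
log_dir <- tempfile("logs")
dir.create(log_dir)
log_file <- file.path(log_dir, "job1.log")
file.create(log_file)

naive_match <- log_file %in% list.files(log_dir)     # full path vs bare names
full_path_ok <- file_listed(log_file)                # fn as a full path
bare_name_ok <- file_listed("job1.log", path = log_dir)  # fn as a file name
```

Here naive_match is FALSE because the full path is compared against bare file names, while both file_listed calls find the file.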
548129f
The first bugfix revealed that batchtools::getStatus was returning an incorrect 'expired' status for jobs during machine provisioning, which caused future.batchtools::await to handle the job as expired and terminate early, described here. This was addressed, somewhat inelegantly, in 548129f. It might be better to update log.file directly in the registry once its value is known, rather than adding it on the fly, but I don't know the full implications of doing so. I'm also not convinced that the specified timed.out is right: it won't terminate at the same time as the readLog -> waitForFile loop, and it may not take queuing into account, but it seems to work well in my environment.
The new 'provisioning' status is not strictly necessary, since preventing the 'expired' status would have the same effect on future.batchtools::await, but it feels more explicit.
I don't know about compatibility with other environments, but it seemed solid when tested across 500 jobs on a Slurm cluster starting at 0 nodes and capped at 20 nodes, with a queue size of 50. Previously, the same series of jobs simply wouldn't run unless nodes were persistent and pre-provisioned.