When checking if we need to add a loop-termination command to the commands of an action, we call `Workflow.get_iteration_final_run_IDs`, which in turn calls `Workflow.get_loop_map`. This then calls `Workflow.get_EARs_from_IDs` (which reads the runs metadata array) on all run IDs from that submission, which could be many thousands of runs for large workflows.
In principle, this shouldn't be a problem, because Zarr supports multiprocess reading. In practice, something seems to go wrong under high-concurrency scenarios (i.e. a large job array when the cluster has very good availability): we get random `RuntimeError`s from `numcodecs` during chunk decompression from this metadata array. These errors are guarded against using the `reretry` package. However, for tasks that should be quick, this retrying introduces a potentially lengthy delay to execution, especially for large workflows.
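The retry guard works along these lines (a minimal hand-rolled sketch rather than `reretry` itself; the parameters and the flaky reader are illustrative, not hpcflow's actual settings):

```python
import time

def retry(exceptions, tries=5, delay=0.01, backoff=2):
    # Minimal retry decorator, similar in spirit to reretry's `retry`.
    def decorator(func):
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(tries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries - 1:
                        raise
                    time.sleep(wait)  # each failed attempt adds latency
                    wait *= backoff
        return wrapper
    return decorator

attempts = {"n": 0}

@retry((RuntimeError,), tries=5, delay=0.0)
def read_chunk():
    # Hypothetical stand-in for a numcodecs decompression that fails
    # intermittently under high concurrency.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("error during chunk decompression")
    return b"run-metadata"

print(read_chunk().decode())  # run-metadata (succeeds on the third attempt)
```

The point is that each swallowed failure costs a sleep before the next attempt, so even when the read eventually succeeds, a task that should take milliseconds can stall for the accumulated backoff time.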
Additionally, reading the whole array is slow on Lustre file systems in general, because this array must be single-chunked (one chunk/file per run) to allow for multi-process writing during execution. So, ideally, we want to avoid reading most of (or all of) the array anyway.
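A rough illustration of the cost (the run count here is a made-up number, not a measurement):

```python
# One chunk (i.e. one file) per run lets concurrent jobs each write their
# own run's metadata without locking. The downside: scanning the whole
# array costs one filesystem open/read per run, which is expensive on
# Lustre, where each open hits the metadata server.
n_runs = 10_000               # hypothetical large workflow
chunk_len = 1                 # required for lock-free multi-process writes
reads_per_full_scan = n_runs // chunk_len
print(reads_per_full_scan)    # 10000 separate chunk reads to build the loop map
```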
Two steps to solve:
- [x] Fix for the case where the workflow has no loops. This is easy, and should just require wrapping some existing code in an `if` statement.
- [ ] Fix for the case where the workflow has loops.
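The first fix might look something like the following (a sketch only: the `loops` attribute, the method signature, and the helper name are assumptions about hpcflow's internals, not the actual code):

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    # Toy stand-in for hpcflow's Workflow; assume `loops` lists the
    # workflow's loop definitions (empty when there are none).
    loops: list = field(default_factory=list)

    def get_iteration_final_run_IDs(self, run_IDs):
        # Expensive path: in the real code this ends up reading the runs
        # metadata array via get_loop_map/get_EARs_from_IDs.
        raise NotImplementedError("expensive metadata-array read")

def needs_loop_termination(workflow, run_ID):
    # The fix: guard the expensive lookup so that loop-free workflows
    # never touch the runs metadata array at all.
    if not workflow.loops:
        return False
    return run_ID in workflow.get_iteration_final_run_IDs([run_ID])

print(needs_loop_termination(Workflow(), 42))  # False -- no metadata read
```

With the guard in place, the metadata array is only read when the workflow actually contains loops, which is the remaining (harder) case in the checklist above.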