Description:
When running a 'SlurmPipelineExecutor' pipeline on my HPC cluster, I run into a dependency error that causes the execution to fail. The problem shows up in the stats-collection step that runs after a stage completes (observed during initial testing).
The issue originates from the 'launch_slurm_job' function in 'src/datatrove/executor/slurm.py':
def launch_slurm_job(launch_file_contents, *args):
    """
    Small helper function to save a sbatch script and call it.
    Args:
        launch_file_contents: Contents of the sbatch script
        *args: any other arguments to pass to the sbatch command

    Returns: the id of the launched slurm job
    """
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
        return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]
The issue arises because my cluster's job submission output has the following form:
Submitted batch job <job_id> on cluster <cluster_name>
With this format, 'launch_slurm_job' returns the cluster name instead of the job ID, because it takes the last whitespace-separated token of the sbatch output. Subsequent stages that depend on the correct job ID then fail.
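To illustrate the mismatch, here is a small standalone snippet comparing the two output formats (the job id and cluster name are made up):

# Standard Slurm output: the job id is the last token, so split()[-1] works.
standard = "Submitted batch job 123456"
print(standard.split()[-1])  # -> "123456"

# Output on my cluster: the last token is the cluster name, not the job id.
custom = "Submitted batch job 123456 on cluster mycluster"
print(custom.split()[-1])  # -> "mycluster"
print(custom.split()[3])   # -> "123456", the 4th token is the actual job id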
Reproducibility:
1. Set up a pipeline using 'SlurmPipelineExecutor' (a minimal sketch follows these steps).
2. Run the pipeline on a cluster whose job submission output has the format described above.
3. Observe the dependency error and the failed execution.
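For reference, a minimal setup along these lines is enough to trigger the submission path; the reader, paths, and resource values below are illustrative, and the parameter names are taken from the datatrove API as I understand it:

from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.readers import JsonlReader

# Any pipeline will do; the bug is in how the sbatch output is parsed,
# not in the pipeline steps themselves.
executor = SlurmPipelineExecutor(
    pipeline=[JsonlReader("data/input/")],  # hypothetical input location
    tasks=4,
    time="01:00:00",
    partition="normal",        # replace with a partition on the affected cluster
    logging_dir="logs/repro",
    job_name="datatrove-repro",
)
executor.run()  # submits via sbatch and parses the job id from its output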
Proposed Solution:
To handle clusters with non-standard job submission output formats, I propose adding an argument or flag to the 'SlurmPipelineExecutor' class. This would allow users to specify the correct position of the job ID in the submission output.
In my case, I temporarily worked around the issue by hardcoding the function to take the 4th whitespace-separated token of the sbatch output (i.e. 'split()[3]') instead of the last one.
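A rough sketch of what that could look like, assuming a new 'job_id_position' parameter (the name is just a suggestion and not part of the current API; the executor would forward a user-supplied value to it):

import subprocess
import tempfile


def launch_slurm_job(launch_file_contents, *args, job_id_position: int = -1):
    """
    Save a sbatch script to a temporary file and submit it.

    job_id_position is the index of the whitespace-separated token in the
    sbatch output that holds the job id: -1 keeps the current behaviour,
    while 3 matches "Submitted batch job <job_id> on cluster <cluster_name>".
    """
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
        output = subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8")
        return output.split()[job_id_position]

Alternatively, extracting the id with a regular expression such as r"Submitted batch job (\d+)" would cover both formats without any new flag, at the cost of assuming that prefix is always present.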
Additional Information:
Environment: HPC cluster with custom Slurm job submission output
Version: Most recent version of datatrove installed from source.
Impact: This issue will affect users on systems with similar output formats.
Please let me know if further details or a pull request would be helpful.