huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Incorrect Job ID Extraction on Clusters with Custom Slurm Output #265

Open · StephenRebelSSC opened this issue 2 months ago

StephenRebelSSC commented 2 months ago

Description: When running a 'SlurmPipelineExecutor' pipeline on my HPC cluster, I encounter dependency errors that cause the execution to fail. The problem surfaces during the stats collection step that runs after a stage completes (observed during initial testing).

The issue originates from the 'launch_slurm_job' function in 'src/datatrove/executor/slurm.py':

import subprocess
import tempfile


def launch_slurm_job(launch_file_contents, *args):
    """
    Small helper function to save an sbatch script and call it.

    Args:
        launch_file_contents: Contents of the sbatch script
        *args: any other arguments to pass to the sbatch command

    Returns: the id of the launched slurm job
    """
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
        # assumes the job id is always the last whitespace-separated token
        # of sbatch's output, i.e. "Submitted batch job <job_id>"
        return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]

The issue arises because my cluster's job submission output is of the following form: Submitted batch job <job_id> on cluster <cluster_name>

In this format, the 'launch_slurm_job' function incorrectly reads the cluster name as the job ID, leading to errors in subsequent stages that depend on the correct job ID.
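
A minimal illustration of the failure mode (the job id and cluster name below are made up):

# standard sbatch output: the last token is indeed the job id
standard = "Submitted batch job 12345"
print(standard.split()[-1])  # "12345" -- correct

# output on a cluster configured to report its name
custom = "Submitted batch job 12345 on cluster mycluster"
print(custom.split()[-1])  # "mycluster" -- the cluster name, not the job id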

Reproducibility:

  1. Set up a pipeline using 'SlurmPipelineExecutor' (a minimal setup is sketched after this list).
  2. Run the pipeline on a cluster with a similar job submission output format as described above.
  3. Observe the dependency error and failed execution.
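
For step 1, a minimal setup might look like the sketch below; the reader, paths and Slurm parameters are placeholders, not the configuration I actually used:

from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.readers import JsonlReader

executor = SlurmPipelineExecutor(
    pipeline=[
        JsonlReader("data/input/"),  # placeholder input path
    ],
    job_name="repro_job_id_bug",
    tasks=4,
    time="01:00:00",
    partition="normal",       # replace with a partition on your cluster
    logging_dir="logs/repro",
)
executor.run()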

Proposed Solution: To handle clusters with non-standard job submission output formats, I propose adding an argument or flag to the 'SlurmPipelineExecutor' class. This would allow users to specify the correct position of the job ID in the submission output.
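
A rough sketch of what that could look like; the 'job_id_position' parameter name is hypothetical, not part of the current API:

def launch_slurm_job(launch_file_contents, *args, job_id_position=-1):
    """
    Save an sbatch script and call it.

    job_id_position: index of the job id in sbatch's whitespace-split
        output; -1 keeps the current behaviour (last token). This
        parameter is hypothetical, not part of the current API.
    """
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
        output = subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8")
        return output.split()[job_id_position]

'SlurmPipelineExecutor' would then expose a matching constructor argument and forward it down to this helper.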

In my case, I temporarily resolved the issue by hardcoding the function to extract the job ID from the 4th whitespace-separated token of the sbatch output.
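
That is, a sketch of the hardcoded version (index 3 is the 4th token of 'Submitted batch job <job_id> on cluster <cluster_name>'; this is brittle on clusters with the standard output format):

def launch_slurm_job(launch_file_contents, *args):
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
        # hardcoded for my cluster's output format: token index 3 is the job id
        return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[3]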

Additional Information:

Please let me know if further details or a pull request would be helpful.

guipenedo commented 2 months ago

Hi! Would you be willing to PR this change?

StephenRebelSSC commented 2 months ago

Yeah for sure.