huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

adding an automatic log file `tail` under slurm executor #246

Open stas00 opened 1 month ago

stas00 commented 1 month ago

When using a local executor, the running logs appear right away in the console it was launched from. But when using slurm, one has to fish for the log files.

This can be made easier by automatically printing:

print(f"tail -F {logging_dir}/slurm_logs/{first_slurm_job_id}_0.out")

first_slurm_job_id coming from:

    2024-07-10 01:38:05.605 | INFO     | datatrove.executor.slurm:launch_job:280 - Slurm job launched successfully with (last) id=109019.

though we want the first, not the last one here.


even fancier would be to run the tail on behalf of the user in the launcher - this way the local and slurm launching experiences would be identical.
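
something along these lines in the launcher perhaps (just a sketch, not actual datatrove code - it only re-uses the log path from the print above):

    import subprocess

    def tail_first_task_log(logging_dir: str, first_slurm_job_id: int) -> subprocess.Popen:
        # same file as in the print above: stdout of task 0 of the first job array
        log_file = f"{logging_dir}/slurm_logs/{first_slurm_job_id}_0.out"
        # `tail -F` keeps retrying until the file exists, so this can be started
        # right after sbatch returns, before the job has written anything
        return subprocess.Popen(["tail", "-F", log_file])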

But even printing the command to copy-n-paste would already be faster than manually fishing for the log file.


if this doesn't resonate as a feature, would it be possible to make run() return some attributes? e.g. the first slurm job id - then the user can easily code this feature themselves.
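
e.g. something like this on the user side (hypothetical - assuming run() returned the first job id, which it currently doesn't):

    # hypothetical: run() returning the id of the first launched slurm job
    first_slurm_job_id = dist_executor.run()
    print(f"tail -F {logging_dir}/slurm_logs/{first_slurm_job_id}_0.out")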

Thank you!


reading the code I see launch_slurm_job returns a job id which then gets stored as the executor's job_id, but that would only be correct if tasks < 1000, right? otherwise it'll be the id of the last job array and not the first one (since your log says ... (last) id=)?

stas00 commented 1 month ago

this seems to do the trick:

        dist_executor.run()

        print(f"*** Find the slurm logs under: {root_dir}/logs/slurm_processing/slurm_logs/ ")
        if dist_executor.job_id != -1:
            print(f"tail -F {root_dir}/logs/slurm_processing/slurm_logs/{dist_executor.job_id}_0.out")