mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

TorchRec DLRM No such file or directory: 'sbatch' #759

Closed: rvernica closed this issue 3 months ago

rvernica commented 3 months ago

The TorchRec DLRM README provides an example of using torchx remotely:

torchx run -s slurm dist.ddp -j 1x8 --script dlrm_main.py

This example fails with:

> torchx run -s slurm dist.ddp -j 1x8 --script dlrm_main.py
torchx 2024-08-05 11:53:42 INFO     Tracker configurations: {}
torchx 2024-08-05 11:53:42 INFO     Checking for changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm`...
torchx 2024-08-05 11:53:42 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-08-05 11:53:42 INFO     Reusing original image `ghcr.io/pytorch/torchx:0.7.0` for role[0]=dlrm_main. Either a patch was built or no changes to workspace was detected.
Traceback (most recent call last):
  File "/.local/bin/torchx", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/cli/main.py", line 118, in main
    run_main(get_sub_cmds(), argv)
  File "/.local/lib/python3.12/site-packages/torchx/cli/main.py", line 114, in run_main
    args.func(args)
  File "/.local/lib/python3.12/site-packages/torchx/cli/cmd_run.py", line 268, in run
    self._run(runner, args)
  File "/.local/lib/python3.12/site-packages/torchx/cli/cmd_run.py", line 228, in _run
    app_handle = runner.run_component(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/runner/api.py", line 200, in run_component
    handle = self.schedule(dryrun_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/runner/api.py", line 308, in schedule
    app_id = sched.schedule(dryrun_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/schedulers/slurm_scheduler.py", line 388, in schedule
    p = subprocess.run(req.cmd, stdout=subprocess.PIPE, check=True)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib64/python3.12/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'sbatch'

It appears that the sbatch executable, Slurm's job submission command, cannot be found on the PATH of the machine running torchx.

I'm using the latest revision of the master branch.

rvernica commented 3 months ago

Fixed by installing Slurm, which provides the sbatch command: sudo dnf install slurm
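
For anyone hitting the same traceback: the FileNotFoundError comes from subprocess.run being asked to start a program that is not on PATH. Below is a minimal sketch of a preflight check that could be run on the submitting host before torchx run -s slurm. The helper name missing_commands and the default command list are illustrative assumptions, not part of torchx or the DLRM reference code. Only sbatch is confirmed by the traceback above, so extend the tuple if your setup needs other Slurm client commands.

```python
import shutil

# Assumption: only "sbatch" is confirmed by the traceback in this issue.
# Add other Slurm client commands here if your workflow needs them.
REQUIRED_COMMANDS = ("sbatch",)


def missing_commands(commands=REQUIRED_COMMANDS):
    """Return the subset of `commands` that cannot be resolved on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
        print("Install the Slurm client tools on this host before using the slurm scheduler.")
    else:
        print("All required Slurm commands were found on PATH.")
```

If the check reports sbatch as missing, installing Slurm on the submitting host (as in the comment above) or adding its bin directory to PATH should let the scheduler get past this step.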