The TorchRec DLRM README provides an example of using torchx remotely:
torchx run -s slurm dist.ddp -j 1x8 --script dlrm_main.py
This example fails with:
> torchx run -s slurm dist.ddp -j 1x8 --script dlrm_main.py
torchx 2024-08-05 11:53:42 INFO Tracker configurations: {}
torchx 2024-08-05 11:53:42 INFO Checking for changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm`...
torchx 2024-08-05 11:53:42 INFO To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-08-05 11:53:42 INFO Reusing original image `ghcr.io/pytorch/torchx:0.7.0` for role[0]=dlrm_main. Either a patch was built or no changes to workspace was detected.
Traceback (most recent call last):
  File "/.local/bin/torchx", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/cli/main.py", line 118, in main
    run_main(get_sub_cmds(), argv)
  File "/.local/lib/python3.12/site-packages/torchx/cli/main.py", line 114, in run_main
    args.func(args)
  File "/.local/lib/python3.12/site-packages/torchx/cli/cmd_run.py", line 268, in run
    self._run(runner, args)
  File "/.local/lib/python3.12/site-packages/torchx/cli/cmd_run.py", line 228, in _run
    app_handle = runner.run_component(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/runner/api.py", line 200, in run_component
    handle = self.schedule(dryrun_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/runner/api.py", line 308, in schedule
    app_id = sched.schedule(dryrun_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.local/lib/python3.12/site-packages/torchx/schedulers/slurm_scheduler.py", line 388, in schedule
    p = subprocess.run(req.cmd, stdout=subprocess.PIPE, check=True)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib64/python3.12/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'sbatch'
It appears that the `sbatch` file is missing. I'm using the latest revision of the master branch.
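For anyone trying to reproduce this: the traceback ends in `subprocess.run(req.cmd, ...)` raising `FileNotFoundError` for `'sbatch'`, which is what Python raises when no executable by that name can be resolved on the current `PATH`. That would be consistent with running the command on a machine where the Slurm client tools are not installed or not on `PATH`, though I have not confirmed that this is the cause here. A minimal sketch of a pre-flight check, assuming only the Python standard library (the helper name `find_missing_tools` is hypothetical and not part of torchx):

```python
import shutil


def find_missing_tools(tools=("sbatch",)):
    """Return the names in `tools` that cannot be resolved on the current PATH.

    Hypothetical helper for illustration; torchx does not provide it.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        # Same condition that makes subprocess.run raise FileNotFoundError.
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("sbatch found at:", shutil.which("sbatch"))
```

If this reports `sbatch` as not found, the `torchx run -s slurm` command would be expected to fail in the same way until it is run from a node where `sbatch` is available.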