Open mannatsingh opened 2 years ago
adding this support for slurm wouldn't be too bad:
1) generalize the workspace file logic from docker_workspace (`.torchxignore`)
2) add a `job_dir` argument to allow specifying an isolation env
3) change launch code to `cp` + `chdir`
4) add some statefile (`.torchxjobdirs`) so `torchx log` knows where to find logs for slurm
Something like `job_dir` we could relatively easily extend to `local_cwd` and `local_docker` -- it's more complex for k8s/batch/ray.
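A rough sketch of what (2) and (3) could look like, purely as an illustration - `prepare_job_dir` is a made-up helper, not an existing TorchX API, and a real version would need to honor `.torchxignore` the way the docker workspace logic does:

```python
import os
import shutil

def prepare_job_dir(workspace: str, job_dir: str) -> str:
    """Copy the workspace into an isolation dir and return the new cwd.

    Hypothetical helper; the actual implementation would filter files
    via .torchxignore like the docker workspace logic does.
    """
    os.makedirs(job_dir, exist_ok=True)
    dst = os.path.join(job_dir, os.path.basename(workspace.rstrip("/")))
    shutil.copytree(workspace, dst, dirs_exist_ok=True)
    return dst

# The slurm launch path would then chdir into the copied workspace before
# submitting, and record the mapping in a statefile (e.g. .torchxjobdirs)
# so `torchx log` can later locate the per-node log files.
```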
For the heterogeneous jobs displaying differently, that's tricky in the current model. Macros like `replica_id` generally need to be applied on a per-worker basis. If we wrap the app in a runtime, that does allow us to materialize those later, though it adds an extra dependency. Slurm virtualenv/conda environments will have TorchX installed anyway in most cases, so that's not necessarily a blocker, but it changes the model from what we've had so far.
https://github.com/pytorch/torchx/blob/main/torchx/specs/api.py#L138
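For reference, a minimal sketch of how those macros show up in a role (assuming current torchx.specs; the script name and image path are placeholders):

```python
from torchx import specs

# `specs.macros.replica_id` is just the "${replica_id}" placeholder string;
# the runner substitutes it per replica (roughly macros.Values(...).apply(role))
# right before handing the role to the scheduler.
role = specs.Role(
    name="trainer",
    image="/path/to/workspace",  # placeholder
    entrypoint="python",
    args=["train.py", "--replica-id", specs.macros.replica_id],
    num_replicas=2,
)
```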
I did look, but it doesn't appear that sacct/squeue has a way to hide child jobs. You can use `torchx status`, so we could add a `torchx queue` method to render this better for all schedulers.
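For a single torchx-launched job, the status is also available programmatically; a minimal sketch, assuming the app handle printed by `torchx run` (the handle below is a placeholder):

```python
from torchx.runner import get_runner

# Handle format is "<scheduler>://<session>/<app_id>"; use the one torchx run printed.
app_handle = "slurm://torchx/1234"  # placeholder
print(get_runner().status(app_handle))
```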
I think it's hard to see us migrating to use `torchx status` - `squeue` gives the status of all jobs, which is what I normally check. torchx wouldn't even be aware of all the jobs being run (since they might have been queued outside of torchx). Even if it did, that's introducing a new workflow, which I'd imagine most people would want to avoid (unless it gave them some benefit).
re: The job logs are created in per node files
https://github.com/pytorch/torchx/pull/412 makes it so that when running with dist.ddp, the node stdout and stderr log lines are prefixed with the local_rank of the worker that produced that line. So you'd see something akin to this:
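Purely as a hypothetical illustration (the exact prefix format is whatever the PR implements):

```
[0]: ...log output from local_rank 0 on this node...
[1]: ...log output from local_rank 1 on this node...
```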
We need to work with the lightning team to make sure that the ranks displayed here match the ones used in lightning, which isn't guaranteed to be the case right now as @kiukchung and I discovered the other day.
Description
Making a couple of requests to improve QoL on SLURM
Detailed Proposal
It would be helpful to have -

- The `squeue` command including the job name. If our jobs are all run via torchx, every job will be named `train_app-{i}`, which makes it hard to identify which experiment / project the job is from.
- The `time` argument doesn't say what the unit is - maybe we just follow the SLURM API, but it would be nice if we clarified that.
- `squeue` logs show every node as a separate line - so a 32 node job would take 32 lines instead of 1. This just makes it harder to monitor jobs - not a technical issue, just a QoL one :)
- The job logs are created in per node `slurm-{job-id}-train_app-{node-id}.out` files and a single `slurm-{job-id}.out`. Normally, our jobs instead have logs of the form `{job-id}-{node-id}.out` and `{job-id}-{node-id}.err` (per node) - the separation between `stderr` and `stdout` helps find which machine actually crashed more easily. And I'm not sure what `slurm-{job-id}.out` corresponds to - maybe it's a consequence of the heterogeneous jobs?
- With torchelastic, it becomes harder to debug which node crashed since every node logs a crash (so grepping for `Traceback` will return each log file instead of just the node which originally crashed) - maybe there is a way to figure this out and I just don't know what to look for?
- `global_rank` is not equal to `local_rank + node_id * gpus_per_node`, i.e. the global rank 0 can be on node 3.
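To make the last point concrete, this is the mapping one might naively expect (hypothetical helper, for illustration only); with torchelastic the actual assignment comes from the rendezvous, so it need not hold:

```python
def naive_global_rank(node_id: int, local_rank: int, gpus_per_node: int) -> int:
    # Contiguous per-node layout that one might assume.
    return node_id * gpus_per_node + local_rank

# e.g. naive_global_rank(node_id=3, local_rank=0, gpus_per_node=8) == 24,
# yet in practice global rank 0 can end up on node 3.
```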