pytorch / torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
https://pytorch.org/torchx

SLURM quality of life improvements #405

Open mannatsingh opened 2 years ago

mannatsingh commented 2 years ago

Description

Making a couple of requests to improve QoL on SLURM

Detailed Proposal

It would be helpful to have -

d4l3k commented 2 years ago

adding this support for slurm wouldn't be too bad:

1. generalize the workspace file logic from docker_workspace (.torchxignore)
2. add a job_dir argument to allow specifying an isolation env
3. change the launch code to cp + chdir
4. add a statefile (.torchxjobdirs) so torchx log knows where to find logs for slurm

Something like job_dir we could relatively easily extend to local_cwd and local_docker -- it would be more complex for k8s/batch/ray.
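For illustration, a rough sketch of what steps 2-4 could look like (a hypothetical helper, not TorchX's actual scheduler code; the `job_dir` layout, the `.torchxjobdirs` JSON format, and the skipped `.torchxignore` filtering are all assumptions):

```python
import json
import shutil
import subprocess
from pathlib import Path


def launch_in_job_dir(workspace: str, job_dir: str, sbatch_script: str) -> str:
    """Copy the workspace into an isolated job_dir, launch from there, and
    record the mapping in a .torchxjobdirs statefile so a `torchx log`-style
    command could later locate the per-node log files."""
    job_dir_path = Path(job_dir)
    job_dir_path.mkdir(parents=True, exist_ok=True)

    # Steps 2/3: snapshot the workspace (a .torchxignore filter would go here)
    # and run sbatch from inside the copy so relative paths resolve there.
    snapshot = job_dir_path / "workspace"
    shutil.copytree(workspace, snapshot, dirs_exist_ok=True)

    proc = subprocess.run(
        ["sbatch", "--parsable", sbatch_script],
        cwd=snapshot,
        capture_output=True,
        text=True,
        check=True,
    )
    job_id = proc.stdout.strip()

    # Step 4: persist job_id -> job_dir so log lookup doesn't depend on cwd.
    statefile = Path.home() / ".torchxjobdirs"
    state = json.loads(statefile.read_text()) if statefile.exists() else {}
    state[job_id] = str(job_dir_path)
    statefile.write_text(json.dumps(state, indent=2))
    return job_id
```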

d4l3k commented 2 years ago

For the heterogeneous jobs displaying differently, that's tricky in the current model. Macros like replica_id generally need to be applied on a per-worker basis. If we wrap the app in a runtime, that does allow us to materialize them later, though it adds an extra dependency. Slurm virtualenv/conda environments will have TorchX installed anyway in most cases, so that's not necessarily a blocker, but it changes the model from what we've had so far.

https://github.com/pytorch/torchx/blob/main/torchx/specs/api.py#L138
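A minimal sketch of the kind of per-worker substitution being described (plain string replacement over an assumed command template; illustrative only, not the code behind the link above):

```python
from dataclasses import dataclass
from typing import List

# Placeholder strings similar in spirit to the TorchX macros.
REPLICA_ID = "${replica_id}"
APP_ID = "${app_id}"


@dataclass
class ReplicaCmd:
    replica_id: int
    args: List[str]


def materialize(args: List[str], app_id: str, num_replicas: int) -> List[ReplicaCmd]:
    """Expand macros per replica: each worker ends up with its own concrete
    argv, which is why the values can't be resolved once for the whole job."""
    cmds = []
    for rid in range(num_replicas):
        resolved = [
            a.replace(REPLICA_ID, str(rid)).replace(APP_ID, app_id) for a in args
        ]
        cmds.append(ReplicaCmd(replica_id=rid, args=resolved))
    return cmds


# e.g. materialize(["--rank", "${replica_id}", "--job", "${app_id}"], "job-42", 4)
```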

I did look, but it doesn't appear that sacct/squeue has a way to hide child jobs. You can use torchx status, so we could add a torchx queue method to render this better for all schedulers.
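As a rough illustration of what a torchx queue-style view could do, here is a hypothetical grouping over squeue output (the format string and the `<jobid>+<offset>` heterogeneous-job ID convention are assumptions about the local Slurm setup, not TorchX code):

```python
import subprocess
from collections import defaultdict


def grouped_queue() -> None:
    """List queued jobs, collapsing heterogeneous-job components
    (reported by squeue as JOBID+OFFSET) under their parent job id."""
    out = subprocess.run(
        ["squeue", "-h", "-o", "%i|%j|%T"],  # job id | name | state
        capture_output=True,
        text=True,
        check=True,
    ).stdout

    groups = defaultdict(list)
    for line in out.splitlines():
        job_id, name, state = line.split("|")
        parent = job_id.split("+")[0]  # het-job components share a parent id
        groups[parent].append((job_id, name, state))

    for parent, components in groups.items():
        # One row per logical job; the component count hints at het-job structure.
        name, state = components[0][1], components[0][2]
        print(f"{parent}  {name}  {state}  ({len(components)} component(s))")


if __name__ == "__main__":
    grouped_queue()
```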

mannatsingh commented 2 years ago

> You can use torchx status, so we could add a torchx queue method to render this better for all schedulers.

I think it's hard to see us migrating to torchx status - squeue gives the status of all jobs, which is what I normally check. torchx wouldn't even be aware of all the jobs being run (since they might have been queued outside of torchx). Even if it were, that introduces a new workflow, which I'd imagine most people would want to avoid (unless it gave them some benefit).

kiukchung commented 2 years ago

re: the job logs being created in per-node files

https://github.com/pytorch/torchx/pull/412 makes it so that when running with dist.ddp the node stdout and stderr log lines are prefixed with the local_rank of the worker that produced that line. So you'd see something akin to this:

[screenshot: node log lines prefixed with the local rank of the worker that produced them]
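A minimal sketch of the general technique, prefixing each line of a worker's output with its local rank (illustrative only, not the implementation in the PR above):

```python
import subprocess
import sys
from typing import List


def run_with_prefix(local_rank: int, cmd: List[str]) -> int:
    """Run one worker process and echo its combined stdout/stderr,
    prefixing every line with the worker's local rank."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        sys.stdout.write(f"[{local_rank}] {line}")
    return proc.wait()


# e.g. run_with_prefix(0, [sys.executable, "train.py"])
```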

mannatsingh commented 2 years ago

> https://github.com/pytorch/torchx/pull/412 makes it so that when running with dist.ddp the node stdout and stderr log lines are prefixed with the local_rank of the worker that produced that line...

We need to work with the Lightning team to make sure that the ranks displayed here match the ones used in Lightning, which isn't guaranteed to be the case right now, as @kiukchung and I discovered the other day.
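One way to sanity-check this in a training script (a sketch, assuming the launcher exposes the local rank via the LOCAL_RANK environment variable and that the framework rank is passed in by the caller; both are assumptions to verify, not a given):

```python
import logging
import os

log = logging.getLogger(__name__)


def report_ranks(framework_local_rank: int) -> None:
    """Log the rank the training framework reports next to the LOCAL_RANK
    env var set by the launcher; if these differ, the log-line prefixes
    won't line up with the ranks used inside the training code."""
    env_rank = os.environ.get("LOCAL_RANK", "<unset>")
    log.info(
        "framework local_rank=%s, LOCAL_RANK env=%s", framework_local_rank, env_rank
    )


# e.g. from a Lightning hook: report_ranks(self.trainer.local_rank)
```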