OFSkean opened this issue 1 year ago
Hi @OFSkean, my first instinct would be to only use a single task and use Trainer(accelerator="gpu", devices=4, strategy="ddp") in your train.py to spin up the parallel GPU processes. You will need to put wandb.init() in an if block like this:
if __name__ == "__main__":
    # Get args
    args = parse_args()
    if args.local_rank == 0:  # only on main process
        # Initialize wandb run
        run = wandb.init(
            entity=args.entity,
            project=args.project,
        )
        # Train model with DDP
        train(args, run)
    else:
        train(args)
That way you only create a run in the rank 0 process.
I don't think we have a complete example of Slurm + Sweep + DDP, but I'm happy to work through this if that doesn't work.
Hi @nate-wandb,
I tried setting ntasks=1, and while that solves the problem of wandb agent being called too many times, it causes issues with Pytorch Lightning. Per this doc, when using SLURM with Lightning, ntasks must equal the number of devices:
There are two parameters in the SLURM submission script that determine how many processes will run your training: the #SBATCH --nodes=X setting and the #SBATCH --ntasks-per-node=Y setting. The numbers there need to match what is configured in your Trainer in the code: Trainer(num_nodes=X, devices=Y). If you change the numbers, update them in BOTH places.
I don't know enough about Lightning to know why that's required when using SLURM, but the upshot is that I have to keep ntasks=4.
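For concreteness, here is a minimal sketch of the matching the Lightning docs describe (the SBATCH values are illustrative and the directives are shown only as comments):

import pytorch_lightning as pl

# In the sbatch script:
#   #SBATCH --nodes=1
#   #SBATCH --ntasks-per-node=4
#   #SBATCH --gres=gpu:4

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,      # must equal --ntasks-per-node (= GPUs per node)
    num_nodes=1,    # must equal --nodes
    strategy="ddp",
)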
Ok I figured out a way to do this. It's really ugly but it works.
Sweep Configuration
project: slurm_test
name: my_sweep
program: main.py
command:
- ${env}
- echo   # <-- this line is different than usual
- python3
- ${program}
- ${args}
SLURM sbatch script
#!/bin/bash
# NUMBER OF AGENTS TO REGISTER AS WANDB AGENTS
# SHOULD BE --array=1-X, where X is the estimated number of runs
#SBATCH --array=1-4 #e.g. 1-4 will create agents labeled 1,2,3,4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=8
#SBATCH --ntasks=4 # must equal number of gpus, as required by Lightning
# .... other SLURM configuration like partition, time, etc ....
# .... module purge and activate the conda environment if needed ....

# SET SWEEP_ID HERE. Note: the sweep must already be created on wandb before submitting the job
SWEEP_ID="**************************************"
API_KEY="******************************************"
# LOG IN IN ALL TASKS
srun wandb login $API_KEY
# adapted from https://stackoverflow.com/questions/11027679/capture-stdout-and-stderr-into-different-variables
# RUN WANDB AGENT IN ONE TASK
{
IFS=$'\n' read -r -d '' SWEEP_DETAILS; RUN_ID=$(echo $SWEEP_DETAILS | sed -e "s/.*\[\([^]]*\)\].*/\1/g" -e "s/[\'\']//g")
IFS=$'\n' read -r -d '' SWEEP_COMMAND;
} < <((printf '\0%s\0' "$(srun --ntasks=1 wandb agent --count 1 $SWEEP_ID)" 1>&2) 2>&1)
SWEEP_COMMAND="${SWEEP_COMMAND} --wandb_resume_version ${RUN_ID}"
# WAIT FOR ALL TASKS TO CATCH UP
wait
# RUN SWEEP COMMAND IN ALL TASKS
srun $SWEEP_COMMAND
Python code using Pytorch Lightning DDP
# ... whatever code comes before ...

wandb_logger = pl.loggers.WandbLogger(name=args.experiment_name, version=args.wandb_resume_version, resume="must")

# once again, pytorch lightning requires setting:
#   devices   = number of allocated slurm gpus = number of slurm tasks
#   num_nodes = number of slurm nodes
trainer = pl.Trainer(devices=args.devices, num_nodes=args.num_nodes,
                     accelerator='gpu', strategy='ddp_find_unused_parameters_false',
                     logger=wandb_logger)

# ... whatever training code comes after ...
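The snippet above assumes an argument parser roughly along these lines (a sketch: only the names the code references are shown, and the defaults are illustrative):

import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--experiment_name", type=str, default="sweep-run")
    parser.add_argument("--wandb_resume_version", type=str, required=True,
                        help="run id appended to SWEEP_COMMAND by the sbatch script")
    parser.add_argument("--devices", type=int, default=4)     # = SLURM ntasks
    parser.add_argument("--num_nodes", type=int, default=1)   # = SLURM nodes
    # ... the sweep's hyperparameters (${args}) are added here as well ...
    return parser.parse_args()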
So the basic flow of what's going on here:

1. Make the sweep configuration in sweep.yaml. Instead of the default python3 {args}, set the command to echo python3 {args}. Run wandb sweep sweep.yaml. Put the created sweep_id and your API key into the SLURM script.
2. When the SLURM script starts running, ALL of the ntasks tasks will execute wandb login $API_KEY.
3. Only ONE task will execute wandb agent --count 1 $SWEEP_ID. This will create a run for the sweep and capture the echoed python3 {args} command via stdout and the run_id via stderr.
4. Because the wandb sweep command is echo rather than python3, the call to wandb agent will finish immediately. So we have to resume the run with RUN_ID in our Pytorch Lightning code. I use an argument called wandb_resume_version to do this.
5. ALL tasks will execute the SWEEP_COMMAND.
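For reference, the WandbLogger(version=..., resume="must") call in the Lightning code is just resuming the run the agent already created; with the plain wandb API that would look something like this (a sketch, where run_id stands in for the value passed as --wandb_resume_version):

import wandb

# run_id is the id parsed from the agent's stderr in the sbatch script
run = wandb.init(id=run_id, resume="must")

The hyperparameters themselves still arrive as command-line arguments via ${args}, so nothing extra needs to be read back from the resumed run.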
So once again, this is a roundabout way to do it, but I couldn't find any better solution. I'll also say that this would be greatly simplified if there were a way to get the run_id into the wandb command. Is it possible to add run_id like below in sweep.yaml?
command:
- ${env}
- echo
- python3
- ${program}
- ${args}
- ${run_id}
Hi @OFSkean, I can make a feature request to add - ${run_id} to the command. I'm glad you have a workaround in the meantime, even though it's not ideal.
I'll also note that a general way to run SLURM + Sweep + DDP has been requested as well, since the above seems to be more of a workaround than an official way to run this.
Hi @nate-wandb, yes please make a feature request for ${run_id}. It would be great if it adds the run_id as an argument to the command, such as python3 main.py --random-arg 42 --run_id abcdefg.
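If ${run_id} were supported, main.py could consume it along these lines (hypothetical: --run_id is not an existing macro or flag, and the argument name is only illustrative):

import argparse
import wandb

parser = argparse.ArgumentParser()
parser.add_argument("--random-arg", type=int)
parser.add_argument("--run_id", type=str)  # hypothetical: filled in by a future ${run_id} macro
args = parser.parse_args()

# resume the run the sweep agent created instead of starting a new one
run = wandb.init(id=args.run_id, resume="must")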
Ok, I've submitted this to the team and can follow up once they have a chance to look into this.
Hi, any updates yet?
Hi @TheLukaDragar, we are currently working on better supporting Sweeps on Slurm as the solution but unfortunately this work is not expected to land until early 2024
Hi @nate-wandb, any updates on supporting Sweeps on Slurm?
Hi @sararb, the solution for this is going to be enabling our launch product to run Slurm jobs. Unfortunately, this is still potentially a few quarters away. I'm bumping the priority on this though since this has been requested several times to see if we can get this planned sooner. I'll provide an update as soon as there is any progress on this
thesofakillers commented: I can't even get WandB Sweep + SLURM + Pytorch Lightning (1 GPU) working. I just get "ValueError: signal only works in main thread".
Hi @nate-wandb any updates on this?
Hey @MostHumble, no updates on this as of yet. We are still working on Sweeps/Launch on SLURM and it is the way to fix this, but we still do not have an ETA. Apologies for the delay.
my first instinct would be to only use a single task and use Trainer(accelerator="gpu", devices=4, strategy="ddp") in your train.py to spin up the parallel GPU processes. You will need to put wandb.init() in an if block like this:
From: https://github.com/wandb/wandb/issues/5695#issuecomment-1587958030
I tried this way, but since I'm using the Python CLI, if we put wandb.init() only in rank 0, the next sweep agent cannot initialize on multiple GPUs...
I'm trying to register SLURM nodes as agents for sweeps. I'm using Pytorch Lightning with DDP and multiple GPUs. Following the recommendations from Pytorch Lightning (here), my SLURM sbatch script is something like below.

My jobs require multiple gpus (4 in this example) to run. Note that Pytorch Lightning requires running commands with srun and setting ntasks=ngpus when using multiple gpus. Together these cause the srun lines to be run ntasks times, essentially creating ntasks processes running in parallel.

The problem I'm having with sweeping is that since wandb agent $sweep_id gets run ntasks times, it creates ntasks agents, each running a separate configuration from the sweep. Furthermore, this causes Lightning DDP to not bind them together, which restricts each agent to only having 1 GPU. This would actually be fine if 1 GPU per agent were enough, but I need all gpus.

There are some potential solutions I thought of, but they have their downsides:
1. Setting ntasks=1 and using ddp_spawn for the Lightning trainer strategy. This way wandb agent only gets called once, and the appropriate number of processes are spawned for training. The downside is that ddp_spawn is widely discouraged for performance reasons.
2. Switching from the CLI to the Python wandb API, and registering the agent from inside my program.py (see the sketch after this list). The Python API seems more flexible, but I haven't used it, so I don't actually know if this would work. The downside is that I'd prefer to stick to the wandb CLI.
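For reference, option 2 would look roughly like this with the Python API (a sketch only, untested with SLURM + DDP; the sweep id string is a placeholder):

import wandb

def train():
    # wandb.init() inside an agent picks up the hyperparameters the sweep assigns to this run
    run = wandb.init()
    config = run.config
    # ... build the Lightning Trainer and call fit() here ...

if __name__ == "__main__":
    # registers this process as a sweep agent and runs one configuration
    wandb.agent("entity/project/sweep_id", function=train, count=1)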
The behavior I'd like to see is wandb agent $sweep_id running the same program + hyperparameters in parallel ntasks times, so that Lightning DDP can bind them together and use all gpus. I'm wondering if there is a way to accomplish this with the wandb CLI, for example something like wandb agent <agent_id>, so that the multiple calls to wandb agent get linked to the same agent.