chen-lee-li closed this issue 1 year ago
If you didn't change the training code, this command doesn't raise the issue:
python -m torch.distributed.launch \
--nproc_per_node number_of_gpus train.py \
--model_path="bigcode/santacoder" \
--dataset_name="bigcode/the-stack-dedup" \
--subset="data/shell" \
--data_column "content" \
--split="train" \
--seq_length 2048 \
--max_steps 30000 \
--batch_size 2 \
--gradient_accumulation_steps 8 \
--learning_rate 5e-5 \
--num_warmup_steps 500 \
--eval_freq 3000 \
--save_freq 3000 \
--log_freq 1 \
--num_workers="$(nproc)"
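Since torch.distributed.launch is deprecated (see the warning below), the same run can also be started with torchrun. A minimal sketch, assuming train.py reads the local rank from os.environ['LOCAL_RANK'] rather than expecting a --local-rank argument:

torchrun --nproc_per_node number_of_gpus train.py \
--model_path="bigcode/santacoder" \
--dataset_name="bigcode/the-stack-dedup" \
--subset="data/shell" \
--data_column "content" \
--split="train" \
--seq_length 2048 \
--max_steps 30000 \
--batch_size 2 \
--gradient_accumulation_steps 8 \
--learning_rate 5e-5 \
--num_warmup_steps 500 \
--eval_freq 3000 \
--save_freq 3000 \
--log_freq 1 \
--num_workers="$(nproc)"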
If the issue persists, can you provide details about your command and library versions?
lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please change it to read from
`os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
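For completeness, here is a minimal sketch of the change the FutureWarning asks for, assuming the training script currently takes the local rank as a command-line argument (the argument handling below is hypothetical and not taken from train.py):

import argparse
import os

parser = argparse.ArgumentParser()
# Old style: torch.distributed.launch passed the local rank on the command line.
parser.add_argument("--local_rank", "--local-rank", dest="local_rank", type=int, default=-1)
args, _ = parser.parse_known_args()

# New style: torchrun (and launch with --use-env) exports LOCAL_RANK instead,
# so read it from the environment and fall back to the CLI flag if present.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
print(f"running on local rank {local_rank}")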