Open Looong01 opened 4 months ago
@Looong01, it seems your localhost is not configured for password-less ssh, which is a requirement for DeepSpeed. Please see https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node
Although you are using a single node, the --autotuning option operates as if each rank is on a different host. You could try configuring the hostfile with hostnames that work with password-less ssh.
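For illustration, here is a minimal sketch of one way to set up password-less ssh to localhost and write a one-line hostfile. The key type, the key path, the function names, and the hostfile location are assumptions for the example, not details taken from this issue, and the sketch may need adjusting for your sshd configuration.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and paths below are hypothetical.

setup_passwordless_ssh() {
    local key="$HOME/.ssh/id_ed25519"   # assumed key location
    mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
    # Create a passphrase-less key pair only if none exists yet.
    [ -f "$key" ] || ssh-keygen -q -t ed25519 -N "" -f "$key"
    # Authorize the public key for logins to this machine (no duplicates).
    touch "$HOME/.ssh/authorized_keys"
    grep -qxF "$(cat "$key.pub")" "$HOME/.ssh/authorized_keys" \
        || cat "$key.pub" >> "$HOME/.ssh/authorized_keys"
    chmod 600 "$HOME/.ssh/authorized_keys"
}

write_hostfile() {
    # Usage: write_hostfile <path> <hostname> <gpu_count>
    # Writes one line in the "<hostname> slots=<n>" hostfile format.
    printf '%s slots=%s\n' "$2" "$3" > "$1"
}
```

A possible usage: run setup_passwordless_ssh, check that "ssh -o BatchMode=yes localhost true" succeeds without a password prompt, then run "write_hostfile ./hostfile localhost 2" and pass --hostfile=./hostfile to the deepspeed launcher. Whether this resolves the autotuning error on a given machine is not confirmed here.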
Describe the bug
When I use the --autotuning run argument to train on a single node with 2 RTX 6000 Ada GPUs, it returns "localhost: Permission denied, please try again."
To Reproduce
Steps to reproduce the behavior:
$ conda activate PyTorch
$ deepspeed --autotuning run --num_gpus=2 --num_nodes=1 /data/01_Codes/Geneformer/examples/pretraining_new_model/pretrain_geneformer_w_deepspeed.py --deepspeed /data/01_Codes/Geneformer/examples/pretraining_new_model/ds_config.json
Expected behavior
The training process starts.
ds_report output
System info (please complete the following information):
torch version is 2.1.1
Launcher context
$ deepspeed --autotuning run --num_gpus=2 --num_nodes=1 /data/01_Codes/Geneformer/examples/pretraining_new_model/pretrain_geneformer_w_deepspeed.py --deepspeed /data/01_Codes/Geneformer/examples/pretraining_new_model/ds_config.json
Additional context
Full error messages:
The ds configuration file: