Open ybsu opened 1 year ago
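I ran the training command from the VATEX section and got the error below. From what I found online, manually assigning os.environ['RANK'] gets past this error, but then a later one appears: a KeyError on os.environ['WORLD_SIZE']. I suspect the underlying problem is not that simple, and I'm stuck. Could anyone help? Getting the program to run at all is the first step. Thanks.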
File "src/tasks/run_caption_VidSwinBert.py", line 689, in <module>
    main(args)
File "src/tasks/run_caption_VidSwinBert.py", line 675, in main
    args, vl_transformer, optimizer, scheduler = mixed_precision_init(args, vl_transformer)
File "src/tasks/run_caption_VidSwinBert.py", line 105, in mixed_precision_init
    model, optimizer, _, _ = deepspeed.initialize(
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/__init__.py", line 129, in initialize
    dist.init_distributed(dist_backend=dist_backend, dist_init_required=dist_init_required)
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 592, in init_distributed
    init_deepspeed_backend(get_accelerator().communication_backend_name(), timeout, init_method)
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 148, in init_deepspeed_backend
    rank = int(os.environ["RANK"])
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'
Hello! Did you manage to get it running? Was the problem solved?
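I'm exploring multi-GPU training myself. All I can say about "RANK" and "WORLD_SIZE" is that they are two parameters required for multi-GPU training. The source code that reads them usually looks like this: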
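import os
import torch

def init_distributed_mode(args):  # wrapper name is illustrative; the snippet is a function body
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        # Variables exported by a launcher such as torchrun
        args.rank = int(os.environ['RANK'])
        args.world_size = int(os.environ['WORLD_SIZE'])
        args.gpu = int(os.environ['LOCAL_RANK'])
    elif 'SLURM_PROCID' in os.environ:
        # Job started under SLURM: derive the GPU index from the process ID
        args.rank = int(os.environ['SLURM_PROCID'])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print('Not using distributed mode')
        args.distributed = False
        return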
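That's the only clue I can offer (laughing-crying emoji). I only found this thread because I was searching for how to deal with these two keys being missing from the environment.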
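For what it's worth, the traceback suggests the script was started with plain `python`, so nothing exported the variables that a distributed launcher (for example `torchrun` or the `deepspeed` command) would normally set. Below is a minimal sketch of a single-process workaround, not the repository's own method: it fills in whichever variables are missing before `deepspeed.initialize` is called. The helper name `ensure_distributed_env` and the default values (one process, local master address, port 29500) are my assumptions for a single-GPU run.

```python
import os

# Assumed defaults for a single-process, single-GPU run.
# A launcher such as torchrun would normally export these itself.
SINGLE_PROCESS_DEFAULTS = {
    "RANK": "0",
    "WORLD_SIZE": "1",
    "LOCAL_RANK": "0",
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
}


def ensure_distributed_env(env=os.environ):
    """Fill in missing distributed variables; return the names that were added.

    Hypothetical helper: values already present in `env` are never overwritten,
    so the function does nothing when a real launcher has set everything.
    """
    added = []
    for key, value in SINGLE_PROCESS_DEFAULTS.items():
        if key not in env:
            env[key] = value
            added.append(key)
    return added
```

Calling `ensure_distributed_env()` near the top of the training script, before any DeepSpeed call, should avoid both the `RANK` and the `WORLD_SIZE` KeyError in the single-process case. For real multi-GPU training, starting the script through a launcher is the more usual route, since each process needs a different `RANK` and `LOCAL_RANK`.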