microsoft / SwinBERT

Research code for CVPR 2022 paper "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning"
https://arxiv.org/abs/2111.13196
MIT License

raise KeyError(key) from None KeyError: 'RANK' #48

Open ybsu opened 1 year ago

ybsu commented 1 year ago

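I got this error when running the training command from the VATEX section. After searching online, I found that manually assigning os.environ['RANK'] gets past it, but then a later step fails with a KeyError on os.environ['WORLD_SIZE']. I suspect the underlying problem is not that simple, and I'm stuck. Could anyone help? Getting the program to run at all is the first step. Thanks.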

    File "src/tasks/run_caption_VidSwinBert.py", line 689, in <module>
      main(args)
    File "src/tasks/run_caption_VidSwinBert.py", line 675, in main
      args, vl_transformer, optimizer, scheduler = mixed_precision_init(args, vl_transformer)
    File "src/tasks/run_caption_VidSwinBert.py", line 105, in mixed_precision_init
      model, optimizer, _, _ = deepspeed.initialize(
    File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/__init__.py", line 129, in initialize
      dist.init_distributed(dist_backend=dist_backend, dist_init_required=dist_init_required)
    File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 592, in init_distributed
      init_deepspeed_backend(get_accelerator().communication_backend_name(), timeout, init_method)
    File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 148, in init_deepspeed_backend
      rank = int(os.environ["RANK"])
    File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/os.py", line 675, in __getitem__
      raise KeyError(key) from None
    KeyError: 'RANK'

Accept-AI commented 1 year ago

Hello! Did you manage to get it running? Has the problem been solved?

a7f4123 commented 6 months ago

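I'm exploring multi-GPU training. As for "RANK" and "WORLD_SIZE", all I can say is that they are two parameters required for multi-GPU training. The source code usually looks something like this: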

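    import os
    import torch


    def init_distributed_mode(args):  # wrapper name assumed; the snippet showed only the body
        if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
            # Started by a launcher such as torchrun, which exports these variables.
            args.rank = int(os.environ['RANK'])
            args.world_size = int(os.environ['WORLD_SIZE'])
            args.gpu = int(os.environ['LOCAL_RANK'])
        elif 'SLURM_PROCID' in os.environ:
            # Started under SLURM: derive the local GPU index from the global rank.
            args.rank = int(os.environ['SLURM_PROCID'])
            args.gpu = args.rank % torch.cuda.device_count()
        else:
            # Neither launcher was detected, so fall back to a single process.
            print('Not using distributed mode')
            args.distributed = False
            return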

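That's the only clue I can offer (laughing-crying). I only found this thread while searching for how to deal with these two keys missing from the environment variables.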
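
For anyone who only needs a single process on a single GPU, one possible workaround is to fill in every missing variable at once, rather than assigning them one by one as each KeyError appears. The sketch below is an assumption-laden illustration, not the repository's method: the helper name `ensure_single_process_env`, the default values, and the port number are all hypothetical, and the variable list is the set that PyTorch's environment-variable initialization expects plus `LOCAL_RANK`. Launchers such as `torchrun` or the `deepspeed` command normally export these for you, so this is only relevant when the script is started with plain `python`.

```python
import os

# Values suited ONLY to one process on one machine (assumed defaults).
# MASTER_PORT is an arbitrary choice; any free port should work.
SINGLE_PROCESS_DEFAULTS = {
    "RANK": "0",
    "WORLD_SIZE": "1",
    "LOCAL_RANK": "0",
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
}


def ensure_single_process_env(environ=os.environ):
    """Set any missing distributed variables; return the names that were added.

    Variables that already exist are left untouched, so values exported by a
    real launcher are never overwritten.
    """
    added = []
    for name, value in SINGLE_PROCESS_DEFAULTS.items():
        if name not in environ:
            environ[name] = value
            added.append(name)
    return added
```

Calling `ensure_single_process_env()` near the top of the training script, before the distributed backend is initialized, should avoid both the `RANK` and the `WORLD_SIZE` KeyError in the single-process case. For real multi-GPU training it is safer to start the script through a launcher so that each process receives its own rank.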