Open Heavenbest opened 2 years ago
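Hello, it looks like the problem is caused by validation using some parameters that only exist in distributed training.
For single-GPU training, consider launching with:
python -m torch.distributed.launch --nproc_per_node=1 --master_port=4321 basicsr/train.py -opt _PATH_TO_YOUR_CONFIG_ --launcher pytorch
This avoids the trouble of special-casing the single-GPU path.
Thank you for your interest in NAFNet.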
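Hello. Your team has more and better GPUs, but an ordinary student lab like ours has only a single GPU. How should we adjust the settings in that case? For single-GPU training, do the iteration count (iters) and the learning rate (lr) need to change? You trained on 8 GPUs. If I multiply iters by 8, divide lr by 8, and increase the batch size as far as GPU memory allows, is that the right approach?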
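The thread does not answer this question, so the following is only a sketch of one widely used heuristic, the linear scaling rule: scale the learning rate in proportion to the total batch size, and scale the iteration count inversely so that the number of training samples seen stays the same. Whether it suits NAFNet and its optimizer is an assumption, not something the authors confirm here. The function name `rescale_schedule` and all numbers in the example are hypothetical.

```python
def rescale_schedule(base_lr, base_iters, base_total_batch, new_total_batch):
    """Rescale lr and iteration count for a different total batch size.

    Heuristic only (linear scaling rule): lr scales with the total batch
    size, iterations scale inversely so the samples seen stay constant.
    """
    ratio = new_total_batch / base_total_batch
    new_lr = base_lr * ratio
    new_iters = round(base_iters / ratio)
    return new_lr, new_iters


if __name__ == "__main__":
    # Hypothetical reference setup: 8 GPUs x 8 images per GPU = 64 per step.
    base_total_batch = 8 * 8
    # Hypothetical single-GPU setup: 1 GPU x 8 images per step.
    new_total_batch = 1 * 8
    lr, iters = rescale_schedule(1e-3, 200_000, base_total_batch, new_total_batch)
    print(f"lr={lr:.3e}, iters={iters}")
```

With the per-GPU batch size unchanged, this reproduces the proposal in the question (lr divided by 8, iters multiplied by 8). If the single GPU can hold a larger batch, the ratio moves closer to 1 and both adjustments shrink accordingly.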
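Has the single-GPU training problem been solved? I ran into the same issue. However, I don't really understand the command the author provided: python -m torch.distributed.launch --nproc_per_node=1 --master_port=4321 basicsr/train.py -opt _PATH_TO_YOUR_CONFIG_ --launcher pytorch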
PATH_TO_YOUR_CONFIG is simply your config file; just use it as before.
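For readers unsure what the launch command does: `torch.distributed.launch` with `--nproc_per_node=1` starts a single worker and sets the rendezvous environment variables (such as `MASTER_ADDR` and `MASTER_PORT`), and the training script run with `--launcher pytorch` then presumably calls `torch.distributed.init_process_group`. With a process group of world size 1, collective calls like `torch.distributed.reduce` become valid no-ops. The standalone sketch below imitates that situation by hand. It uses the CPU `gloo` backend and a randomly chosen free port as assumptions for the demo, and the helper names are hypothetical, not part of NAFNet.

```python
import os
import socket

import torch
import torch.distributed as dist


def find_free_port() -> int:
    """Ask the OS for an unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def single_process_reduce_demo() -> list:
    """Create a world-size-1 process group, run reduce, then clean up."""
    # The launcher normally sets these; this demo overwrites them by hand.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(find_free_port())
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    try:
        metrics = torch.tensor([31.5, 0.92])
        # With one process, the "sum over ranks" is just the local tensor.
        dist.reduce(metrics, dst=0)
        return metrics.tolist()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(single_process_reduce_demo())
```

Without the `init_process_group` call, the same `dist.reduce` line raises the "Default process group has not been initialized" error reported in this thread.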
2022-08-18 15:41:56,733 INFO: Dataset PairedImageDataset - gopro-test is created.
2022-08-18 15:41:56,733 INFO: Number of val images/folders in gopro-test: 1194
.. cosineannealingLR
2022-08-18 15:41:59,165 INFO: Model [ImageRestorationModel] is created.
2022-08-18 15:41:59,321 INFO: Start training from epoch: 0, iter: 0
2022-08-18 15:42:54,563 INFO: [NAFNe..][epoch: 0, iter: 200, lr:(9.998e-04,)] [eta: 1:30:56, time (data): 0.272 (0.003)] l_pix: -3.8901e+01
2022-08-18 15:43:49,139 INFO: [NAFNe..][epoch: 0, iter: 400, lr:(9.990e-04,)] [eta: 1:29:35, time (data): 0.272 (0.003)] l_pix: -4.1002e+01
2022-08-18 15:44:44,883 INFO: [NAFNe..][epoch: 0, iter: 600, lr:(9.978e-04,)] [eta: 1:29:08, time (data): 0.275 (0.003)] l_pix: -4.7450e+01
2022-08-18 15:45:41,114 INFO: [NAFNe..][epoch: 0, iter: 800, lr:(9.961e-04,)] [eta: 1:28:39, time (data): 0.275 (0.003)] l_pix: -5.7533e+01
2022-08-18 15:46:37,352 INFO: [NAFNe..][epoch: 1, iter: 1,000, lr:(9.939e-04,)] [eta: 1:27:59, time (data): 0.278 (0.005)] l_pix: -5.5011e+01
2022-08-18 15:35:10,613 WARNING: nondist_validation is not implemented. Run dist_validation.
Test train_image_0063_s072: 100%|████████████████████████████████████████████████████████| 1194/1194 [02:23<00:00, 8.32image/s]
Traceback (most recent call last):
  File "/data/pengshan/deblur_image/NAFNet-main/basicsr/train.py", line 305, in <module>
    main()
  File "/data/pengshan/deblur_image/NAFNet-main/basicsr/train.py", line 270, in main
    model.validation(val_loader, current_iter, tb_logger,
  File "/data/pengshan/deblur_image/NAFNet-main/basicsr/models/base_model.py", line 57, in validation
    return self.nondist_validation(dataloader, current_iter, tb_logger,
  File "/data/pengshan/deblur_image/NAFNet-main/basicsr/models/image_restoration_model.py", line 385, in nondist_validation
    self.dist_validation(*args, **kwargs)
  File "/data/pengshan/deblur_image/NAFNet-main/basicsr/models/image_restoration_model.py", line 365, in dist_validation
    torch.distributed.reduce(metrics, dst=0)
  File "/data/pengshan/miniconda3/envs/pytorch-blur/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1469, in reduce
    default_pg = _get_default_group()
  File "/data/pengshan/miniconda3/envs/pytorch-blur/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
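I am training on a single GPU with my own dataset and got the error above. Could you please help me find out what is wrong? Can this be trained on a single GPU?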
I ran into this too. I changed the parameters, but it did not help.
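The traceback in this thread ends in `torch.distributed.reduce` being called while no process group exists. Besides launching through `torch.distributed.launch` as the maintainer suggests, one possible local workaround is to skip the collective call when training is not distributed. The sketch below is an assumption about how such a guard could look, not code from the NAFNet repository; `reduce_metrics_if_distributed` is a hypothetical helper name.

```python
import torch
import torch.distributed as dist


def reduce_metrics_if_distributed(metrics: torch.Tensor, dst: int = 0) -> torch.Tensor:
    """Sum `metrics` onto rank `dst` if a process group exists, else do nothing."""
    if dist.is_available() and dist.is_initialized():
        # Distributed run: accumulate the per-rank metrics on rank `dst`.
        dist.reduce(metrics, dst=dst)
    # Single-process run: the local metrics are already the full result.
    return metrics


if __name__ == "__main__":
    # Hypothetical metric values; no process group has been initialized here.
    local_metrics = torch.tensor([31.5, 0.92])
    print(reduce_metrics_if_distributed(local_metrics))
```

In a plain `python basicsr/train.py ...` run this returns the tensor unchanged instead of raising the `RuntimeError`, while a launcher-based run still performs the reduction.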