Closed HoracceFeng closed 5 years ago
OK, I finally found out why it happens.
Actually, when I changed DataParallel back to DistributedDataParallel, everything works smoothly. So the question becomes: what is the difference between DataParallel and DistributedDataParallel? And why do these two APIs behave differently?
I will close this issue, but I'm still waiting for someone to answer. Thx.
@HoracceFeng DataParallel is used in test.py. DistributedDataParallel is generally more advanced and is used in train.py.
@HoracceFeng also, this is more of a PyTorch question; you might want to post there.
When I train on my own data using a single GPU it works, but when using two or four GPUs an error is raised: AttributeError: 'DataParallel' object has no attribute 'module_list'
@LIUhansen multi-GPU currently operates correctly. For a working multi-GPU environment see https://github.com/ultralytics/yolov3#reproduce-our-environment
Have you addressed this problem?
@yangxu351 Yes, the difference between DataParallel and DistributedDataParallel lies in their implementation. DataParallel is single-process and multi-threaded: it replicates the model onto each GPU at every forward pass and gathers the outputs back onto the primary device. DistributedDataParallel runs one process per GPU and synchronizes gradients across processes with all-reduce, which generally scales better. Thus, each API serves different multi-GPU training scenarios. If you still have questions, I recommend referring to the official PyTorch documentation for detailed explanations.
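One practical consequence of the DataParallel wrapper (and the likely source of the AttributeError above): the original model becomes the wrapper's `.module` attribute, so custom attributes like `module_list` are no longer reachable directly on the wrapped object. A minimal sketch, using a made-up `TinyNet` stand-in rather than the repo's Darknet class:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model, just to illustrate the wrapping behavior.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.module_list = nn.ModuleList([nn.Linear(4, 2)])

    def forward(self, x):
        return self.module_list[0](x)

model = TinyNet()
dp = nn.DataParallel(model)  # the original model is now dp.module

# Attributes of the original model are reachable only via .module:
print(hasattr(dp, "module_list"))         # False
print(hasattr(dp.module, "module_list"))  # True

# Forward still works through the wrapper (with no visible GPUs,
# DataParallel simply calls the underlying module).
out = dp(torch.randn(3, 4))
print(out.shape)  # torch.Size([3, 2])
```

This is why code that worked on a plain model starts failing with `'DataParallel' object has no attribute ...` once the model is wrapped: the access needs to go through `model.module`.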
Hi @glenn-jocher, this error is pretty weird. I use your code on CPU and on a single GPU and everything is perfect, but when I try to use multiple GPUs, this error occurs:
Traceback (most recent call last):
  File "train.py", line 299, in train
    loss, loss_items = compute_loss(pred, targets, model, giou_loss=giou_loss)
  File "/code/utils/utils.py", line 340, in compute_loss
    txy, twh, tcls, tbox, indices, anchor_vec = build_targets(model, targets)
  File "/code/utils/utils.py", line 398, in build_targets
    print("check yololayer.ng", model.module.module_list[16][0].ng, model.module.module_list[23][0].ng)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 535, in __getattr__
    type(self).__name__, name))
AttributeError: 'YOLOLayer' object has no attribute 'ng'
I have checked the YOLOLayer: the attribute ng exists when I define the model using the Darknet class. But when it runs to compute_loss, build_targets cannot find ng in YOLOLayer. It only occurs with multiple GPUs. [I also checked model.module.module_list; the ng just disappears, don't know why.]
BTW, I'm trying to use multiple GPUs without distributed training, so I replaced DistributedDataParallel with DataParallel; not sure if this change causes the error. Need help. I've been stuck on this bug for a whole day ...