ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Error occurs when using multi_gpu train #447

Closed HoracceFeng closed 5 years ago

HoracceFeng commented 5 years ago

Hi @glenn-jocher, this error is pretty weird. I use your code on CPU and on a single GPU and everything is perfect, but when I try to use multiple GPUs, this error occurs:

Traceback (most recent call last):
  File "train.py", line 299, in train
    loss, loss_items = compute_loss(pred, targets, model, giou_loss=giou_loss)
  File "/code/utils/utils.py", line 340, in compute_loss
    txy, twh, tcls, tbox, indices, anchor_vec = build_targets(model, targets)
  File "/code/utils/utils.py", line 398, in build_targets
    print("check yololayer.ng", model.module.module_list[16][0].ng, model.module.module_list[23][0].ng)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 535, in __getattr__
    type(self).__name__, name))
AttributeError: 'YOLOLayer' object has no attribute 'ng'

I have checked the YOLOLayer: the attribute ng exists when I define the model with the Darknet class. But by the time compute_loss runs, build_targets cannot find ng on the YOLOLayer. This only happens with multiple GPUs. [I also checked "model.module.module_list"; ng just disappears, and I don't know why.]

BTW, I am trying to use multiple GPUs without distributed training, so I replaced DistributedDataParallel with DataParallel; I'm not sure whether this change causes the error.

Need help. I've been stuck on this bug for a whole day ...
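For illustration, here is a minimal standalone sketch (not the repository's code) of what seems to happen. If a layer only assigns an attribute such as ng inside its forward pass, then under nn.DataParallel that forward runs on per-GPU replicas, and the replicas are discarded afterwards, so the original module never receives the attribute (the effect only shows up with two or more GPUs):

```python
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Stand-in for YOLOLayer: sets an attribute during forward()."""
    def forward(self, x):
        # Attribute assigned at forward time, similar to how YOLOLayer
        # is assumed to set `ng` when it builds its grids.
        self.ng = x.shape[-1]
        return x

model = ToyLayer().cuda()
wrapped = nn.DataParallel(model)           # replicates the module onto each GPU per forward
wrapped(torch.randn(4, 8, device="cuda"))  # forward runs on the replicas, not on `model`

# The replicas are thrown away after forward, so with >1 GPU the original
# module never gets the attribute, and `model.ng` would raise AttributeError.
print(hasattr(model, "ng"))  # False when more than one GPU is used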

HoracceFeng commented 5 years ago

OK, I finally found out why it happens.

Actually, when I changed DataParallel back to DistributedDataParallel, everything ran smoothly. So the question becomes: what is the difference between DataParallel and DistributedDataParallel, and why do these two APIs behave differently?

I will close this issue, but I'm still waiting for someone to answer. Thanks.

glenn-jocher commented 5 years ago

@HoracceFeng DataParallel is used in test.py. DistributedDataParallel is generally more advanced and is used in train.py.

glenn-jocher commented 5 years ago

@HoracceFeng also, this is more of a PyTorch question; you might want to post it there.

LIUhansen commented 4 years ago

When I train on my own data with a single GPU it works, but with two or four GPUs I get this error: AttributeError: 'DataParallel' object has no attribute 'module_list'
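A minimal sketch of the situation (illustrative names, not the actual training script): once a model is wrapped in nn.DataParallel, its own attributes are only reachable through the wrapper's .module attribute.

```python
import torch.nn as nn

class Darknet(nn.Module):  # stand-in for the repo's Darknet, for illustration only
    def __init__(self):
        super().__init__()
        self.module_list = nn.ModuleList([nn.Conv2d(3, 16, 3)])

model = Darknet()
wrapped = nn.DataParallel(model)

# wrapped.module_list  ->  AttributeError: 'DataParallel' object has no attribute 'module_list'
# The original model lives on the wrapper's .module attribute:
print(wrapped.module.module_list)

# A pattern that works whether or not the model is wrapped:
net = wrapped.module if hasattr(wrapped, "module") else wrapped
print(net.module_list)
```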

glenn-jocher commented 4 years ago

@LIUhansen multi-GPU currently operates correctly. For a working multi-GPU environment see https://github.com/ultralytics/yolov3#reproduce-our-environment

[Screenshot attached: Screen Shot 2019-12-12 at 10:17:30 AM]
yangxu351 commented 4 years ago

> Actually, when I changed DataParallel back to DistributedDataParallel, everything ran smoothly. So the question becomes: what is the difference between DataParallel and DistributedDataParallel?

Have you addressed this problem?

glenn-jocher commented 11 months ago

@yangxu351 Yes, the difference between DataParallel and DistributedDataParallel lies in their implementation. DataParallel is single-process and multi-threaded: it replicates the model onto each GPU at every forward pass, splits the batch across the GPUs, and gathers the outputs back on the primary GPU. DistributedDataParallel runs one process per GPU, keeps a persistent model replica in each process, and synchronizes gradients across processes with all-reduce during the backward pass, which generally makes it faster and better suited to multi-GPU training. Thus, each API serves different multi-GPU training scenarios. If you still have questions, I recommend referring to the official PyTorch documentation for detailed explanations.
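For reference, a minimal single-node sketch of the DistributedDataParallel pattern described above (placeholder model, data, and port; not the repository's train.py):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # One process per GPU; all processes join the same process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")  # arbitrary free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 2).cuda(rank)        # placeholder model
    ddp_model = DDP(model, device_ids=[rank])  # gradients sync via all-reduce

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    x = torch.randn(8, 10, device=rank)        # placeholder batch (use a DistributedSampler in practice)
    loss = ddp_model(x).sum()
    loss.backward()                            # all-reduce of gradients happens here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

By contrast, DataParallel needs no process group and is a single-process, one-line wrap: model = nn.DataParallel(model).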