ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Explanation of Why NMS Is Slow During Custom Dataset Training and How to Solve It #117

Closed wtiandong closed 5 years ago

wtiandong commented 5 years ago

In issue #112, @iadfy says that NMS is slow when training a YOLOv3 network with just two classes. This happened to me too: it takes 2 seconds to forward and backward a batch but 99 seconds to run NMS on the same batch. @iadfy says it seems that a smaller number of classes with smaller filters takes longer during testing, but has no idea how to figure out why, and that any idea would be helpful. Smaller filters taking longer is just a symptom, not the cause.

The real cause is that NMS is a slow algorithm with O(N^2) time complexity. When you train a new network, the weights are randomly initialized, so the output of the YOLOLayer is essentially random numbers centered at 0.5. The NMS threshold used for testing during training is 0.3, so thousands of bounding box candidates are fed into NMS, depending on your input image size and YOLOLayer configuration. What's worse, these candidates have small IoU with high probability because they are random, so the suppression of high-IoU boxes does almost nothing. All of this wastes O(N^2) time running NMS on meaningless random box candidates.
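To see this concretely, here is a minimal timing sketch (my own toy greedy NMS, not the repository's non_max_suppression; box sizes and candidate counts are illustrative). Because the random boxes rarely exceed the IoU threshold, almost nothing gets suppressed and every pass compares the top box against nearly all survivors, so the runtime grows roughly quadratically with the number of candidates:

import time
import torch

def greedy_nms(boxes, scores, iou_thres=0.3):
    # O(N^2) greedy NMS on (x1, y1, x2, y2) boxes, highest score first.
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU of the current best box against all remaining candidates
        xy1 = torch.max(boxes[i, :2], boxes[rest, :2])
        xy2 = torch.min(boxes[i, 2:], boxes[rest, 2:])
        inter = (xy2 - xy1).clamp(min=0).prod(dim=1)
        area_i = (boxes[i, 2:] - boxes[i, :2]).prod()
        area_r = (boxes[rest, 2:] - boxes[rest, :2]).prod(dim=1)
        iou = inter / (area_i + area_r - inter)
        # Random boxes rarely overlap, so almost nothing is suppressed here
        order = rest[iou <= iou_thres]
    return keep

for n in (1000, 2000, 4000):           # candidate count grows with image size / anchors
    xy = torch.rand(n, 2) * 400
    wh = torch.rand(n, 2) * 20 + 1     # small random boxes -> mostly low IoU
    boxes = torch.cat([xy, xy + wh], dim=1)
    scores = torch.rand(n)             # untrained net: near-random confidences
    t0 = time.time()
    greedy_nms(boxes, scores)
    print(f"N={n}: {time.time() - t0:.2f} s")   # roughly 4x slower each time N doubles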

How to solve it? You can simply skip testing for the first few epochs. Here's the code:

# Calculate mAP
with torch.no_grad():
    if epoch > 40:
        mAP, R, P = test.test(cfg, data_cfg, weights=latest, batch_size=batch_size, img_size=img_size)

# Write epoch results
if epoch > 40:
    with open('results.txt', 'a') as file:
        file.write(s + '%11.3g' * 3 % (mAP, P, R) + '\n')

This replaces lines 169 to 178 of train.py.

Good Luck.

glenn-jocher commented 5 years ago

@wtiandong yes, this is a good summary of the probable causes. It occurred to me as well that the SGD burn-in functionality might also be contributing, since it is hard-coded for a slow LR ramp during the first epoch, from lr=0.0 to lr=0.001 over the first 1000 batches (as in darknet). If a custom dataset has fewer than 1000 batches, the parameters may not have updated substantially by the time NMS needs to be run during testing of the epoch 0 results. I've added code to dynamically adapt the burn-in length, now capped at 1/5 of the number of batches in the first epoch. https://github.com/ultralytics/yolov3/blob/6fb14fc903fcea68239541a3de0fca5e6dc036e7/train.py#L90
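Here's a rough, self-contained sketch of that adaptive burn-in (illustrative variable names, batch count, and ramp exponent; not the exact train.py code):

import torch

model = torch.nn.Linear(10, 1)                       # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

base_lr = 0.001
nb = 250                                             # e.g. batches in a small custom dataset
n_burnin = min(round(nb / 5) + 1, 1000)              # adaptive burn-in length, capped at 1000

for i in range(nb):                                  # epoch 0 only
    if i <= n_burnin:
        lr = base_lr * (i / n_burnin) ** 4           # slow ramp from 0 up to base_lr
        for g in optimizer.param_groups:
            g['lr'] = lr
    # ... forward / backward / optimizer.step() as usual ...

print(f"burn-in finished after {n_burnin} of {nb} batches")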

I've also switched the default NMS method from MERGE (slightly more accurate, but possibly slightly slower), to OR, the default method used by darknet. https://github.com/ultralytics/yolov3/blob/6fb14fc903fcea68239541a3de0fca5e6dc036e7/utils/utils.py#L382
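For reference, here is an illustrative contrast of the two methods (a toy single-step example, not the utils.py implementation): OR simply keeps the highest-scoring box and drops overlapping ones, while MERGE additionally replaces the kept box with a confidence-weighted average of the boxes it suppresses.

import torch

def nms_step(boxes, scores, thres=0.3, method='OR'):
    # One suppression pass: boxes/scores are sorted by descending score,
    # box 0 is the current best. Returns the kept box and a survivor mask.
    xy1 = torch.max(boxes[0, :2], boxes[1:, :2])
    xy2 = torch.min(boxes[0, 2:], boxes[1:, 2:])
    inter = (xy2 - xy1).clamp(min=0).prod(dim=1)
    areas = (boxes[:, 2:] - boxes[:, :2]).prod(dim=1)
    iou = inter / (areas[0] + areas[1:] - inter)
    overlap = iou > thres
    kept = boxes[0].clone()
    if method == 'MERGE' and overlap.any():
        # weight the overlapping boxes (including box 0) by confidence and average them
        idx = torch.cat([torch.tensor([True]), overlap])
        w = scores[idx].unsqueeze(1)
        kept = (w * boxes[idx]).sum(0) / w.sum()
    return kept, ~overlap

boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],
                      [200., 200., 240., 240.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(nms_step(boxes, scores, method='OR'))     # keeps box 0 as-is, drops box 1
print(nms_step(boxes, scores, method='MERGE'))  # kept box is a weighted blend of boxes 0 and 1

MERGE does a little extra arithmetic per kept box, which is why OR can be marginally faster.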

Hopefully these changes help alleviate the NMS speed issue on custom datasets.

glenn-jocher commented 5 years ago

FYI, the latest commit may improve NMS speed a bit. We moved some of the operations from torch tensors to Python lists, which resulted in about a 20% NMS speedup. There doesn't seem to be much more optimization possible within PyTorch or Python, so we are going to close this issue. @wtiandong thank you for the insightful comments.
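For anyone curious, here is a small micro-benchmark (not from the repository, and numbers will vary by machine) illustrating the kind of overhead involved: many tiny per-element operations on torch tensors inside a Python loop pay per-call tensor-creation and indexing costs that plain Python lists avoid.

import time
import torch

n = 20000
scores_t = torch.rand(n)
scores_l = scores_t.tolist()

t0 = time.time()
kept = [i for i in range(n) if scores_t[i] > 0.5]   # per-element tensor indexing + comparison
t_tensor = time.time() - t0

t0 = time.time()
kept = [i for i in range(n) if scores_l[i] > 0.5]   # plain Python floats
t_list = time.time() - t0

print(f"per-element tensor access: {t_tensor:.3f} s, list access: {t_list:.3f} s")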

iadfy commented 5 years ago

@wtiandong, @glenn-jocher many thanks for the kind explanation. The feedback was really helpful. I really appreciate both of you. Thanks again.

glenn-jocher commented 11 months ago

@iadfy you're very welcome! 😊 I'm glad to hear the explanation was helpful. Thanks for your appreciation, but the credit truly goes to the YOLO community and the Ultralytics team for their collaborative efforts to improve the model. If you have any more questions or need further assistance, feel free to reach out. Good luck with your project!