@fatihbaltaci train longer. Darknet trains to 270 epochs. All the tutorials use detect.py with latest.pt to plot the results.
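For reference, a typical invocation along those lines might look like this (a sketch only; the exact flag names are an assumption about the repo's CLI at the time and may differ):

```bash
python3 detect.py --cfg cfg/yolov3.cfg --weights weights/latest.pt
```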
@glenn-jocher So, is it normal that mAP, P, and R are exactly 0 after 13 epochs? Before multi-GPU support, with a single GPU I got 0.361 mAP after 3 epochs.
@fatihbaltaci you need to use the latest commit. Your screenshot is from a very out of date version of this repo.
@fatihbaltaci this appears connected to the new dataloader. See #141. The current workaround is to set `num_workers` to 0 until we can debug further.
@fatihbaltaci this may be fixed now, though I've set `num_workers=0` as the default out of caution.
@fatihbaltaci I'm facing a similar situation: while training on my own dataset, recall and mAP are still zero after 100 epochs. However, when I tested another YOLOv3 project implemented in TensorFlow, 20 epochs were enough to get decent results. I suspect the optimizer may be the reason for the slow convergence, so I am now experimenting with the optimizer hyperparameters, such as switching to Adam and adding `weight_decay`. I will post my results if I find a way to converge faster.
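As a rough sketch of the optimizer change being described (the stand-in model and the hyperparameter values below are illustrative, not the repo's actual defaults):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the YOLOv3 model

# Swap the default SGD optimizer for Adam with weight decay;
# lr and weight_decay values here are illustrative guesses.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
```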
@glenn-jocher I tried with the latest commit (943db40f1ad600cdcdc40ff06588cd9c9bf2523f) with `num_workers=4`, and it again gives zero mAP. I also tried with `num_workers=0`, and it gives zero mAP as well.
@perry0418 I am training YOLOv3 with the COCO2014 dataset. I don't think slow convergence is the cause of the zero mAP. Did you try with `num_workers=0`?
@glenn-jocher I tried on a single GPU and there is no problem; mAP is not zero. I switched from multi-GPU to single-GPU with `export CUDA_VISIBLE_DEVICES=1`.
@fatihbaltaci that's so strange. Yes, that is the latest commit. I ran two epochs on a GCP instance to test the commit right before this one and got the results below. I used `num_workers=4` for train.py but `num_workers=0` in test.py. I saw normal results with `num_workers=4` using test.py with yolov3.weights, so I assumed it was working fine independently as well, hence the switch to set both defaults to 4 workers. My test was single-GPU.
@perry0418 The expected mAP should be around 0.15 to 0.25 for the first couple of epochs. It is this high because we start from the darknet53 backbone, not from scratch. See the README for training plots.
```
0/269  7328/7328  0.217  0.744  8.78  2.02  11.8  109  0.391  0.187  0.205  0.186
1/269  7328/7328  0.191  0.489  2.67  1.27  4.61  103  0.393  0.267  0.281  0.26
...
```
@glenn-jocher With a single GPU there is no problem. Did you try training with multi-GPU?
@fatihbaltaci everything works as intended as long as you limit yourself to 1 GPU. I'm traveling today, so I won't be able to debug further until tomorrow. The issue probably has to do with the padding/collating of the targets (the DataLoader requires all 16 target tensors in a batch to have the same shape). I suspect it's possible the targets are not arriving in the correct order under multi-GPU with multiple workers.
For now 1 GPU + 4 workers is the fastest correct implementation (the current commit default).
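For context, here is a minimal sketch of the target padding/collation described above, assuming each sample yields an image tensor and a variable-length [n_boxes, 5] target tensor; the function and target layout are illustrative, not the repo's actual code:

```python
import torch

def collate_fn(batch):
    # Pad variable-length target tensors to a common length so the
    # DataLoader's batching constraint (equal shapes) is satisfied.
    imgs, targets = zip(*batch)  # each target: [n_boxes, 5]
    max_n = max(t.shape[0] for t in targets)
    padded = torch.zeros(len(targets), max_n, 5)
    for i, t in enumerate(targets):
        padded[i, :t.shape[0]] = t
    return torch.stack(imgs, 0), padded
```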
I'm debugging this, but it's pretty difficult because we don't have multi-GPU machines on premise; we use GCP in this case. I created a https://github.com/ultralytics/yolov3/tree/multi_gpu branch with updates and tried adding a custom collate function to the dataloader, which works great but did not solve the problem. I also saved the images and targets for the first few batches, and the multi-GPU overlays appear identical to the single-GPU overlays. Additionally, the multi- and single-GPU losses seem to track each other well, though multi-GPU `--resume` shows drastically different losses than single-GPU resume. test.py continues to operate correctly with yolov3.weights under all GPU/worker combinations, so the issue appears isolated to training rather than testing.
Here are the first 3 batches of overlays using 2 GPUs on GCP. The targets appear to line up correctly on the images, which rules out my first idea that there was a target-ordering mismatch. I remain unsure where the problem stems from, though it is isolated to multi-GPU training. (Overlay images: BATCH0, BATCH1, BATCH2.)
@glenn-jocher @fatihbaltaci I'm glad to tell you that my latest debugging has solved this problem. It was caused by the wrong input setup for the dataloader: when using multiple GPUs and processes, the official PyTorch example recommends torch.nn.parallel.DistributedDataParallel together with torch.utils.data.distributed.DistributedSampler (https://github.com/pytorch/examples/blob/master/imagenet/main.py). Below are my test results with `num_workers=4`, 2 GPUs, and resume training. I will submit a pull request later.
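For readers, a minimal sketch of the setup being described, following the linked ImageNet example; the dataset, model, and address below are placeholders, and in practice one process per GPU is launched with its own rank:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Join the process group (placeholder address; rank and world_size
# normally come from the launcher, one process per GPU).
dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456',
                        world_size=1, rank=0)

dataset = TensorDataset(torch.randn(64, 3, 416, 416))  # stand-in for COCO
sampler = DistributedSampler(dataset)  # shards samples across processes
loader = DataLoader(dataset, batch_size=16, num_workers=4, sampler=sampler)

model = nn.Conv2d(3, 16, 3).cuda()  # stand-in for Darknet
model = nn.parallel.DistributedDataParallel(model, device_ids=[0])
```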
@fatihbaltaci @glenn-jocher here is my modified repository, which solves the problem mentioned above: https://github.com/perry0418/yolov3
@perry0418 thanks for the PR! I merged it, but I get an error on the merge, and on your local repo as well:
```
rm -rf yolov3 && git clone https://github.com/perry0418/yolov3
cd yolov3 && python3 train.py

Namespace(accumulate=1, batch_size=16, cfg='cfg/yolov3.cfg', data_cfg='cfg/coco.data', dist_backend='nccl', dist_url='tcp://224.66.41.62:23456', epochs=270, img_size=416, multi_scale=False, num_workers=4, rank=-1, resume=False, world_size=-1)
Found 2 GPUs
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)

Traceback (most recent call last):
  File "train.py", line 233, in <module>
    num_workers=opt.num_workers
  File "train.py", line 65, in train
    dist.init_process_group(backend=opt.dist_backend, init_method=opt.dist_url, world_size=opt.world_size, rank=opt.rank)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 88, in _tcp_rendezvous_handler
    raise _error("rank parameter missing")
ValueError: Error initializing torch.distributed using tcp:// rendezvous: rank parameter missing
```
Please set the TCP IP to your machine's own IP and set rank=0.
Thanks. What should `world_size` be set to? Also, do you know how `world_size` relates to `num_workers`?
@perry0418 Thanks for your support. The problem is solved.
```
0/269  3658/3664  0.225  0.847  12.1  2.26  15.4  202  0.659
0/269  3659/3664  0.225  0.847  12.1  2.26  15.4  164  0.628
0/269  3660/3664  0.225  0.847  12.1  2.26  15.4  259  0.504
0/269  3661/3664  0.225  0.847  12.1  2.26  15.4  226  0.646
0/269  3662/3664  0.225  0.847  12    2.26  15.4  231  0.666
0/269  3663/3664  0.225  0.846  12    2.26  15.4  266  0.654
0/269  3664/3664  0.225  0.846  12    2.26  15.4  110  4.03
```

```
Found 2 GPUs
Using cuda _CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11178MB, multi_processor_count=28)
```

```
Image  Total      P      R    mAP
   32   5000  0.149  0.131  0.112  1.73s
   64   5000  0.163  0.157  0.143  0.358s
   96   5000  0.169  0.181  0.168  0.378s
  128   5000  0.162  0.174  0.161  0.345s
  160   5000  0.158  0.169  0.157  0.367s
  192   5000  0.152  0.161  0.148  0.355s
  224   5000  0.152  0.159  0.146  0.382s
  256   5000  0.149  0.157  0.143  0.383s
  288   5000  0.153  0.163  0.148  0.329s
  320   5000  0.148  0.157  0.143  0.356s
  352   5000  0.144  0.152  0.138  0.386s
  384   5000  0.146  0.154  0.14   0.349s
  416   5000  0.145  0.156  0.142  0.401s
```
@glenn-jocher here you can find all you need. https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
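For what it's worth: per those docs, `world_size` is the total number of processes participating in the job (e.g. 2 for one process per GPU on a 2-GPU machine) and `rank` is each process's index; `num_workers` is unrelated, as it only controls each process's dataloader subprocesses. A hypothetical invocation, assuming the flag spellings match the `Namespace` fields in the traceback above:

```bash
python3 train.py --dist-url tcp://<your-machine-ip>:23456 --rank 0 --world-size 1
```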
Why, even though I have updated to the newest version of the code, is the mAP still 0 after several epochs? Even when I use the yolov3.weights model to test, the situation doesn't improve. I am using multi-GPU and the newest code.
I am sorry, this is my fault. Thanks a lot :->
@perry0418 thanks a lot for the PR!! It looks like the issue is resolved now, so I am happily closing the issue.
@ygliang2009 Getting the newest version of this repository may work for you now. It works on my machine.
Hi! How did you resolve this issue? I am having the same issue as well.
Many thanks in advance.
@oykusahin hi there! It's great to hear that you found a resolution to the issue! The YOLO community and the Ultralytics team have put in a lot of effort to address such challenges. If you encounter any further issues, don't hesitate to reach out. Keep up the great work!
@glenn-jocher Hi, when I run test.py with latest.pt or best.pt, mAP, precision, and recall become zero. But there is no problem with the official YOLO weights (yolov3.weights).
My training command is `python train.py --img-size=608`.
Is there any problem with model saving?