ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Darknet-PyTorch reproducibility problems for custom dataset (incompatibility in mAP calculation) #969

Closed: pedbrgs closed this issue 4 years ago

pedbrgs commented 4 years ago

Hi, guys! Hi, Glenn! Firstly, I would like to congratulate you on the incredible work of constantly improving this repository.

Problem 1: Reproduce Darknet performance training in Pytorch

Well, I'm using AlexeyAB's Darknet repository in parallel with this one. I'm using the same hyperparameters and configuration files (.cfg, .data, train.txt, valid.txt) in both versions, but I can't reproduce the Darknet results in Ultralytics PyTorch with my own dataset (two classes). Yes, I followed the tutorial for custom data, my clone is up to date, and I also tried the evolve.sh routine.

However, when I train with an initial lr of 0.001, SGD with momentum 0.9, weight_decay 0.0005, burn-in of 1000 (the Darknet defaults) and the darknet53.conv.74 weights, the mAP stays at 0% for several epochs and then grows extremely slowly while oscillating. In addition, Darknet training takes about 24 hours (8,000 iterations), while training in this repository takes more than twice as long and still does not reach the same performance. My dataset has about 24,000 images.
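For what it's worth, Darknet's burn-in keeps the effective learning rate tiny for the first 1000 iterations, which by itself can make early-epoch mAP look stuck at 0%. A minimal sketch of the schedule, assuming Darknet's default power of 4 (the function name is mine):

```python
def burnin_lr(iteration, base_lr=0.001, burn_in=1000, power=4):
    """Effective learning rate under a Darknet-style polynomial burn-in."""
    if iteration < burn_in:
        # LR ramps up slowly: at iteration 500 this is 0.001 * 0.5**4 = 6.25e-05
        return base_lr * (iteration / burn_in) ** power
    return base_lr
```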

Problem 2: mAP@0.50 evaluation on Darknet differs from mAP@0.50 on PyTorch for the same weight

The main problem I encounter is in the calculation of the mAP metric, where there is no correspondence between the two repositories. When I convert a Darknet weights file (.weights) whose mAP@0.50 (101 points) is 63.39% to PyTorch (.pt) and evaluate it, I get 46.8%. The precision and recall also differ; only the F1-score seems consistent to me.

I know that for custom data the recommended way is to calculate mAP@0.5 with 0 points (i.e. the exact area under the curve), but I used the same number of points as this repository for comparison.

PascalVOC 2010-2012 and ImageNet use each unique point on the Precision-Recall curve, i.e. they calculate the Area Under Curve without approximation. Source: mAP (mean average precision) calculation for different Datasets (MSCOCO, ImageNet, PascalVOC) #2746
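To make the difference concrete, here is a minimal sketch (my own code, not taken from either repository) of the two AP flavours being compared: N-point interpolation versus the exact area under the monotone precision envelope:

```python
import numpy as np

def average_precision(recall, precision, points=101):
    """AP from a P-R curve: N-point interpolation (points > 0) or exact AUC (points=0)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # standard monotone envelope: precision never increases as recall grows
    p = np.flip(np.maximum.accumulate(np.flip(p)))
    if points:
        # sample the envelope at evenly spaced recall levels (0, 0.01, ..., 1 for 101 points)
        x = np.linspace(0.0, 1.0, points)
        return np.mean(np.interp(x, r, p))
    # exact area under the curve, "without approximation"
    i = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[i + 1] - r[i]) * p[i + 1])
```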

I thought the problem was the conversion, but when I convert Darknet to PyTorch and then PyTorch back to Darknet, I get the same mAP, with no degradation.

Conversion:

from models import *
convert('cfg/yolo.cfg', 'weights/best.pt')
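The round-trip check looks roughly like this (assuming, as in the README, that convert() dispatches on the file extension and writes converted.pt or converted.weights; the file names here are illustrative):

```python
from models import convert

convert('cfg/custom.cfg', 'weights/custom.weights')  # .weights -> converted.pt
convert('cfg/custom.cfg', 'converted.pt')            # .pt -> converted.weights
# evaluating either file with test.py should then give the same mAP
```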

I tried varying the values of all existing thresholds (conf_thres, iou_thres and pr_score) according to the curves below, but I was not able to find values that let me reach the same performance as Darknet.

pr_score analysis:

conf_thres analysis:

iou_thres analysis:

Can someone help me at least get the mAP calculations from the two repositories to match?

I'm sorry if it seems like I don't know YOLO's theory very well; I'm learning it now. Thank you very much for any help!

github-actions[bot] commented 4 years ago

Hello @pedbrgs, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

glenn-jocher commented 4 years ago

@pedbrgs hi there! Regarding mAP computation:

  1. Each repo will probably return slightly different results using the same exact model. The best official mAP computation is pycocotools, though this will only work on COCO data. pycocotools and this repo differ by about 1-2% in their mAP computations, and I would not be surprised if AlexeyAB's differs by a similar amount on the same exact model. In your chart you show 63% and 47% mAPs. Are these using the same exact model?

  2. Using this repo you can pass both .weights and .pt models to test.py, and they should return exactly identical mAPs if they are simple conversions of each other. For example, yolov3-spp.pt and yolov3-spp.weights give identical mAPs on COCO.

  3. In terms of model training, this will be completely different between the two repos, and I would not be surprised if your results vary substantially depending on your dataset and training settings, even for the same exact *.cfg files.

pedbrgs commented 4 years ago

@glenn-jocher Wow, thanks for the quick answer!

1. mAP computation for custom data

Yes, it is exactly because a slight difference between the mAPs is expected that I was in doubt. I took my weights file trained on Darknet, which reports 63.39%:

./darknet detector map custom.data custom.cfg custom.weights -points 101 -thresh 0.25 -iou_thresh 0.5

I converted it to PyTorch Ultralytics and got 46.8%:

python3 test.py --cfg custom.cfg --data custom.data --weights custom.weights --batch-size 1 --conf-thres 0.25 --iou-thres 0.5 --task test

I just wanted to check whether they would give the same value, since it is the same weights file, with the same confidence threshold (conf_thres), IoU threshold (iou_thres), batch size, configuration files (.cfg, .data) and validation set (valid.txt). As both compute mAP@0.5 with 101 points, why are the values different? That's my question. Shouldn't they be the same, since the model is the same (.cfg, .data and .weights)? Yes, I am using the same model in all the charts. They were an attempt to find the same 63.39%, since with the same thresholds (iou_thres = 0.5 and conf_thres = 0.25) the result was totally different.

2. Training with custom data

I see, so I cannot expect to get the same result with the same hyperparameters, since the implementations are different. But I can still achieve similar performance by training in Ultralytics PyTorch, just with other hyperparameters, right? Is that what you mean? The best I got was mAP@0.5 = 54.47% with very different hyperparameters: a learning rate 10 times lower than the Darknet default, the Adamax optimizer, etc. I'm not using YOLOv3-SPP; is it better?

Thank you so much again!

glenn-jocher commented 4 years ago

@pedbrgs a true mAP computation evaluates the area under the entire P-R curve, which essentially means you use --conf 0.0. To save time, and to approximate this effect, people occasionally use higher --conf thresholds, like 0.01 or 0.10, but 0.25 is far too high. This will only evaluate a small part of the P-R curve and give you erroneous mAP results. So to produce the most accurate mAP value, I recommend you do not change the default settings in this repo when running test.py.
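A toy illustration of why (hypothetical numbers, not from either repo): filtering at a high confidence threshold discards the low-confidence tail of the detections before the curve is integrated, capping the maximum reachable recall.

```python
import numpy as np

# toy detections as (confidence, is_true_positive), sorted by confidence
dets = [(0.9, 1), (0.8, 1), (0.6, 0), (0.4, 1), (0.2, 1), (0.1, 0)]
n_gt = 4  # number of ground-truth objects

def pr_curve(dets, n_gt, conf_thres):
    kept = [tp for conf, tp in dets if conf >= conf_thres]
    tp_cum = np.cumsum(kept)
    return tp_cum / n_gt, tp_cum / np.arange(1, len(kept) + 1)  # recall, precision

for t in (0.0, 0.25):
    recall, precision = pr_curve(dets, n_gt, t)
    print(f'conf_thres={t}: max recall = {recall[-1]:.2f}')
# conf_thres=0.0:  max recall = 1.00 (full curve)
# conf_thres=0.25: max recall = 0.75 (truncated curve, so AP is underestimated)
```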

For best results we recommend you leave everything alone and train with the default settings (including the default cfg and pretrained weights already set in train.py).

pedbrgs commented 4 years ago

@glenn-jocher

I think I confused the nomenclature used in the two repositories. What Darknet reports as conf_thresh is the pr_score of this repository, since changing it only affects precision and recall (it does not affect mAP). I made this confusion because, when running Darknet with the command line below:

./darknet detector map custom.data custom.cfg custom.weights -points 101 -thresh 0.25 -iou_thresh 0.5

I receive as output:

for conf_thresh = 0.25, precision = 0.67, recall = 0.60, F1-score = 0.63
for conf_thresh = 0.25, TP = 946, FP = 467, FN = 640, average IoU = 48.74 %
IoU threshold = 50 %, used 101 Recall-points
mean average precision (mAP@0.50) = 0.634394, or 63.44 %
Total Detection Time: 24 Seconds

The conf_thres here is called thresh in the Darknet repository, where it is set to 0.005 for the mAP calculation. It makes more sense now, right?

glenn-jocher commented 4 years ago

@pedbrgs ah yes that makes more sense then.

In that case, whatever mAP you see when you run test.py with default settings should be correct here to within 1-2%, as we have validated our mAP against pycocotools using COCO results.

The precision, recall and F1 do not matter; they are not fixed properties of your model. Instead, they are tunable (using pr_score) to whatever precision or recall you want, depending on your application.

Only mAP is an unchangeable property of your model.
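Put differently, P, R and F1 are single points read off the P-R curve at a chosen operating score, while mAP integrates the whole curve. A sketch with hypothetical arrays (the function name is mine):

```python
import numpy as np

def metrics_at(recall, precision, scores, pr_score=0.25):
    """Read P, R and F1 off the P-R curve at the operating point nearest pr_score."""
    i = np.argmin(np.abs(scores - pr_score))  # scores = detection confidences per point
    p, r = precision[i], recall[i]
    f1 = 2 * p * r / (p + r + 1e-16)  # epsilon guards against division by zero
    return p, r, f1
```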

pedbrgs commented 4 years ago

@glenn-jocher Now I understand. Thank you! The threshold names really confused me a lot.

I finally got mAP@0.5 close to 61% (though with slightly different thresholds). Can you tell me which non-maximum suppression method is used in the Darknet repository? In this repository there are several methods and I don't know which one to use: vision or merge.
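For reference, a rough sketch of what the two options usually mean here ('vision' = standard hard NMS via torchvision, 'merge' = additionally replacing each kept box with a score-weighted average of its overlaps); this is my approximation, not the repo's exact code:

```python
import torch
import torchvision

def nms_vision(boxes, scores, iou_thres=0.5):
    """Hard NMS: keep the highest-scoring box, suppress overlaps above iou_thres."""
    keep = torchvision.ops.nms(boxes, scores, iou_thres)
    return boxes[keep], scores[keep]

def nms_merge(boxes, scores, iou_thres=0.5):
    """Merge NMS: each kept box becomes a score-weighted average of its overlaps."""
    keep = torchvision.ops.nms(boxes, scores, iou_thres)
    iou = torchvision.ops.box_iou(boxes[keep], boxes)   # (n_keep, n_all)
    weights = (iou > iou_thres).float() * scores[None]  # overlap mask * confidence
    merged = (weights @ boxes) / weights.sum(1, keepdim=True)
    return merged, scores[keep]
```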

glenn-jocher commented 4 years ago

@pedbrgs the default settings produce the COCO mAPs shown in the README here. I'm not sure about darknet.

pedbrgs commented 4 years ago

Thanks, Glenn!

glenn-jocher commented 11 months ago

@pedbrgs you're welcome!