ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

training COCO from scratch guideline #205

Closed ahmedtalbi closed 5 years ago

ahmedtalbi commented 5 years ago

Dear Glenn,

Again thank you for this awesome work.

I started training two days ago, loading the pretrained Darknet53.conv.74 weights, and have trained for 20 epochs without any issue so far. I used the default options with a batch size of 32 and neither multi-scale nor multi-GPU. So far I have reached 37.5% mAP, as you can see in the attached training plot.

I would like to know if you have any advice regarding the training, and whether any parameters need to be set in a certain way (LR, optimizer, multi-scale, data augmentation, ...).

Thanks

Kev1nZheng commented 5 years ago

@ahmedtalbi I used the default options, but with multi-GPU, to train from scratch. I trained for 20+ epochs but didn't get an mAP like the one you posted. Did you just use python train.py to train it? (screenshot attached)

ahmedtalbi commented 5 years ago

@Kev1nZheng Hi, try to see if the training works without multi-GPU. As far as I know, even a slight change in the code (or even in the batch size) can produce different results, so try the same configuration and batch size, run it for one epoch, and see.

And make sure you are loading the Darknet53 weights

glenn-jocher commented 5 years ago

@Kev1nZheng multi-gpu training is operating correctly. I just tested it on a GCP VM with two P4 GPUs by running our coco_100img.data tutorial.

Single and multi-gpu training results are identical. Strongly recommend you git clone a clean copy of this repository and train first on a small dataset like coco_100img.data before training on the full coco dataset.

```
python3 train.py --data data/coco_100img.data --nosave
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-spp.cfg', data_cfg='data/coco_100img.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=True, num_workers=4, rank=0, resume=False, transfer=False, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='Tesla P4', total_memory=7611MB)
           device1 _CudaDeviceProperties(name='Tesla P4', total_memory=7611MB)
...
```

(results plot attached)

ahmedtalbi commented 5 years ago

Hello Glenn,

What about the multi-scale flag? Did you test it? Did you train with it from the first epoch, or did you turn it on later?

glenn-jocher commented 5 years ago

@ahmedtalbi yes, sorry, I meant to answer your question; it is just a very involved question at the moment. I've commented on this to various people in various issues, and I should really open a dedicated issue for this topic and close the rest as duplicates. I'll do that today and point you to it, but in short we have not been able to train full COCO to 273 epochs (as darknet does) due to time and GPU constraints. The loss constants are tuned to those values (4, 8, 1, 64, etc.) because hyperparameter searches produced better results at those settings than darknet's default loss constants of 1.0 across the board. CE is also swapped for BCE in the cls loss because it produced better results, and lastly there is a 0.10 IOU rejection during training in utils.py, another hyperparameter we set. I will elaborate on all of these in a separate post and link you to it in a few hours.
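
For illustration, here is a minimal sketch of the CE-vs-BCE swap for the cls term (the tensor names, shapes, and values are placeholders for this example, not our actual loss code):

```python
import torch
import torch.nn as nn

# pred_cls: raw class logits for matched anchors, shape (n_targets, n_classes)
# tcls:     integer class index per target, shape (n_targets,)
pred_cls = torch.randn(8, 80)
tcls = torch.randint(0, 80, (8,))

# darknet-style single-label classification loss: softmax cross-entropy
ce = nn.CrossEntropyLoss()(pred_cls, tcls)

# BCE variant: independent sigmoid per class on a one-hot target
t = torch.zeros_like(pred_cls)
t[torch.arange(len(tcls)), tcls] = 1.0
bce = nn.BCEWithLogitsLoss()(pred_cls, t)
```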

Like I told @Kev1nZheng I strongly suggest you play around with the tutorial datasets first (i.e. coco_100img.data, coco_1000img.data) before committing yourself to full coco training.

glenn-jocher commented 5 years ago

@Kev1nZheng I just remembered, we fixed a bug that was introduced last week causing zero mAP on coco, similar to your results. If you started training a few days ago you are probably training with this bug in the mix, our apologies. See https://github.com/ultralytics/yolov3/issues/197.

The bug is fixed now, if you git pull, or git clone a clean copy your multigpu results should be fine, though again I still recommend you train a few of the tutorials first: https://docs.ultralytics.com/yolov5/tutorials/train_custom_data

glenn-jocher commented 5 years ago

@ahmedtalbi can you post your results.png image? Your results look pretty good. My main worry is the training gains flattening out past epoch 50. Generally, aggressive results at the beginning of training come at the expense of the final trained model. A more gradual training will not produce the best results early on, but in general it does produce better results later on, if you are patient enough to wait. It really depends on the application and the user's priorities. I would say this repository is currently tuned aggressively to produce the best initial response (i.e. in the first 30 epochs), at an unknown cost later on.

The command to plot your results is simply: python3 -c "from utils import utils; utils.plot_results()"

Kev1nZheng commented 5 years ago

@glenn-jocher Thanks for your reply. I git cloned your repo yesterday and trained on coco_1000img.data, but I used the first 500 images in val2014 as test data, because I think testing on the same images used for training can cause problems like overfitting. Here is my result: (results image attached) I didn't change any settings like LR, optimizer, etc., just python train.py. Do you think I should try a smaller LR?

glenn-jocher commented 5 years ago

@Kev1nZheng ah, your results look good. Beware though the LR scheduler reduces LR at epochs 218, 245, i.e. batches 400k, 450k in darknet (from 1e-3 to 1e-4, 1e-5), so training past 273 will not change anything since the LR is barely nonzero. In your case it seems the best test loss was achieved right after the first drop, at epoch 218.
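
For reference, that schedule could be reproduced with a stock PyTorch scheduler along these lines (the model and optimizer here are placeholders, not the repo's actual code):

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# 10x LR drops at epochs 218 and 245 (~400k and 450k darknet batches): 1e-3 -> 1e-4 -> 1e-5
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[218, 245], gamma=0.1)

for epoch in range(273):
    # train_one_epoch(model, optimizer)  # training loop omitted
    scheduler.step()
```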

Like I told @ahmedtalbi, there are a variety of hyperparameters you could try tuning, such as the loss constants, the loss criteria, the LR, LR schedule, image augmentation, IOU rejection constant in build_targets(), etc. We've set the current ones based on previous searches, but they could use more tuning. The reason we started the searches was because the default darknet loss function was not performing well in pytorch unfortunately. I will be starting a dedicated issue to track these hyperparameters soon hopefully.

glenn-jocher commented 5 years ago

@ahmedtalbi @Kev1nZheng guys also beware you can use the --nosave flag with train.py to speed things up. Saving latest.pt and best.pt takes a few seconds each epoch, which can make a big difference on smaller datasets like coco_100img.data etc.

Kev1nZheng commented 5 years ago

@glenn-jocher I will use --nosave next time. The best mAP I got is only 0.0948. Is this a reasonable result for coco_1000img (train on 1000 images, test on 500 images from val2014)? I know that using the same data to train and test can give a really high mAP, but using different data is a more rigorous setup.

ahmedtalbi commented 5 years ago

@glenn-jocher thank you for your answer. You were right, it does stagnate past epoch 50. I will play around with the hyperparameters. My goal is to achieve the best mAP possible in order to start other experiments, so any help would be amazing.

Here is the results.jpg: (plot attached)

glenn-jocher commented 5 years ago

@ahmedtalbi oh your results are excellent. There is no stagnation, all the metrics continue to improve. Remember darknet trains to 273 epochs (and includes LR drops past epoch 200 and multiscale), so you are looking at results with none of those extras trained in 20% of the time...

Also remember that train.py requests testing from test.py at --conf_thres 0.1 at each epoch, which results in a lower mAP (but is much faster). If you run python3 test.py --weights weights/latest.pt it will use the default conf of 0.001, which will give you the true higher mAP.

What happened to your test loss??

ahmedtalbi commented 5 years ago

Hi @glenn-jocher, I am really happy to hear that. Regarding the loss, I am not quite sure; I just used the plot command as you told me to. Nevertheless, looking at results.txt, the test loss is also going down. (training plots attached)

glenn-jocher commented 5 years ago

@ahmedtalbi oh haha, that's my fault actually, I put code in to clip the max plotted values at 500. If you git pull and replot then you should see the test loss now, my apologies.

Also beware the first 10 lines in your results.txt or so are showing old runs, so you probably want to delete those rows from your results.txt file before plotting.
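
For example, a quick one-off snippet like this would trim the stale rows before replotting (the 10-line cutoff is just an example; adjust it to your file):

```python
# drop the first ~10 rows of results.txt left over from old runs
with open('results.txt') as f:
    lines = f.readlines()
with open('results.txt', 'w') as f:
    f.writelines(lines[10:])
```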

ahmedtalbi commented 5 years ago

@glenn-jocher yes :)) I see it now, no worries. (results plot attached) I will let it train for a few more epochs and keep you updated. When do you think would be the right time to start multi-scale training? It is actually a crucial point for my future experiments.

glenn-jocher commented 5 years ago

@ahmedtalbi the best thing would be to read all the YOLO papers by PJ Reddie, which talk a bit about your questions. It's possible some of the results include 3rd-party training data in addition to COCO, I don't remember. Multi-scale is usually applied later in training, but again I don't recall exactly when. Try reading the YOLO and YOLOv2 papers. https://pjreddie.com/publications/

glenn-jocher commented 5 years ago

@Kev1nZheng @ahmedtalbi I recreated your training setup, using coco_1000img.txt for training and a new coco_500val.txt for testing, at 320 and 416 resolutions.

I'm searching for a training scenario that can be iterated quickly for genetic hyperparameter tuning, in which we train fully to 273 epochs and only test once at the end, using the final mAP as the fitness metric. Since we only test once for this tuning, we can make the test population quite large, possibly even the entire 5000 images, without much of a time penalty.

I've already been doing this on the much smaller coco_100img.data, but as @Kev1nZheng mentioned, that dataset trains and tests on the same data, so I might be in danger of tuning hyperparameters that overfit at the expense of generalization. You can see the currently genetically evolved hyperparameters from that search in the latest commit: https://github.com/ultralytics/yolov3/blob/f5d343b9a6f6225e5b2b26a064555c4789f12bf4/utils/utils.py#L260-L274
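
The evolution loop itself is conceptually simple; roughly something like this (train_and_test here is a stand-in for a full 273-epoch train followed by one test pass, not an actual function in the repo, and the starting values are illustrative):

```python
import random

# starting point: current hyperparameters (illustrative values; see utils.py for the real ones)
hyp = {'lr0': 1e-3, 'momentum': 0.9, 'weight_decay': 5e-4, 'iou_t': 0.10}

def train_and_test(candidate):
    # stand-in: train to 273 epochs with these hyps, test once, return final mAP
    return random.random()  # replace with a real train + test run

best_hyp, best_fitness = dict(hyp), 0.0
for generation in range(20):
    # mutate each gene by up to +/-20% around the current best
    candidate = {k: v * random.uniform(0.8, 1.2) for k, v in best_hyp.items()}
    fitness = train_and_test(candidate)  # final mAP is the fitness metric
    if fitness > best_fitness:
        best_hyp, best_fitness = dict(candidate), fitness
```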

(results plot attached)

HamsterHuey commented 5 years ago

@glenn-jocher - I've been trying to train Yolo V2 using a separate modified repo on COCO and have also been having similar issues in that my final mAP is ~ 10-15% lower than what I can achieve with the native darknet weights (for COCO) loaded into Pytorch. I've posted my investigations on the issue here: https://gitlab.com/EAVISE/lightnet/issues/17#note_183584645

The author of that repo has verified training to identical mAP for VOC (which I have not independently verified) on Yolo V2. It is surprising that COCO training is so far off the mark though.

Btw, have you noticed this preprocessing on ground-truth bboxes in darknet?

https://github.com/pjreddie/darknet/blob/f86901f6177dfc6116360a13cc06ab680e0c86b0/src/data.c#L475

pjreddie appears to ignore bboxes that are extremely small (from what I can tell, those whose width or height is less than 0.1% of the corresponding source image dimension). Unfortunately, implementing that didn't really make a big difference, though I did note that several hundred annotations in COCO fit that criterion.
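
For anyone who wants to try the same filter, here is a rough sketch on normalized xywh labels (my reading of the darknet behaviour, not a verified port):

```python
import numpy as np

def drop_tiny_boxes(labels, min_frac=0.001):
    """labels: (n, 5) array of [class, x, y, w, h], with w and h normalized to the image size.
    Drops boxes whose width or height is below 0.1% of the corresponding image dimension."""
    w, h = labels[:, 3], labels[:, 4]
    return labels[(w >= min_frac) & (h >= min_frac)]

labels = np.array([[0, 0.5, 0.5, 0.20, 0.30],
                   [1, 0.5, 0.5, 0.0005, 0.30]])  # second box is too thin
print(drop_tiny_boxes(labels))  # keeps only the first row
```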

There might be a chance that the darknet formulation for SGD with momentum + weight decay is different from how Pytorch implements it, which could also change the training dynamics and make identical values of momentum and weight-decay have different impacts on training. Unfortunately, I've had a hard time making sense of the SGD formulation in Darknet as it relies on a bunch of BLAS axpy style primitives and things get confusing pretty quickly.

I'll be sure to provide updates if I make any progress on this on my end. Hopefully you are able to sort out the training-from-scratch issue.

glenn-jocher commented 5 years ago

@HamsterHuey yes, this training mismatch is a never-ending mystery for me. Inference and testing using detect.py and test.py perform very well, pretty much exactly as expected, but training does not.

The minimum box size is something everyone uses though, I don't believe it's related. Google's automl has a minimum 8x8 pixel box size requirement, and in this repo we have several candidate criteria a box has to meet in order to be used, including >4 pixel width and height requirement: https://github.com/ultralytics/yolov3/blob/5f6c2b3d1280f3b23021b11307e1ddec139a93b8/utils/datasets.py#L420-L430

The strangest part of it is that for this repo's loss function to perform optimally it requires some quite extreme hyperparameter settings between the various components of the loss function, something which never seems to be mentioned in darknet. Our lack of resources is also hindering our work, as even on a V100 it takes a full week to fully train on COCO (to 273 epochs). https://github.com/ultralytics/yolov3/blob/5f6c2b3d1280f3b23021b11307e1ddec139a93b8/train.py#L14-L25

HamsterHuey commented 5 years ago

Have you tried scaling the learning rate by dividing it by your batch size (not mini-batch size) and multiplying your weight decay by the batch size? This seems to be the norm in darknet. I haven't had much better luck with that, but I figured I'd mention it.

edit - Reference these 2 lines in darknet: https://github.com/pjreddie/darknet/blob/f6d861736038da22c9eb0739dca84003c5a5e275/src/convolutional_kernels.cu#L313
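
Concretely, something along these lines when building the optimizer (a sketch with placeholder model and values; "batch" here means darknet's full batch, not the subdivision/mini-batch):

```python
import torch

model = torch.nn.Linear(10, 10)   # placeholder model
lr_cfg, decay_cfg = 1e-3, 5e-4    # values as written in the .cfg
batch = 64                        # darknet "batch" (not the mini-batch/subdivision)

optimizer = torch.optim.SGD(model.parameters(),
                            lr=lr_cfg / batch,               # darknet divides LR by batch
                            momentum=0.9,
                            weight_decay=decay_cfg * batch)  # and scales decay by batch
```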

HamsterHuey commented 5 years ago

@glenn-jocher - Another thing I noticed that may cause differences between darknet and pytorch implementations is that it appears that in Darknet, weight decay is not applied to BatchNorm layers:

https://github.com/pjreddie/darknet/blob/f86901f6177dfc6116360a13cc06ab680e0c86b0/src/batchnorm_layer.c#L157

However, in PyTorch, the default is to apply weight decay to all weights and biases, including those of BN layers. I believe you can use the named_parameters functionality to filter out the BN layer weights and biases so that WD is not applied to them in the optimizer. Not sure how much of an impact that will have on the training comparison between Darknet and this repo.

glenn-jocher commented 5 years ago

@HamsterHuey yes you are absolutely correct about the weight decay not being applicable to BN layers. Do you have pytorch code you've used in the past to sort this out?

HamsterHuey commented 5 years ago

@glenn-jocher - I haven't yet experimented with this myself, but I think these 2 links will be useful and have implementations that would achieve what would be needed. I haven't seen one using named_parameters as referred to by soumith from FB, in the first link below, but I can see how one could use that to achieve something similar.

https://github.com/pytorch/pytorch/issues/1402

https://discuss.pytorch.org/t/weight-decay-in-the-optimizers-is-a-bad-idea-especially-with-batchnorm/16994/3
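
Roughly, an untested sketch of that named_parameters approach on a toy module (not this repo's model code):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.LeakyReLU(0.1))

decay, no_decay = [], []
for name, p in model.named_parameters():
    # 1-D parameters are BatchNorm weights/biases and conv/linear biases: no weight decay
    (no_decay if p.dim() == 1 else decay).append(p)

optimizer = torch.optim.SGD(
    [{'params': decay, 'weight_decay': 5e-4},
     {'params': no_decay, 'weight_decay': 0.0}],
    lr=1e-3, momentum=0.9)
```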

I'm a bit caught up with other things (and also currently have another COCO train run ongoing with YoloV2) but I do plan to revisit with trying to disable weight-decay for Batchnorm layers to see what impact that has. I'll be curious to see if it helps in any way with your repo.

Edit - This link may also be useful:

https://github.com/pytorch/pytorch/issues/15067

sanazss commented 5 years ago

I am training the model on the COCO dataset, but I get an error when I run training and testing. Training stops after one epoch, and I get an index error for the number of batches (nb): index -1 is out of bounds for axis 0 with size 0, at nb = bi[-1] + 1. Could you please let me know what this problem is and how to fix it?

ahmedtalbi commented 5 years ago

Hello guys, I have been experimenting with knowledge distillation methods using this repo. One thing we noticed is that filtering out boxes with objectness less than 0.5 results in an improvement. Since the objectness filtering and the one-anchor-per-box assignment are not yet implemented in this repo, implementing them might recover the last few mAP points relative to the original code. I am planning to implement it myself but have been postponing it for two months already. If someone needs to achieve the same results as the original code, I think this is an important part.
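
To be clear about what I mean by the objectness filter, here is a rough illustration on a generic (n, 85) prediction tensor (illustrative only, with random values; not this repo's NMS code):

```python
import torch

# preds: (n, 85) rows of [x, y, w, h, objectness, 80 class scores], post-sigmoid
preds = torch.rand(100, 85)

# keep only boxes whose objectness clears 0.5 before any further processing
filtered = preds[preds[:, 4] > 0.5]
```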

sanazss commented 5 years ago

I have a question about test.py. Why should we include its results during training, when testing usually happens after training? Also, looking at coco.data, it uses similar images for both. I split my data into train and test sets. I am running the repo on my dataset, but each time I hit an error; is it because of the data split? Removing the test.test call from training causes this error. I appreciate your comments.

glenn-jocher commented 5 years ago

@sanazss test.py runs after each epoch to determine mAP on the test set. You can run python3 train.py --notest to avoid testing during training.

glenn-jocher commented 5 years ago

@sanazss use the Google Colab example to get started. Run a working training example first and then go from there. If the batch is empty your images are not being found.

ktian08 commented 5 years ago

@ahmedtalbi Could you clarify what you mean by the one-anchor box? Is this in the YOLOv3 paper?

glenn-jocher commented 5 years ago

Closing as duplicate; this conversation has moved to https://github.com/ultralytics/yolov3/issues/310.