ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

CUSTOM TRAINING EXAMPLE (OLD) #192

Closed glenn-jocher closed 3 years ago

glenn-jocher commented 5 years ago

This guide explains how to train your own custom dataset with YOLOv3.

Before You Start

Clone this repo, download COCO dataset, and install requirements.txt dependencies, including Python>=3.7 and PyTorch>=1.4.

git clone https://github.com/ultralytics/yolov3
bash yolov3/data/get_coco2017.sh  # 19GB
cd yolov3
pip install -U -r requirements.txt

Train On Custom Data

1. Label your data in Darknet format. After using a tool like Labelbox to label your images, you'll need to export your data to darknet format. Your data should follow the example created by get_coco2017.sh, with images and labels in separate parallel folders, and one label file per image (if no objects in image, no label file is required). The label file specifications are:

Each image's label file must be locatable by simply replacing /images/*.jpg with /labels/*.txt in its pathname. An example image and label pair would be:

../coco/images/train2017/000000109622.jpg  # image
../coco/labels/train2017/000000109622.txt  # label

An example label file with 5 persons (all class 0):

Screen Shot 2020-04-01 at 11 44 26 AM
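As a rough illustration of the label format, the sketch below converts a pixel-space box into one label row. It assumes each row is `class x_center y_center width height`, normalized to 0-1 by the image width and height (the usual Darknet convention), and that the source box is given as its top-left corner plus width and height in pixels. The function name `to_darknet_row` is hypothetical and not part of this repo.

```python
def to_darknet_row(cls, x_min, y_min, box_w, box_h, img_w, img_h):
    """Convert a pixel box (top-left x, y, width, height) to a Darknet label row."""
    # Box centre and size, each divided by the matching image dimension.
    x_center = (x_min + box_w / 2) / img_w
    y_center = (y_min + box_h / 2) / img_h
    w = box_w / img_w
    h = box_h / img_h
    if not all(0.0 <= v <= 1.0 for v in (x_center, y_center, w, h)):
        raise ValueError("box falls outside the image")
    return f"{cls} {x_center:.6f} {y_center:.6f} {w:.6f} {h:.6f}"


if __name__ == "__main__":
    # A 20x60 px box whose top-left corner is at (38, 75) in a 640x480 image.
    print(to_darknet_row(0, 38, 75, 20, 60, img_w=640, img_h=480))
```

If your annotation tool exports corner coordinates (x_min, y_min, x_max, y_max) instead, compute the width and height first and pass those in.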

2. Create train and test *.txt files. Here we create data/coco16.txt, which contains the first 16 images of the COCO2017 dataset. We will use this small dataset for both training and testing. Each row contains a path to an image, and remember one label must also exist in a corresponding /labels folder for each image containing objects.

Screen Shot 2020-04-01 at 11 47 28 AM
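One possible way to generate such a list file is sketched below. The helper `write_image_list` and its parameters are hypothetical (they are not part of this repo); it simply writes one image path per line, sorted by name, optionally keeping only the first few.

```python
from pathlib import Path


def write_image_list(image_dir, list_path, limit=None,
                     suffixes=(".jpg", ".jpeg", ".png")):
    """Write one image path per line to list_path and return the paths written."""
    images = sorted(p for p in Path(image_dir).iterdir()
                    if p.suffix.lower() in suffixes)
    if limit is not None:
        images = images[:limit]
    lines = [p.as_posix() for p in images]
    Path(list_path).write_text("\n".join(lines) + ("\n" if lines else ""))
    return lines
```

For example, `write_image_list("../coco/images/train2017", "data/coco16.txt", limit=16)` would list the first 16 images of that folder.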

3. Create new *.names file listing the class names in our dataset. Here we use the existing data/coco.names file. Classes are zero indexed, so person is class 0, bicycle is class 1, etc.

Screenshot 2019-04-06 at 14 06 34

4. Create new *.data file with your class count (COCO has 80 classes), paths to train and validation datasets (we use the same images twice here, but in practice you'll want to validate your results on a separate set of images), and with the path to your *.names file. Save as data/coco16.data.

Screen Shot 2020-04-01 at 11 48 41 AM
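Since the screenshot does not survive here, the sketch below shows one plausible way to write and read back such a file. The key names `classes`, `train`, `valid` and `names` follow the Darknet-style `key=value` convention and are an assumption, as are the helper names; check them against an existing *.data file in the repo before relying on them.

```python
from pathlib import Path


def write_data_file(path, classes, train, valid, names):
    """Write a Darknet-style key=value data file (key names are assumed)."""
    lines = [
        f"classes={classes}",
        f"train={train}",
        f"valid={valid}",
        f"names={names}",
    ]
    Path(path).write_text("\n".join(lines) + "\n")


def read_data_file(path):
    """Parse key=value lines into a dict, skipping blank lines and # comments."""
    options = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options
```

Reading the file back after writing it is a cheap way to confirm that the class count and paths are what you intended.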

5. Update yolov3-spp.cfg (optional). By default each YOLO layer has 255 outputs: 85 values per anchor [4 box coordinates + 1 object confidence + 80 class confidences], times 3 anchors. Update the settings to filters=[5 + n] * 3 and classes=n, where n is your class count. This modification should be made in all 3 YOLO layers.

Screen Shot 2020-04-02 at 12 37 31 PM
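The filter arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula stated in step 5; the function name `yolo_filters` is hypothetical.

```python
def yolo_filters(num_classes, anchors_per_layer=3):
    """Outputs of the conv layer feeding a YOLO layer: (4 box + 1 obj + n classes) per anchor."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return (5 + num_classes) * anchors_per_layer


if __name__ == "__main__":
    for n in (80, 3, 1):
        print(f"classes={n} -> filters={yolo_filters(n)}")
```

With 80 classes this gives the default 255, with 3 classes 24, and with a single class 18.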

6. (OPTIONAL) Update hyperparameters such as LR, LR scheduler, optimizer, augmentation settings, multi_scale settings, etc in train.py for your particular task. If in doubt about these settings, we recommend you start with all-default settings before changing anything.

7. Train. Run python3 train.py --cfg yolov3-spp.cfg --data data/coco16.data --nosave to train using your custom .data and .cfg. By default pretrained --weights yolov3-spp-ultralytics.pt is used to initialize your model. You can instead train from scratch with --weights '', or from any other weights or backbone of your choice, as long as it corresponds to your *.cfg.

Visualize Results

Run from utils import utils; utils.plot_results() to see your training losses and performance metrics vs epoch. If you don't see acceptable performance, try hyperparameter tuning and re-training. Multiple results.txt files are overlaid automatically to compare performance.

Here we see training results from data/coco64.data starting from scratch, a darknet53 backbone, and our yolov3-spp-ultralytics.pt pretrained weights.


Run inference with your trained model by copying an image to the data/samples folder and running:
python3 detect.py --weights weights/last.pt

Reproduce Our Results

To reproduce this tutorial, simply run the following code. This trains all the various tutorials, saves each results*.txt file separately, and plots them together as results.png. It all takes less than 30 minutes on a 2080Ti.

git clone https://github.com/ultralytics/yolov3
python3 -c "from yolov3.utils.google_utils import gdrive_download; gdrive_download('1h0Id-7GUyuAmyc9Pwo2c3IZ17uExPvOA','coco2017demos.zip')"  # datasets (20 Mb)
cd yolov3
python3 train.py --data coco64.data --batch 16 --epochs 300 --nosave --cache --weights '' --name from_scratch
python3 train.py --data coco64.data --batch 16 --epochs 300 --nosave --cache --weights yolov3-spp-ultralytics.pt --name from_yolov3-spp-ultralytics
python3 train.py --data coco64.data --batch 16 --epochs 300 --nosave --cache --weights darknet53.conv.74 --name from_darknet53.conv.74
python3 train.py --data coco1.data --batch 1 --epochs 300 --nosave --cache --weights darknet53.conv.74 --name 1img
python3 train.py --data coco1cls.data --batch 16 --epochs 300 --nosave --cache --weights darknet53.conv.74 --cfg yolov3-spp-1cls.cfg --name 1cls

Reproduce Our Environment

To access an up-to-date working environment (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled), consider a:

mahiratmis commented 5 years ago

I trained and tested mnist data by using this tutorial. Thank you for guidance.

agp-ka32 commented 5 years ago

Hi @glenn-jocher ,

I am trying to train on my custom dataset and I get the following error

image

Can you please let me know the fix for this error? I see that 'model' class in utils.py does not have an attribute 'hyp'. I followed all the steps outlined in order.

Thanks.

agp-ka32 commented 5 years ago

I tried on coco_10img.data; I get the same error.

glenn-jocher commented 5 years ago

@akshaygadipatil the hyp attribute contains hyperparameters set in train.py and attached to model as an easy way to pass the hyperparameters to build_targets() and compute_losses(). We just made this change today. Please git pull to get the absolute latest changes and try again.

Also, what happens if you simply run python3 train.py?

glenn-jocher commented 5 years ago

@akshaygadipatil the example executes correctly on CPU and single GPU. Your issue may be multi-GPU related (you did not specify in your post). If so, git pull and try again.

python3 train.py --data data/coco_1img.data
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-spp.cfg', data_cfg='data/coco_1img.data', dist_url='tcp://127.0.0.1:9999', epochs=273, evolve=False, img_size=416, multi_scale=False, nosave=False, notest=False, num_workers=4, rank=0, resume=False, transfer=False, var=0, world_size=1)

Using CPU

layer                                     name  gradient   parameters                shape         mu      sigma
    0                          0.conv_0.weight      True          864        [32, 3, 3, 3]   -0.00339     0.0648
    1                    0.batch_norm_0.weight      True           32                 [32]      0.987       1.07
    2                      0.batch_norm_0.bias      True           32                 [32]     -0.698       2.07
    3                          1.conv_1.weight      True        18432       [64, 32, 3, 3]   0.000298     0.0177
    4                    1.batch_norm_1.weight      True           64                 [64]       0.88      0.389
    5                      1.batch_norm_1.bias      True           64                 [64]     -0.409       1.01
 ...
  223                      112.conv_112.weight      True        65280     [255, 256, 1, 1]   0.000119     0.0362
  224                        112.conv_112.bias      True          255                [255]  -0.000773     0.0356
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
   0/272         0/0     0.192     0.105      15.3      2.36        18         4      5.58
               Class    Images   Targets         P         R       mAP        F1
Computing mAP: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.90s/it]
                 all         1         6         0         0         0         0

              person         1         3         0         0         0         0
           surfboard         1         3         0         0         0         0

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
   1/272         0/0     0.218    0.0781      15.3      2.36      17.9         5       8.2
               Class    Images   Targets         P         R       mAP        F1
Computing mAP: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.64s/it]
                 all         1         6         0         0         0         0

              person         1         3         0         0         0         0
           surfboard         1         3         0         0         0         0

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
   2/272         0/0     0.165    0.0669      14.7      2.31      17.2         5         7
               Class    Images   Targets         P         R       mAP        F1
Computing mAP: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.49s/it]
                 all         1         6         0         0         0         0

              person         1         3         0         0         0         0
           surfboard         1         3         0         0         0         0

agp-ka32 commented 5 years ago

@glenn-jocher, thanks! Sorry about not mentioning single/multi GPU usage. I am actually running on a 2-GPU machine.

To be in sync, I tried with the latest changes in the repo. The training has begun. Thank you!

image

agp-ka32 commented 5 years ago

Hi @glenn-jocher ,

I ran into a problem and need help: there was a power cut, so the GPUs shut off. I would like to resume the training process from the latest checkpoint "latest.pt". This was on a multi-GPU machine. I tried changing the weight file in line 87 of train.py: cutoff = load_darknet_weights(model, weights + 'latest.pt')

When I run python3 train.py, I get an error message: image

Can you help me solve this? Thanks!

agp-ka32 commented 5 years ago

Never mind, I should have changed line 67 instead of line 87 (in train.py). By the way, in train.py I changed line 122 to sampler=None, because with sampler=sampler I was getting the error shown below:

sampler option is mutually exclusive with shuffle

The error was gone after my fix and the training began (this was all yesterday). I believe this change is not wrong. What do you say?

glenn-jocher commented 5 years ago

@akshaygadipatil as the README clearly states https://github.com/ultralytics/yolov3#training

Start Training: python3 train.py to begin training after downloading COCO data with data/get_coco_dataset.sh.

Resume Training: python3 train.py --resume to resume training from weights/latest.pt.

Jriandono commented 5 years ago

@glenn-jocher Hi Glenn, I didn't know that this custom training guide existed. Thanks for the reply earlier. I'm just a bit confused about how we actually train.

When we run

  1. Train. Run python3 train.py --data data/coco_10img.data to train using your custom data. If you created a custom *.cfg file as well, specify it using --cfg cfg/my_new_file.cfg.

are we actually training the model to look for the bounding boxes of a random image (from the COCO dataset)?

I ask because I'm confused by steps 1 and 2.

In step 1 you convert your data into Darknet format, where it consists of 1.jpg (image) and 1.txt (bounding boxes),

but in step 2 we actually train with the COCO dataset, not our own dataset? Since the text file holds the paths of the images, I guess I just don't understand how to modify step 2.

glenn-jocher commented 5 years ago

@Jriandono you need to create your own *.txt files pointing to your own list of training and testing images. coco_10img.txt is an example with 10 images in it. Clearly, you make your own if you want to use your own data.

guxiaowei1 commented 5 years ago

I want to train on custom data, but the following error happened. I think my converted.pt is not correct, and I don't know how to modify it. Please help me.

Namespace(accumulate=1, backend='nccl', batch_size=1, cfg='cfg/yolov3.cfg', data_cfg='data/coco_10img.data', dist_url='tcp://127.0.0.1:9999', epochs=273, evolve=False, img_size=416, multi_scale=False, nosave=False, notest=False, num_workers=0, rank=0, resume=False, transfer=False, var=0, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='GeForce GTX 1050', total_memory=2048MB)
Traceback (most recent call last):
  File "G:/pycharm/yolo/yolov3-master/train.py", line 309, in <module>
    multi_scale=False,
  File "G:/pycharm/yolo/yolov3-master/train.py", line 88, in train
    chkpt = torch.load(latest, map_location=device)  # load checkpoint
  File "C:\Users\HP\Anaconda3\envs\wei\lib\site-packages\torch\serialization.py", line 368, in load
    return _load(f, map_location, pickle_module)
  File "C:\Users\HP\Anaconda3\envs\wei\lib\site-packages\torch\serialization.py", line 532, in _load
    magic_number = pickle_module.load(f)
_pickle.UnpicklingError: invalid load key, '5'.

glenn-jocher commented 5 years ago

@guxiaowei1 you don't need converted.pt to train custom data; you can start training from scratch (i.e. the darknet53 backbone). Just run: python3 train.py --data data/mycustomfile.data --cfg cfg/mycustomfile.cfg

guxiaowei1 commented 5 years ago

Thank you so much for your kind reply. If I want to use transfer learning, how should I handle that? The converted.pt was created by convert.py in yolov3.

Sam813 commented 5 years ago

@glenn-jocher First of all, thank you for creating this repository. I have followed all the above steps to train the model on my own dataset. I have 3 classes of samples, so I have modified the filters in *.cfg to filters = 24, but I get one error: image

I guess it is most probably due to my image input size. My images all have a fixed size of 100x100. Would you please guide me on which part of the code would be affected by this?

glenn-jocher commented 5 years ago

@Sam813 this may be related to a recent commit which was fixed. git pull and try again?

Sam813 commented 5 years ago

Hi @glenn-jocher, I have tried the new git pull. After that, the error below happens in some recently added part of the code. image

I have 3 classes, and also modified the data.cfg and *.cfg

glenn-jocher commented 5 years ago

@Sam813 your custom data is not configured correctly. If you have 3 classes they should be zero indexed and the class counts in your cfg and .data file should correspond. The error message is saying you are stating 4 classes somewhere and it is not matching up with 3.
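A quick way to look for this kind of mismatch is to scan the label files for class indices outside the range 0 to n-1. The sketch below assumes Darknet-style label rows whose first field is the class index; the helper name `find_bad_class_ids` is hypothetical and not part of this repo.

```python
from pathlib import Path


def find_bad_class_ids(label_dir, num_classes):
    """Return (file name, line number, class id) for every out-of-range class id."""
    problems = []
    for label_file in sorted(Path(label_dir).glob("*.txt")):
        rows = label_file.read_text().splitlines()
        for line_no, line in enumerate(rows, start=1):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            class_id = int(float(fields[0]))
            if not 0 <= class_id < num_classes:
                problems.append((label_file.name, line_no, class_id))
    return problems
```

With 3 classes, any row whose class index is 3 or higher (for example because the labels were exported one-indexed) will show up in the returned list.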

Sam813 commented 5 years ago

  • Your custom data. If your issue is not reproducible with COCO data we can not debug it. Visit our Custom Training Tutorial for exact details on how to format your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.

@glenn-jocher Thank you for your help, I found the problem, I had forgotten to set num classes in the cfg file. But now my training results are not making any sense.

image

All of the precision, recall and F1 values are constantly 0. Yet I can see the confidence loss decreasing throughout training. Do you have any idea what's wrong here?

glenn-jocher commented 5 years ago

@Sam813 you are plotting multiple runs sequentially, as results.txt is not erased between runs. If you have zero losses for bounding box regressions, it means you have no bounding boxes to regress, which likely means you have no targets at all, and that the repo cannot find your training data.
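One way to check whether the images and labels can actually be found is sketched below. It assumes the layout described in the guide (the label path is the image path with the `images` directory swapped for `labels` and the extension swapped for `.txt`); the helper names are hypothetical, and relative paths are resolved against the current working directory.

```python
from pathlib import Path


def label_path_for(image_path):
    """Swap the last 'images' directory for 'labels' and the extension for .txt."""
    parts = list(Path(image_path).parts)
    for i in range(len(parts) - 2, -1, -1):  # skip the file name itself
        if parts[i] == "images":
            parts[i] = "labels"
            break
    return Path(*parts).with_suffix(".txt")


def check_dataset(list_path):
    """Return (missing_images, images_without_labels) for an image list file."""
    missing_images, unlabeled = [], []
    for line in Path(list_path).read_text().splitlines():
        image = line.strip()
        if not image:
            continue
        if not Path(image).is_file():
            missing_images.append(image)
        elif not label_path_for(image).is_file():
            unlabeled.append(image)
    return missing_images, unlabeled
```

An image without a label file is allowed when it contains no objects, but if every image ends up in the second list, the labels folder is probably in the wrong place.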

Sam813 commented 5 years ago

@glenn-jocher thank you for the help. If I am not mistaken, you mean my data is not prepared properly? But I have followed the steps to prepare the data. Moreover, I got this output picture for the test batch, which confused me: image

Does it mean anything to you? I guess the bounding boxes have been detected but the image is pure white? Could you explain it a bit more?

glenn-jocher commented 5 years ago

@Sam813 no this is not correct, your data seems to be missing the images. The train_batch0.jpg file generated when training starts (for correctly prepared data) should look similar to this:

train_batch0.jpg train_batch0

Ai-is-light commented 5 years ago

@glenn-jocher would you mind giving me an example of "Box coordinates must be in normalized xywh format (from 0 - 1)."? I'm a little bit confused about normalized xywh.

sanazss commented 5 years ago

Dear Glenn, I have single-channel satellite data and a single class. I already followed the instructions on data preparation and provided bounding boxes; however, I still have two issues. First, I want to load the images, which differs from loading ordinary pictures, and I should use GDAL for that. Then I want to resize them, because their size is 512*512 at the moment, as well as normalize them and convert them to tensors. The second issue is splitting them into training and validation sets. I am following the ultralytics code and would like some advice on customizing my data in the class LoadImages and class LoadImagesAndLabels(Dataset). Many thanks for any advice. Sanaz

glenn-jocher commented 5 years ago

@sanazss this repo handles all image loading, resizing, rescaling and augmentation for a variety of image formats including tif. I recommend you simply follow the example tutorials with your existing data.

sanazss commented 5 years ago

I downloaded the COCO data, and its structure is two separate folders for images and labels, each containing a train and a validation set. However, I found that you use this path: def coco_class_count(path='coco/images/labels/train2014/'): while there is no folder named labels inside the images folder. I am so confused about this structure, and I think it is one of the reasons I cannot run the code properly on my images.

glenn-jocher commented 5 years ago

@sanazss you need to mirror the coco data structure properly to train. The python argument you mention has no place in this discussion; it's not called at all during training.

Chida15 commented 5 years ago

hello, I just got 0% mAP all the time, what should I do to solve it?

glenn-jocher commented 5 years ago

@Chida15 start by reproducing the tutorial.

n0ct4li commented 4 years ago

Hi,

I try to apply the tutorial using kitti dataset. I have a GPU so I select device=0 in the train file, but when I am running it is still using the CPU. Anyone had trouble with this?

pbYolo Nvidia

glenn-jocher commented 4 years ago

@GotCstl PyTorch is unable to locate your GPU. You need to install PyTorch and CUDA correctly.
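A minimal check of what PyTorch can see is sketched below. It uses only `torch.cuda.is_available()`, `torch.cuda.get_device_name()` and the version attributes; the wrapper function `describe_device` is hypothetical.

```python
def describe_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        return f"CUDA device 0: {torch.cuda.get_device_name(0)}"
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    return f"CPU only (torch {torch.__version__}, built for CUDA {torch.version.cuda})"


if __name__ == "__main__":
    print(describe_device())
```

If this reports a CPU-only build, installing a CUDA-enabled PyTorch build that matches your driver is usually the next step.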

n0ct4li commented 4 years ago

I managed to launch on the GPU, but now I have the following error: PbYolo2

glenn-jocher commented 4 years ago

@GotCstl your data is likely not formatted correctly. Start from the tutorial dataset (i.e. coco_64img.data) and go from there.

n0ct4li commented 4 years ago

I am on windows and here is an extract of the text file for train img locations and Data file . Is there any problem?

pb4 pb5

n0ct4li commented 4 years ago

@glenn-jocher I managed to fix the problem: the image size in the config file and the train file differed.

Just want to know, how can I get in a file the values for the bounding box coordinates prediction for each test image?

glenn-jocher commented 4 years ago

@n0ct4li detect.py lets you save outputs to a text file by setting save_txt=True: https://github.com/ultralytics/yolov3/blob/b62dc6f06a4288d759151ea8289da0908f64db2c/detect.py#L9

n0ct4li commented 4 years ago

@glenn-jocher Perfect. One last question: if I want to train on gray-scale images, do I just have to set channels = 1 in the config file, or are there other things to change, like maybe the number of trainable layers?

Edit: I got the following error after changing only the channel parameter to 1: pb6

glenn-jocher commented 4 years ago

@n0ct4li yes for greyscale images set channels=1 in your cfg file.

Beware you may not be able to preload a backbone this way.

n0ct4li commented 4 years ago

Why am I getting an error? (I set channels=1.)

GKDHurryUp commented 4 years ago

Thanks for sharing. I followed your tutorial and trained on my custom data with 2 classes, but I got low precision and low mAP. results

How to solve this problem?

oscarzasa commented 4 years ago

Hi,

I'm trying to train on my own custom data set for only one class but I get the following error: fail_classes I've trained with the same .cfg file on PyTorch before and I'm pretty sure my .data file and my .txt files are correct. But I'm not sure if I need to modify something in train.py to training for one class. I would really appreciate the help. Cheers

glenn-jocher commented 4 years ago

@oscarzasa see the single class tutorial in the wiki: https://github.com/ultralytics/yolov3/wiki

glenn-jocher commented 4 years ago

@GKDHurryUp read up on vision and ML and follow some of the common options: change your batch size, optimizer, hyperparameters, increase training data, img-size etc.

GKDHurryUp commented 4 years ago

I think you should check the classes and filters numbers around [yolo] in the .cfg file. There is no need to modify anything in train.py except some file paths.

GKDHurryUp commented 4 years ago

Thanks a lot. I'll try it.

oscarzasa commented 4 years ago

@glenn-jocher thanks mate, I will. Cheers.

oscarzasa commented 4 years ago

@GKDHurryUp I already modified my .cfg file for one class and with the proper number of filters, because I previously trained with the same file using Darknet. But maybe I'm missing something else. Thanks though.

houzeyu2683 commented 4 years ago

Does anyone know how to normalize the box xywh? For example, if the image size is (640, 480, 3) and the original box is (38, 75, 20, 60), what is the normalized result?

RajashekarY commented 4 years ago

Here you will find the solution @Hzyu810225

https://blog.goodaudience.com/part-1-preparing-data-before-training-yolo-v2-and-v3-deepfashion-dataset-3122cd7dd884

houzeyu2683 commented 4 years ago

Is it possible to randomly initialize the model weights before training? I.e., I don't want to train the model by loading yolo.weight; I want random initial weights.