Training using custom dataset does not converge

MSch8791 commented 3 years ago

Hi,

I am training the model efficientdet-d0 with a custom dataset but the training does not converge and loss stays around 2. I have converted the dataset in the COCO format to the TFRecord format using the create_coco_tfrecord script. When inspecting the resulting TFRecords using the script dataset/inspect_tfrecords.py everything seems good (at least visually on the generated images).

I have tried many tips coming from issues #676 #691 and #213 however nothing helped.

My config file is :

num_classes: 61
var_freeze_expr: '(efficientnet|fpn_cells|resample_p6)'
input_rand_hflip: false
learning_rate: 0.005
lr_warmup_init: 0.0005
moving_average_decay: 0.0
image_size: 240

I use the following command to train the model :

python main.py --mode=train --train_file_pattern=..\..\..\data\tfrecords\train_set\*.tfrecord --val_file_pattern=..\..\..\data\tfrecords\val_set\*.tfrecord --model_name=efficientdet-d0 --model_dir=tmp\efficientdet-d0-litterdetection --ckpt=efficientdet-d0 --train_batch_size=4 --eval_batch_size=4 --eval_samples=150 --num_examples_per_epoch=1200 --num_epochs=20 --hparams=D:\Projects\Litter_Detection\TACO\TACO_efficiendet_config.yaml

I have tried various values for the learning rate, the image size, the number of epochs and the number of samples per epoch but nothing helped to converge. I tried with use_keras_model: false too but no changes.

kartik4949 commented 3 years ago

also try with deleting var_freeze_expr and rerun

fsx950223 commented 3 years ago

Change anchors.

MSch8791 commented 3 years ago

Where can I change the anchors sizes ? It seems I can just change the related aspect ratios in the parameters.

EDIT: Based on what is said in the issue #524 by mingxingtan: "[...] the current default settings are level 3-7 and anchor_scale 4, which means they would cover object sizes from 4 2^3 = 32, to 4 2^7 = 512. You can use larger anchor scale (e.g. anchor_scale=5) if your datasets have a lot of big objects, or smaller anchor scale for smaller objects (e.g. anchor_scale=3 or 2). [...]"

It seems that to change the anchors sizes we need to change the parameters anchor_scale and potentially min_level and max_level. Is that right ?

I am trying with anchor_scale 1.0 and 2.0 to have anchors sizes respectively from 8 to 128 and from 16 to 256.

SiBensberg commented 3 years ago

@MSch8791 Try to find out which sizes the objects in your dataset have. The sizes you cited are always regarding the input picture. You do not necesserely need to change the min_ and max_level. There is a good chance that just anchor changes will work.

The shapes of your objects could also be a problem. For example wanting to detect poles will be difficult with standard settings because of the 'extreme' aspect ratio. For that change the anchor ratios with: h.aspect_ratios.

MSch8791 commented 3 years ago

Using the script provided by mnslarcher as shown in the issue #412 , I got these results :

python kmeans_anchors_ratios.py --instances ..\..\..\data\annotations.json --anchors-sizes 8 16 32 64 128 --input-size 512 --normalizes-bboxes True --num-runs 3 --num-anchors-ratios 4 --max-iter 300 --min-size 0 --iou-threshold 0.5 --decimals 1

...
[01/04 21:18:21] Starting the calculation of the optimal anchors ratios
[01/04 21:18:21] Extracting and preprocessing bounding boxes
[01/04 21:18:21] Discarding 0 bounding boxes with size lower or equal to 0
[01/04 21:18:21] K-Means (3 runs): 100%|██████████████████| 3/3 [00:00<00:00, 17.49it/s]
        Runs avg. IoU: 84.07% ± 0.00% (mean ± std. dev. of 3 runs, 0 skipped)
        Avg. IoU between bboxes and their most similar anchors after norm. them to make their area equal (only ratios matter): 84.07%
[01/04 21:18:21] Default anchors ratios: [(0.7, 1.4), (1.0, 1.0), (1.4, 0.7)]
        Avg. IoU between bboxes and their most similar default anchors, no norm. (both ratios and sizes matter): 62.94%
        Num. bboxes without similar default anchors (IoU < 0.5):  829/4784 (17.33%)
[01/04 21:18:21] K-Means anchors ratios: [(0.7, 1.3), (1.0, 1.0), (1.3, 0.7)]
        Avg. IoU between bboxes and their most similar K-Means anchors, no norm. (both ratios and sizes matter): 63.64%
        Num. bboxes without similar K-Means anchors (IoU < 0.5):  823/4784 (17.20%)
[01/04 21:18:21] K-Means anchors have an IoU < 50% with bboxes in 0.13% less cases than the default anchors, you should consider to use them

So I left the default anchor ratios since they are valid.

I have tried to train with anchors_scale = 1.0, with and without var_freeze_expr: '(efficientnet|fpn_cells|resample_p6)', modifying the learning rate, etc but nothing helped...

Now my config file is as below :

num_classes: 61
var_freeze_expr: '(efficientnet|fpn_cells|resample_p6)'
input_rand_hflip: false
learning_rate: 0.008
lr_warmup_init: 0.0008
moving_average_decay: 0.0
image_size: 512
anchor_scale: 1.0

EDIT: I've noticed that there is a variable num_scales that had value 3 by default, since I want 5 scales I changed it to the value 5. I have changed the aspect_ratios variable to take into account more ratios. I am trying to train again with these parameters this time.

h.min_level = 3
h.max_level = 7
h.num_scales = 5 #3
# ratio w/h: 2.0 means w=1.4, h=0.7. Can be computed with k-mean per dataset.
h.aspect_ratios = [[0.6, 1.6], [0.8, 1.2], [1.0, 1.0], [1.2, 0.8], [1.5, 0.6]]

EDIT 2 : Unfortunately, it still does not converge.

I0105 02:08:47.381953 44084 basic_session_run_hooks.py:260] loss = 2.1648629, step = 59900 (20.873 sec)
INFO:tensorflow:box_loss = 0.0155942645, cls_loss = 1.2619246, det_loss = 2.041638, step = 59900 (20.873 sec)

tensorboard_efficientdet_not_converge

SiBensberg commented 3 years ago

My objects are between 6px and ~300px an anchor scale of 0.8 betters my results from 0.5 to 0.85. Did you try what Kartik said? Because freezing variables just work good if a pretrained model is loaded. What model do you start training on? My last guess is to check the jitter. If your objects are heavily dependent on color. You should disable it.

kartik4949 commented 3 years ago

Yes please try with no var freeze

MSch8791 commented 3 years ago

My objects are between 4px and 250+ pixels rescaled on a 512x512px image. The problem here is that I am not yet fine-tuning to find the best parameters for my problem but I am just trying to make the model converge to have a first working version of parameters because on eval results everything is zero. I already tried to remove the freezing of variables, and it didn't help much. I am training on the provided checkpoint of efficientdet-d0 as you can see in the command line.

Concerning the jitter I will check it.

Another information is because I have an humble configuration with a humble GPU capacity (NVIDIA GeForce GTX 960M), I can't set a training batch superior to 1 otherwise I it runs out of memory. And I am under Windows on this machine unfortunately which caps the VRAM usage.

kartik4949 commented 3 years ago

Please try with batch size 8 with gradient checkpointing enabled?

MSch8791 commented 3 years ago

I enabled grad_checkpoint and batch size 8 but I got this error :

2021-01-05 22:38:54.459069: F .\tensorflow/core/kernels/conv_2d_gpu.h:1019] Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles<T, kNumThreads, kTileSize, kTileSize, conjugate>, total_tiles_count, kNumThreads, 0, d.stream(), input, input_dims, output) status: Internal: the launch timed out and was terminated
2021-01-05 22:38:54.459572: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_TIMEOUT: the launch timed out and was terminated
Fatal Python error: Aborted

SiBensberg commented 3 years ago

Why did you set h.moving_average_decay to 0?

MSch8791 commented 3 years ago

Why did you set h.moving_average_decay to 0?

I tried to set it to 0 based on what I read in issue #691 I've set again this parameter to the original default value but it has no impact on my training, still stuck around 1.8-2.0.

I put back batch size = 1 since it gives me errors when increasing the size. I have observed that I can reach a batch size = 2 if I decrease the input size under 300px and based on how many anchors scales I use. Seems to be related to video memory allocation vs capacity issue even if tensorflow or cuda is not showing me any OOM errors this time.

MSch8791 commented 3 years ago

I am thinking about something : I generated the TFRecord files from the dataset leaving the images and annotations with their original sizes. Do I needed to rescale to the desired input size before converting to TFRecord files or does the efficientdet code is doing it for me (image and bbox rescaling) ?

EDIT : Sampling more data using script dataset/inspect_tfrecords.py shows me that some bounding box are not well rescaled. Some are corrects and others are incorrectly rescaled. Is that a bug ?

SiBensberg commented 3 years ago

The EfficientDet Code scales down the boxes and pictures. But you should definitely check these boxes. You can also try this tool: https://github.com/sulc/tfrecord-viewer If the boxes are also incorrect you should check your dataset creation.

kartik4949 commented 3 years ago

@SiBensberg we now support inhouse tool for that please use dataset/inspect_record.py

superBertBerg commented 3 years ago

@MSch8791 which optimizer do you use? -> ADAM did the trick for me, it's a little to a lot more robust.

MSch8791 commented 3 years ago

Thanks for your replies.

I tried to resize my dataset by myself so I wrote a script to do it. With it I resized the images to be 512x512px and made the necessary changes in the annotations file about the bounding box definitions. Once this done, I observed that when using the dataset/inspect_tfrecords.py script it shows me correct bounding boxes (even if asking it to rescale to 256x256px or 800x800px for example in the yaml configuration file). So there were indeed a bug in the inspect_tfrecords.py script when rescaling the bounding boxes from my dataset's original scales. Finally I re-managed my classes to have only 1 class in order to simplify the game and because my dataset was quite imbalanced.

Anyway, I then tried to train and I have seen the loss converging a bit more, I even got some correct detections when inspecting resulting images from the model_inspect.py script, even if the performances were still very poor...

After a while and after a lot of trials of modifying parameters and training on my machine. I decided to give a try on a Colab notebook having a NVIDIA T4 gpu under Linux at my disposal which let me increase the train batch size. After a first try of training and even stopping it after only 3000+ steps it worked ! Getting decent performances. The command line I used :

python main.py --mode=train --train_file_pattern="/content/tfrecords/train_set/*.tfrecord" --model_name=efficientdet-d0 --model_dir="/content/effdetd0" --ckpt="/content/efficientdet-d0" --hparams="num_scales=5, num_classes=2, anchor_scale=2.0, image_size=512, input_rand_hflip=false, autoaugment_policy='v0'" --train_batch_size=16

I even trained a efficientdet D2 model (below the evaluation on my validation set) :

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.324
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.515
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.354
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.084
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.454
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.608
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.198
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.386
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.456
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.217
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.604
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.726

As I am curious I tried to train again on my machine with the same parameters as on my Colab notebook for efficiendet D0, decreasing the batch size to 1 of course since otherwise it runs OOM. But it didn't work really well. I am pretty sure it is related to train batch size.

ameyparanjape commented 3 years ago

anchor_scale=2.0 @MSch8791 What was the process to get the optimum h.anchor_scale?

google / automl

Training using custom dataset does not converge #920