sovit-123 / fasterrcnn-pytorch-training-pipeline

PyTorch Faster R-CNN Object Detection on Custom Dataset
MIT License

Validation mAP not consistent #109

Open JerickoDG opened 1 year ago

JerickoDG commented 1 year ago

After finishing 20 epochs, the terminal displayed this:

wandb: Run summary:
wandb: train_loss_box_reg 0.01077
wandb:     train_loss_cls 0.00603
wandb:   train_loss_epoch 0.01768
wandb:    train_loss_iter 0.01653
wandb:     train_loss_obj 0.00029
wandb:     train_loss_rpn 0.00059
wandb:         val_map_05 0.87624
wandb:      val_map_05_95 0.58491

As can be seen, the mAP@0.5 and mAP@0.5:0.95 are relatively high for the last model state. To verify this, I tried evaluating last_model.pth using the eval.py script with this command:

python eval.py --weights outputs/training/custom_training/last_model.pth --data data_configs/custom_data.yaml --model fasterrcnn_resnet50_fpn_v2 --imgsz 320 --verbose

Which gave me this result:

{'classes': tensor([1, 2], dtype=torch.int32),
 'map': tensor(0.3669),
 'map_50': tensor(0.6886),
 'map_75': tensor(0.3405),
 'map_large': tensor(0.4444),
 'map_medium': tensor(0.3408),
 'map_per_class': tensor([0.4621, 0.2718]),
 'map_small': tensor(0.2078),
 'mar_1': tensor(0.4233),
 'mar_10': tensor(0.4997),
 'mar_100': tensor(0.5007),
 'mar_100_per_class': tensor([0.6162, 0.3851]),
 'mar_large': tensor(0.5776),
 'mar_medium': tensor(0.4468),
 'mar_small': tensor(0.3199)}

As can be seen, the resulting mAP@0.5 and mAP@0.5:0.95 were different from the wandb values; the mAPs from eval.py were lower.

Also, it seems that the script automatically uses the TEST_DIR_IMAGES/TEST_DIR_LABELS paths if they exist, so I pointed them to the validation set like this:

TRAIN_DIR_IMAGES: 'custom_data/train'
TRAIN_DIR_LABELS: 'custom_data/train'
VALID_DIR_IMAGES: 'custom_data/valid'
VALID_DIR_LABELS: 'custom_data/valid'
TEST_DIR_IMAGES: 'custom_data/valid'
TEST_DIR_LABELS: 'custom_data/valid'

I am not sure where I went wrong. Do you happen to know the probable cause?

sovit-123 commented 1 year ago

Hi. There are a few things to consider here. If the dataset YAML file contains a test set, then the evaluation script will automatically use the test set for evaluation. Else, it uses the validation set.
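
In pseudocode, the selection works roughly like this (illustrative only, not the exact code from the repository):

# Illustrative sketch only, not the exact code from the repository.
# If the data YAML defines test paths, evaluate on them; otherwise
# fall back to the validation paths.
import yaml

def pick_eval_split(data_yaml_path):
    with open(data_yaml_path) as f:
        cfg = yaml.safe_load(f)
    if cfg.get('TEST_DIR_IMAGES') and cfg.get('TEST_DIR_LABELS'):
        return cfg['TEST_DIR_IMAGES'], cfg['TEST_DIR_LABELS']
    return cfg['VALID_DIR_IMAGES'], cfg['VALID_DIR_LABELS']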

However, it seems that you have already changed that. Then only one thing remains. Can you please check the log file from the training output directory and see whether the final epoch's log matches the evaluation logs? I will need to check as well. However, from my multiple experiments, I can confirm that the mAP should be the same. Instead of the WandB logs, please check the model logs from the directory once.

JerickoDG commented 1 year ago

Hi. Thank you for your response. By model logs, do you mean the train.log file? If so, it did not have any contents (0 KB file size). My current guess is that it was overwritten when I executed:

python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --resume

I executed that resume-training command whenever an error stopped the training.

Thus, I think passing --name custom_training might have overwritten train.log, since the training stopped (due to an error) at the last epoch while saving the best model. However, the last (20th) epoch was recorded in the results.csv file (meaning it had finished), so when I entered the command above, train.log may have been overwritten with nothing because there was nothing left to continue; training had already reached the last epoch.

That is only my guess as to why train.log is empty. Please share your thoughts if you agree or have other guesses.

sovit-123 commented 1 year ago

Yes, that may be the reason it is blank.

sovit-123 commented 1 year ago

Also, please check that --imgsz was 320 during training. It should be the same during training and evaluation; otherwise, the numbers will be different.

I will also check one thing from my side: whether WandB at the end shows the mAP from the best epoch or the last epoch. Because from multiple experiments, I know that the numbers should match.

JerickoDG commented 1 year ago

> Yes, that may be the reason it is blank.

With that said, should I omit --name custom_training from my resume-training command so it stops overwriting the train.log from the epoch I resumed at up to the epoch I stopped at?

JerickoDG commented 1 year ago

> Also, please check that --imgsz was 320 during training. It should be the same during training and evaluation; otherwise, the numbers will be different.

> I will also check one thing from my side: whether WandB at the end shows the mAP from the best epoch or the last epoch. Because from multiple experiments, I know that the numbers should match.

The image size in my opt.yaml file is 320. These are the contents of the file:

model: fasterrcnn_resnet50_fpn_v2
data: data_configs\custom_data.yaml
device: cuda
epochs: 20
workers: 4
batch: 4
lr: 0.001
imgsz: 320
name: custom_training
vis_transformed: false
mosaic: 0.0
use_train_aug: false
cosine_annealing: false
weights: outputs\training\custom_training\last_model.pth
resume_training: true
square_training: false
world_size: 1
dist_url: env://
disable_wandb: false
sync_bn: false
amp: false
seed: 0
project_dir: null
distributed: false

sovit-123 commented 1 year ago

Yes, you can give a different name. It will resume training, but the resulting directory will be different. You will have separate logs for both trainings for later analysis.

JerickoDG commented 1 year ago

> Yes, you can give a different name. It will resume training, but the resulting directory will be different. You will have separate logs for both trainings for later analysis.

Noted. May I ask if the graphs (.png files of mAP and losses) will still be consistent or continuous across those separate resulting directories? For instance, in output/training/res_1, I trained my model from epochs 1 to 5. I stopped it. After that, I resumed it from epochs 6 to 10 to be stored in output/training/res_2. Then stopped it again to resume training the next day. Finally, I resumed it from epochs 11 to 20 (stored in output/training/res_3) to end the training.

Will the graphs (.png files of mAP and losses) still be consistent or continuous in that case?

sovit-123 commented 1 year ago

Yes, they will be consistent. The train loss list, train loss epoch, and epoch graphs will be consistent. The individual loss graphs will be created anew each time.

JerickoDG commented 1 year ago

Okay, noted. I will try training again with stop and resume, then observe and provide an update on the outcome, especially the consistency of the mAP between the wandb output and the local model directory. Thank you very much again for your answers.

sovit-123 commented 1 year ago

No issues. Let me know.

JerickoDG commented 1 year ago

Hi. I would like to provide an update regarding my observation.

I tried training the model for 4 epochs. I stopped the training during the 3rd epoch (2 if zero-indexed) and resumed up to the fourth and final epoch using the command: python train.py --data data_configs\custom_data.yaml --weights outputs\training\res_1\last_model.pth --model fasterrcnn_resnet50_fpn_v2 --imgsz 320 --epochs 4 --resume

Thus, there were two folders in outputs/training, res_1 and res_2. I noticed the following:

The results of the final (4th) epoch from wandb are the following:

wandb: Run summary:
wandb: train_loss_box_reg 0.03585
wandb:     train_loss_cls 0.01717
wandb:   train_loss_epoch 0.05724
wandb:    train_loss_iter 0.0605
wandb:     train_loss_obj 0.00183
wandb:     train_loss_rpn 0.00238
wandb:         val_map_05 0.88586
wandb:      val_map_05_95 0.5767

The results of the final (4th) epoch using eval.py with the command: python eval.py --data data_configs/custom_data.yaml --weights outputs/training/res_2/last_model.pth --model fasterrcnn_resnet50_fpn_v2 --imgsz 320 --verbose

{'classes': tensor([1, 2], dtype=torch.int32),
 'map': tensor(0.3774),
 'map_50': tensor(0.7105),
 'map_75': tensor(0.3364),
 'map_large': tensor(0.4573),
 'map_medium': tensor(0.3382),
 'map_per_class': tensor([0.4867, 0.2680]),
 'map_small': tensor(0.2141),
 'mar_1': tensor(0.4326),
 'mar_10': tensor(0.5235),
 'mar_100': tensor(0.5241),
 'mar_100_per_class': tensor([0.6467, 0.4016]),
 'mar_large': tensor(0.5880),
 'mar_medium': tensor(0.4661),
 'mar_small': tensor(0.3583)}

As can be seen, the mAP@0.5 and mAP@0.5:0.95 results for the validation data still differed between wandb and eval.py.

For additional information, here are the contents of opt.yaml from the res_2 folder:

model: fasterrcnn_resnet50_fpn_v2
data: data_configs\custom_data.yaml
device: cuda
epochs: 4
workers: 4
batch: 4
lr: 0.001
imgsz: 320
name: null
vis_transformed: false
mosaic: 0.0
use_train_aug: false
cosine_annealing: false
weights: outputs\training\res_1\last_model.pth
resume_training: true
square_training: false
world_size: 1
dist_url: env://
disable_wandb: false
sync_bn: false
amp: false
seed: 0
project_dir: null
distributed: false

And here are the contents of custom_data.yaml:

TRAIN_DIR_IMAGES: 'custom_data/train'
TRAIN_DIR_LABELS: 'custom_data/train'
VALID_DIR_IMAGES: 'custom_data/valid'
VALID_DIR_LABELS: 'custom_data/valid'
valid_DIR_IMAGES: 'custom_data/valid'
valid_DIR_LABELS: 'custom_data/valid'

CLASSES: [ 'background', 'handgun', 'knife' ]

NC: 3

SAVE_VALID_PREDICTION_IMAGES: True

Right now, I am thinking the problem might come from how the training pipeline resumes training from the most recent epoch, but I am not sure specifically why or where. When I trained continuously (no stop and resume), the mAPs (0.5 and 0.5:0.95) from wandb and eval.py on the validation set were equal, or at least very close to each other.

May I know your thoughts on this and a possible fix? I hope for your response. Thank you very much.

sovit-123 commented 1 year ago

If you are getting the same mAP with continuous training but not with resumed training, then I will take a look. I have not compared mAP values with resumed training yet, so I guess I missed this issue. Thanks for bringing it up.

For now, I can say that you can safely resume training. Just be sure not to rely on the WandB logs and to run eval.py using the best saved model. I am sure that you will have no issues with that.
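
For example, something along these lines (adjust the checkpoint filename to whatever your run directory actually contains for the best model):

python eval.py --weights outputs/training/custom_training/best_model.pth --data data_configs/custom_data.yaml --model fasterrcnn_resnet50_fpn_v2 --imgsz 320 --verbose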

Let me know if you want to keep this issue open or I can close it and keep you posted on the progress.

JerickoDG commented 1 year ago

Okay, noted. I would like to keep this issue open until it is resolved; please keep me posted on this thread as well. I would like to try training (with stop and resume) and observing again as soon as the fix gets pushed and merged into the main branch.

Thank you very much.

sovit-123 commented 1 year ago

So, I stopped training and resumed. This is from WandB from the last epoch after resuming.

wandb: Waiting for W&B process to finish... (success).
wandb: | 170.694 MB of 170.694 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: train_loss_box_reg █▃▁
wandb:     train_loss_cls █▃▁
wandb:   train_loss_epoch █▃▁
wandb:    train_loss_iter █▃▃▃▃█▄▅▃▆▅▃▂▅▂▁▅▂▁▂▃▃▃▁▂▃▃▂▂▂▃▅▃▃▂▂▂▁▂▅
wandb:     train_loss_obj █▃▁
wandb:     train_loss_rpn █▄▁
wandb:         val_map_05 ▁▄█
wandb:      val_map_05_95 ▁▄█
wandb: 
wandb: Run summary:
wandb: train_loss_box_reg 0.28041
wandb:     train_loss_cls 0.2824
wandb:   train_loss_epoch 0.62553
wandb:    train_loss_iter 1.20801
wandb:     train_loss_obj 0.03414
wandb:     train_loss_rpn 0.02859
wandb:         val_map_05 0.45671
wandb:      val_map_05_95 0.23937

And this is after running evaluation with the following command: python eval.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs/training/resume_test_2/last_model_state.pth --data data_configs/aquarium.yaml --imgsz 640 --square-training

'map': tensor(0.2399),
'map_50': tensor(0.4626),

I think they are pretty close. Whatever small floating-point difference remains may be because the training/validation loop uses pycocotools while eval.py uses torchmetrics. But I think the mAP is consistent.

Let me know your thoughts.

JerickoDG commented 1 year ago

Hi. Thanks for the update. May I ask if --square-training is recommended for the eval.py command?

May I also ask for your initial training command, how you stopped the initial training, and the resume-training command you used? Maybe the probable cause lies there; just a guess for now, though.

JerickoDG commented 1 year ago

May I also take a look at the custom_data.yaml you configured and used?

sovit-123 commented 1 year ago

Sure. I used the aquarium dataset whose YAML file already comes with the repository. I commented out the test paths.

# Images and labels directory should be relative to train.py
TRAIN_DIR_IMAGES: '../input/Aquarium Combined.v2-raw-1024.voc/train'
TRAIN_DIR_LABELS: '../input/Aquarium Combined.v2-raw-1024.voc/train'
VALID_DIR_IMAGES: '../input/Aquarium Combined.v2-raw-1024.voc/valid'
VALID_DIR_LABELS: '../input/Aquarium Combined.v2-raw-1024.voc/valid'
# Optional test data path. If given, test paths (data) will be used in
# `eval.py`.
#TEST_DIR_IMAGES: '../input/Aquarium Combined.v2-raw-1024.voc/test'
#TEST_DIR_LABELS: '../input/Aquarium Combined.v2-raw-1024.voc/test'

# Class names.
CLASSES: [
    '__background__',
    'fish', 'jellyfish', 'penguin',
    'shark', 'puffin', 'stingray',
    'starfish'
]

# Number of classes (object classes + 1 for background class in Faster RCNN).
NC: 8

# Whether to save the predictions of the validation set while training.
SAVE_VALID_PREDICTION_IMAGES: True

JerickoDG commented 1 year ago

Hi, I was experimenting with the aquarium dataset using Google Colab. I got the same result as yours with --imgsz 640, which is the script's default. However, when I adjusted the image size to --imgsz 320, the mAPs became inconsistent. I stopped at the third epoch and resumed from there up to the fifth and last epoch.

Initial Training command: !python /content/fastercnn-pytorch-training-pipeline/train.py --data /content/fastercnn-pytorch-training-pipeline/data_configs/custom_data.yaml --epochs 5 --model fasterrcnn_resnet50_fpn_v2 --name custom_training_320 --batch 4 --imgsz 320

Resume Training command: !python /content/fastercnn-pytorch-training-pipeline/train.py --data /content/fastercnn-pytorch-training-pipeline/data_configs/custom_data.yaml --weights /content/outputs/training/custom_training_320/last_model.pth --epochs 5 --model fasterrcnn_resnet50_fpn_v2 --name custom_training_320_2 --batch 4 --imgsz 320 --resume

Evaluation command: !python /content/fastercnn-pytorch-training-pipeline/eval.py --model fasterrcnn_resnet50_fpn_v2 --weights /content/outputs/training/custom_training_320_2/last_model_state.pth --data /content/fastercnn-pytorch-training-pipeline/data_configs/custom_data.yaml --imgsz 320

This is the result from wandb (from resumed epoch (third) up to the fifth and last epoch):

wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb: train_loss_box_reg █▃▁
wandb:     train_loss_cls █▃▁
wandb:   train_loss_epoch █▃▁
wandb:    train_loss_iter █▃▄▆▃▃▄▆▃▄▄▃▃▄▃▃▆▄▄▃▄▅▄▃▅▁▃▃▂▃▂▃▃▂▃▃▃▄▄▁
wandb:     train_loss_obj █▃▁
wandb:     train_loss_rpn █▄▁
wandb:         val_map_05 ▁▅█
wandb:      val_map_05_95 ▁▄█
wandb: 
wandb: Run summary:
wandb: train_loss_box_reg 0.28525
wandb:     train_loss_cls 0.26768
wandb:   train_loss_epoch 0.64183
wandb:    train_loss_iter 0.23688
wandb:     train_loss_obj 0.04039
wandb:     train_loss_rpn 0.04852
wandb:         val_map_05 0.45273
wandb:      val_map_05_95 0.22893

This is the result using eval.py:

{'classes': tensor([1, 2, 3, 4, 5, 6, 7], dtype=torch.int32),
 'map': tensor(0.1700),
 'map_50': tensor(0.3747),
 'map_75': tensor(0.1317),
 'map_large': tensor(0.2711),
 'map_medium': tensor(0.1956),
 'map_per_class': tensor(-1.),
 'map_small': tensor(0.2009),
 'mar_1': tensor(0.1268),
 'mar_10': tensor(0.3182),
 'mar_100': tensor(0.4018),
 'mar_100_per_class': tensor(-1.),
 'mar_large': tensor(0.5470),
 'mar_medium': tensor(0.4388),
 'mar_small': tensor(0.3780)}

Thus, I am hypothesizing that the problem might be caused by the image size. Please let me know your thoughts on this as well. Thank you.

sovit-123 commented 1 year ago

I am laying out all the details here. You should use --square-training in evaluation only if you used it in training. Please try --square-training with both training and evaluation at size 320 once. It will further help me debug.

Also, it is very odd that this only happens at size 320 and not at 640. I had not expected that and did not test for it either.

Please try with 320 and --square-training and let me know.

In short, square training resizes the images to 320x320 or 640x640 depending on the value. Otherwise, aspect-ratio resizing happens based on the value of --imgsz.
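
As a rough sketch of the difference (illustrative only; the actual transforms in the repository are implemented differently):

import cv2

def square_resize(image, size):
    # Square training: force the image to size x size, ignoring aspect ratio.
    return cv2.resize(image, (size, size))

def aspect_ratio_resize(image, size):
    # Default: scale the image based on --imgsz while keeping the aspect
    # ratio (here the longer side is scaled to `size`; the exact rule in
    # the repository may differ).
    h, w = image.shape[:2]
    scale = size / max(h, w)
    return cv2.resize(image, (int(w * scale), int(h * scale)))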

JerickoDG commented 1 year ago

Hi, I tried adding --square-training with --imgsz 320 and the mAPs were still inconsistent. Here are the results:

From wandb:

wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb: train_loss_box_reg █▃▁
wandb:     train_loss_cls █▃▁
wandb:   train_loss_epoch █▃▁
wandb:    train_loss_iter █▂▃▇▃▂▄▇▃▃▄▂▃▄▃▂▆▆▄▃▃▄▄▃▆▁▃▃▂▃▁▃▂▂▃▃▃▄▅▁
wandb:     train_loss_obj █▄▁
wandb:     train_loss_rpn █▄▁
wandb:         val_map_05 ▁▅█
wandb:      val_map_05_95 ▁▄█
wandb: 
wandb: Run summary:
wandb: train_loss_box_reg 0.29395
wandb:     train_loss_cls 0.26429
wandb:   train_loss_epoch 0.6994
wandb:    train_loss_iter 0.34632
wandb:     train_loss_obj 0.06918
wandb:     train_loss_rpn 0.07198
wandb:         val_map_05 0.40372
wandb:      val_map_05_95 0.19856

From eval.py:

{'classes': tensor([1, 2, 3, 4, 5, 6, 7], dtype=torch.int32),
 'map': tensor(0.1345),
 'map_50': tensor(0.2804),
 'map_75': tensor(0.1145),
 'map_large': tensor(0.1694),
 'map_medium': tensor(0.1385),
 'map_per_class': tensor(-1.),
 'map_small': tensor(0.1547),
 'mar_1': tensor(0.1034),
 'mar_10': tensor(0.2646),
 'mar_100': tensor(0.3302),
 'mar_100_per_class': tensor(-1.),
 'mar_large': tensor(0.4425),
 'mar_medium': tensor(0.2960),
 'mar_small': tensor(0.3243)}

Let me confirm with --imgsz 160. I will also recheck with --imgsz 640 just to be sure. I'll provide an update shortly.

JerickoDG commented 1 year ago

Hi. I tried again with --imgsz 640 and --square-training. The mAPs were relatively close to each other, differing only by small decimal amounts. However, when I adjusted the value to --imgsz 160, the mAPs again differed between the wandb results and the eval.py results.

sovit-123 commented 1 year ago

Hello. Can you open the training log file and check the best mAP results? I have a feeling that WandB is reporting the best model results while you are evaluating using the last model. I may be wrong though. Please keep me posted.

JerickoDG commented 1 year ago

Sure. This is for the train.log that was generated when the training was resumed. For the previously discussed --imgsz 320 with --square-training experiment, the mAPs from the train.log file and the mAPs shown in the wandb output were the same. Also, in the train.log, the last (fifth) epoch had the highest mAP compared to previous epochs.

sovit-123 commented 1 year ago

Ok. I will take a look.

JerickoDG commented 1 year ago

Okay, thank you very much.

sovit-123 commented 1 year ago

Hi. I did a thorough check and it seems the issue is not with pycocotools/torchmetrics. Here are the evaluation results from both:

Torchmetrics

{'classes': tensor([1, 2, 3, 4, 5, 6, 7], dtype=torch.int32),
 'map': tensor(0.1302),
 'map_50': tensor(0.2933),
 'map_75': tensor(0.0970),
 'map_large': tensor(0.1804),
 'map_medium': tensor(0.1303),
 'map_per_class': tensor([0.1623, 0.2512, 0.0655, 0.0846, 0.0162, 0.1514, 0.1802]),
 'map_small': tensor(0.1129),
 'mar_1': tensor(0.1116),
 'mar_10': tensor(0.2763),
 'mar_100': tensor(0.3382),
 'mar_100_per_class': tensor([0.3854, 0.4335, 0.2587, 0.2509, 0.1784, 0.4788, 0.3815]),
 'mar_large': tensor(0.4500),
 'mar_medium': tensor(0.3378),
 'mar_small': tensor(0.3006)}

pycocotools

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.130
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.293
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.097
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.113
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.131
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.180
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.112
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.276
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.338
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.301
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.337
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.450

Somehow the last epoch results and the evaluation script results are not matching when image size is less than 640. At this point, I will need to dig deeper into what is happening.

JerickoDG commented 1 year ago

Hi. Thanks for your update.

May I ask how you used Torchmetrics to get the correct mAP metrics when the image size is less than 640, so I can try it myself while you're digging deeper into what is happening?

sovit-123 commented 1 year ago

No, I said the results are not matching when the image size is less than 640. When the image size is 640 or higher they are matching.

I hope this is clear.

Let me know if you have any more questions.

sovit-123 commented 1 year ago

Oh. Wait. Are you asking about the matching numbers between torchmetrics and pycocotools?

JerickoDG commented 1 year ago

Apologies, let me edit my reply there. Yes, I would like to try what you did as well. Did you still use eval.py for that?

sovit-123 commented 1 year ago

So, I just used the latest version of torchmetrics, which uses pycocotools as the backend by default. You can update yours as well; it is more reliable.

JerickoDG commented 1 year ago

Okay, noted. Apologies, as I am not too knowledgeable about the library/framework, but may I ask whether I should still use eval.py after updating torchmetrics, or do I need to create a script that uses this module from torchmetrics: https://torchmetrics.readthedocs.io/en/stable/detection/mean_average_precision.html

sovit-123 commented 1 year ago

You can use Torchmetrics. It's pretty good now.
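
If you want to check outside of eval.py, a minimal sketch with dummy boxes looks like this (the backend argument is available in recent torchmetrics releases; drop it if your installed version does not accept it):

import torch
from torchmetrics.detection import MeanAveragePrecision

# Dummy example: replace preds/target with your model outputs and ground
# truth. Boxes are in xyxy format, absolute pixel coordinates.
preds = [{
    'boxes': torch.tensor([[50.0, 50.0, 150.0, 150.0]]),
    'scores': torch.tensor([0.9]),
    'labels': torch.tensor([1]),
}]
target = [{
    'boxes': torch.tensor([[55.0, 45.0, 145.0, 155.0]]),
    'labels': torch.tensor([1]),
}]

metric = MeanAveragePrecision(box_format='xyxy', backend='pycocotools')
metric.update(preds, target)
print(metric.compute())  # dict with 'map', 'map_50', 'map_75', ...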