JerickoDG opened this issue 1 year ago.
Hi. There are a few things to consider here. If the dataset YAML file contains a test set, the evaluation script will automatically use the test set for evaluation; otherwise, it uses the validation set.
However, it seems that you have already changed that, so only one thing remains. Can you please check the log file in the training output directory and see whether the final epoch's log matches the evaluation results? I will need to check as well, but from my multiple experiments I can confirm that the mAP should be the same. Instead of the WandB logs, please check the model logs from the directory once.
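For reference, the split selection amounts to roughly this (a minimal sketch of the idea, not the repository's actual code; the key names are the ones used in the data configs later in this thread):

```python
# Sketch only (assumed behavior): prefer the test split when the data config
# defines it, otherwise fall back to the validation split.
import yaml

def pick_eval_split(config_path: str):
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    if cfg.get("TEST_DIR_IMAGES") and cfg.get("TEST_DIR_LABELS"):
        return cfg["TEST_DIR_IMAGES"], cfg["TEST_DIR_LABELS"]
    return cfg["VALID_DIR_IMAGES"], cfg["VALID_DIR_LABELS"]

# Example: pick_eval_split("data_configs/custom_data.yaml")
```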
Hi. Thank you for your response. By *model logs*, do you mean the train.log file? If yes, it did not have any contents (0 KB file size). I am currently guessing that it was overwritten when I executed:
python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --resume
I executed that resume-training command whenever an error stopped the training.
Thus, I think passing --name custom_training might have overwritten train.log, since the training had stopped (due to an error) at the last epoch while saving the best model. However, the last (20th) epoch was recorded in the results.csv file (meaning it had finished), which could mean that when I entered the aforementioned command, train.log was overwritten with nothing because there was nothing left to continue: the training had already reached the last epoch.
That is only my guess as to why the content of train.log is empty. Please share your thoughts if you agree or have other guesses.
Yes, that may be the reason it is blank.
Also, please check that --imgsz was 320 during training. It should be the same during training and evaluation; otherwise, the numbers will be different.
I will also check one thing from my side: whether WandB shows the mAP from the best epoch or the last epoch at the end. From multiple experiments I know that the numbers should match.
With that said, should I omit --name custom_training from my resume-training command so that it stops overwriting the train.log covering the epochs from where I resumed up to where I stopped?
The image size in my opt.yaml file is 320. This is the content of the file:
model: fasterrcnn_resnet50_fpn_v2
data: data_configs\custom_data.yaml
device: cuda
epochs: 20
workers: 4
batch: 4
lr: 0.001
imgsz: 320
name: custom_training
vis_transformed: false
mosaic: 0.0
use_train_aug: false
cosine_annealing: false
weights: outputs\training\custom_training\last_model.pth
resume_training: true
square_training: false
world_size: 1
dist_url: env://
disable_wandb: false
sync_bn: false
amp: false
seed: 0
project_dir: null
distributed: false
Yes, you can give a different name. It will resume training, but the resulting directory will be different. You will have separate logs for both trainings for later analysis.
Noted. May I ask whether the graphs (the .png files of mAP and losses) will still be consistent or continuous across those separate resulting directories?
For instance, in output/training/res_1 I trained my model from epochs 1 to 5, then stopped. After that, I resumed from epochs 6 to 10, stored in output/training/res_2, then stopped again to resume training the next day. Finally, I resumed from epochs 11 to 20 (stored in output/training/res_3) to end the training.
Will the graphs (the .png files of mAP and losses) still be consistent or continuous in that case?
Yes, they will be consistent. The train loss list, train loss per epoch, and epoch graphs will be consistent. The individual loss graphs will be created anew each time.
Okay, noted. I will try to train again with stop and resume, then observe and provide an update about the outcome, especially regarding the consistency of the mAP output between WandB and the local model directory. Thank you very much again for your answers.
No issues. Let me know.
Hi. I would like to provide an update regarding my observation.
I tried training the model for 4 epochs. I stopped the training during the 3rd epoch (2 if zero-indexed) and resumed up to the fourth and final epoch using the command:
python train.py --data data_configs\custom_data.yaml --weights outputs\training\res_1\last_model.pth --model fasterrcnn_resnet50_fpn_v2 --imgsz 320 --epochs 4 --resume
Thus, there were two folders in outputs/training: res_1 and res_2. I noticed the following:
- train.log only contains the training info logs from the third and fourth epochs (2 and 3 if zero-indexed). The same goes for the content of results.csv in the res_2 folder.
- mAP.png, train_loss_epoch.png, and train_loss_iter.png showed continuous graphs, but train_loss_bbox_reg.png, train_loss_cls.png, train_loss_obj.png, and train_loss_rpn_bbox.png only had two points each (for res_2, I assume they are from the third and fourth epochs, i.e., 2 and 3 if zero-indexed).

The results of the final (4th) epoch from WandB are the following:
wandb: Run summary:
wandb: train_loss_box_reg 0.03585
wandb: train_loss_cls 0.01717
wandb: train_loss_epoch 0.05724
wandb: train_loss_iter 0.0605
wandb: train_loss_obj 0.00183
wandb: train_loss_rpn 0.00238
wandb: val_map_05 0.88586
wandb: val_map_05_95 0.5767
The results of the final (4th) epoch using eval.py with the command:
python eval.py --data data_configs/custom_data.yaml --weights outputs/training/res_2/last_model.pth --model fasterrcnn_resnet50_fpn_v2 --imgsz 320 --verbose
{'classes': tensor([1, 2], dtype=torch.int32),
'map': tensor(0.3774),
'map_50': tensor(0.7105),
'map_75': tensor(0.3364),
'map_large': tensor(0.4573),
'map_medium': tensor(0.3382),
'map_per_class': tensor([0.4867, 0.2680]),
'map_small': tensor(0.2141),
'mar_1': tensor(0.4326),
'mar_10': tensor(0.5235),
'mar_100': tensor(0.5241),
'mar_100_per_class': tensor([0.6467, 0.4016]),
'mar_large': tensor(0.5880),
'mar_medium': tensor(0.4661),
'mar_small': tensor(0.3583)}
As can be seen, the results for mAP@0.5 and mAP@0.5:0.95 were still different for the validation data.
For additional information, here are the contents of the opt.yaml from the res_2 folder:
model: fasterrcnn_resnet50_fpn_v2
data: data_configs\custom_data.yaml
device: cuda
epochs: 4
workers: 4
batch: 4
lr: 0.001
imgsz: 320
name: null
vis_transformed: false
mosaic: 0.0
use_train_aug: false
cosine_annealing: false
weights: outputs\training\res_1\last_model.pth
resume_training: true
square_training: false
world_size: 1
dist_url: env://
disable_wandb: false
sync_bn: false
amp: false
seed: 0
project_dir: null
distributed: false
And here are the contents of custom_data.yaml:
TRAIN_DIR_IMAGES: 'custom_data/train'
TRAIN_DIR_LABELS: 'custom_data/train'
VALID_DIR_IMAGES: 'custom_data/valid'
VALID_DIR_LABELS: 'custom_data/valid'
valid_DIR_IMAGES: 'custom_data/valid'
valid_DIR_LABELS: 'custom_data/valid'
CLASSES: [ 'background', 'handgun', 'knife' ]
NC: 3
SAVE_VALID_PREDICTION_IMAGES: True
As of now, I am thinking that the problem might come from how the training pipeline resumes training from the most recent epoch, but I am not sure specifically why or where, because when I trained continuously (no stop and resume), the mAPs (0.5 and 0.5:0.95) from WandB and eval.py on the validation set were equal, or at least very close to each other.
May I know your thoughts about this and the possible fix we can do? I hope for your response. Thank you very much.
If you are getting the same mAP with continuous training but not with resumed training, then I will take a look. I have not compared mAP values with resumed training yet, so I guess I missed this issue. Thanks for bringing it up.
For now, I can say that you can safely resume training. Just be sure not to rely on the WandB logs, and run eval.py using the best saved model. I am sure that you will have no issues with that.
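For example, something like the following (assuming the best checkpoint is saved as best_model.pth in the run directory; adjust the path and flags to match your training run):
python eval.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs/training/custom_training/best_model.pth --data data_configs/custom_data.yaml --imgsz 320 --verbose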
Let me know if you want to keep this issue open or I can close it and keep you posted on the progress.
Okay, noted. I would like to keep this issue open until it is resolved; please keep me posted on this thread as well. I would like to try training (with stop and resume) and observe again as soon as the fixed version gets pushed and merged into the main branch.
Thank you very much.
So, I stopped training and resumed. This is from WandB from the last epoch after resuming.
wandb: Waiting for W&B process to finish... (success).
wandb: | 170.694 MB of 170.694 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: train_loss_box_reg █▃▁
wandb: train_loss_cls █▃▁
wandb: train_loss_epoch █▃▁
wandb: train_loss_iter █▃▃▃▃█▄▅▃▆▅▃▂▅▂▁▅▂▁▂▃▃▃▁▂▃▃▂▂▂▃▅▃▃▂▂▂▁▂▅
wandb: train_loss_obj █▃▁
wandb: train_loss_rpn █▄▁
wandb: val_map_05 ▁▄█
wandb: val_map_05_95 ▁▄█
wandb:
wandb: Run summary:
wandb: train_loss_box_reg 0.28041
wandb: train_loss_cls 0.2824
wandb: train_loss_epoch 0.62553
wandb: train_loss_iter 1.20801
wandb: train_loss_obj 0.03414
wandb: train_loss_rpn 0.02859
wandb: val_map_05 0.45671
wandb: val_map_05_95 0.23937
And this is after running evaluation using the following command.
python eval.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs/training/resume_test_2/last_model_state.pth --data data_configs/aquarium.yaml --imgsz 640 --square-training
'map': tensor(0.2399),
'map_50': tensor(0.4626),
I think they are pretty close. Whatever small floating-point difference remains may be because the training/validation loop uses pycocotools while eval.py uses torchmetrics. But I think the mAP is consistent.
Let me know your thoughts.
Hi. Thanks for the update. May I ask if --square-training is recommended for the eval.py command?
May I also ask for your initial training command, how you stopped the initial training, and the resume-training command you used? Maybe the probable cause lies there; just a guess for now, though.
May I also take a look at the custom_data.yaml you configured and used?
Sure. I used the aquarium dataset whose YAML file already comes with the repository. I commented out the test paths.
# Images and labels directory should be relative to train.py
TRAIN_DIR_IMAGES: '../input/Aquarium Combined.v2-raw-1024.voc/train'
TRAIN_DIR_LABELS: '../input/Aquarium Combined.v2-raw-1024.voc/train'
VALID_DIR_IMAGES: '../input/Aquarium Combined.v2-raw-1024.voc/valid'
VALID_DIR_LABELS: '../input/Aquarium Combined.v2-raw-1024.voc/valid'
# Optional test data path. If given, test paths (data) will be used in
# `eval.py`.
#TEST_DIR_IMAGES: '../input/Aquarium Combined.v2-raw-1024.voc/test'
#TEST_DIR_LABELS: '../input/Aquarium Combined.v2-raw-1024.voc/test'
# Class names.
CLASSES: [
'__background__',
'fish', 'jellyfish', 'penguin',
'shark', 'puffin', 'stingray',
'starfish'
]
# Number of classes (object classes + 1 for background class in Faster RCNN).
NC: 8
# Whether to save the predictions of the validation set while training.
SAVE_VALID_PREDICTION_IMAGES: True
Hi, I was experimenting with the aquarium dataset using Google Colab. I had the same result as yours when using --imgsz 640, which, according to the script, is the default. However, when I adjusted the image size to --imgsz 320, the mAPs became inconsistent. I stopped at the third epoch and finished from there up to the fifth and last epoch.
Initial Training command: !python /content/fastercnn-pytorch-training-pipeline/train.py --data /content/fastercnn-pytorch-training-pipeline/data_configs/custom_data.yaml --epochs 5 --model fasterrcnn_resnet50_fpn_v2 --name custom_training_320 --batch 4 --imgsz 320
Resume Training command: !python /content/fastercnn-pytorch-training-pipeline/train.py --data /content/fastercnn-pytorch-training-pipeline/data_configs/custom_data.yaml --weights /content/outputs/training/custom_training_320/last_model.pth --epochs 5 --model fasterrcnn_resnet50_fpn_v2 --name custom_training_320_2 --batch 4 --imgsz 320 --resume
Evaluation command: !python /content/fastercnn-pytorch-training-pipeline/eval.py --model fasterrcnn_resnet50_fpn_v2 --weights /content/outputs/training/custom_training_320_2/last_model_state.pth --data /content/fastercnn-pytorch-training-pipeline/data_configs/custom_data.yaml --imgsz 320
This is the result from WandB (from the resumed epoch (third) up to the fifth and last epoch):
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb: train_loss_box_reg █▃▁
wandb: train_loss_cls █▃▁
wandb: train_loss_epoch █▃▁
wandb: train_loss_iter █▃▄▆▃▃▄▆▃▄▄▃▃▄▃▃▆▄▄▃▄▅▄▃▅▁▃▃▂▃▂▃▃▂▃▃▃▄▄▁
wandb: train_loss_obj █▃▁
wandb: train_loss_rpn █▄▁
wandb: val_map_05 ▁▅█
wandb: val_map_05_95 ▁▄█
wandb:
wandb: Run summary:
wandb: train_loss_box_reg 0.28525
wandb: train_loss_cls 0.26768
wandb: train_loss_epoch 0.64183
wandb: train_loss_iter 0.23688
wandb: train_loss_obj 0.04039
wandb: train_loss_rpn 0.04852
wandb: val_map_05 0.45273
wandb: val_map_05_95 0.22893
This is the result using eval.py:
{'classes': tensor([1, 2, 3, 4, 5, 6, 7], dtype=torch.int32),
'map': tensor(0.1700),
'map_50': tensor(0.3747),
'map_75': tensor(0.1317),
'map_large': tensor(0.2711),
'map_medium': tensor(0.1956),
'map_per_class': tensor(-1.),
'map_small': tensor(0.2009),
'mar_1': tensor(0.1268),
'mar_10': tensor(0.3182),
'mar_100': tensor(0.4018),
'mar_100_per_class': tensor(-1.),
'mar_large': tensor(0.5470),
'mar_medium': tensor(0.4388),
'mar_small': tensor(0.3780)}
Thus, I am hypothesizing that the problem might be caused by the image size. Please let me know your thoughts on this as well. Thank you.
I am laying out all the details here.
You should use --square-training in evaluation only if you used it in training. Please try --square-training in both training and evaluation with size 320 once. It will further help me to debug.
Also, it is very odd that this is only happening with size 320 and not with 640. I had not expected that and did not test for it either.
Please try 320 with --square-training and let me know.
In short, square training resizes the images to 320x320 or 640x640 depending on the value. Otherwise, aspect-ratio resizing happens depending on the value of --imgsz.
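Roughly, the difference can be pictured like this (my own minimal sketch, assuming the non-square mode scales the longer side to --imgsz; this is not the pipeline's actual resizing code):

```python
from PIL import Image

def resize_for_training(img: Image.Image, imgsz: int, square: bool) -> Image.Image:
    """Illustrative only: square mode vs. aspect-ratio-preserving mode."""
    if square:
        # --square-training: every image becomes imgsz x imgsz (e.g. 320x320 or 640x640)
        return img.resize((imgsz, imgsz))
    # Otherwise (assumed): scale so the longer side equals imgsz, keeping the aspect ratio
    scale = imgsz / max(img.width, img.height)
    return img.resize((round(img.width * scale), round(img.height * scale)))
```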
Hi, I tried adding --square-training with --imgsz 320, and the mAPs were still inconsistent. Here are the results:
From WandB:
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb: train_loss_box_reg █▃▁
wandb: train_loss_cls █▃▁
wandb: train_loss_epoch █▃▁
wandb: train_loss_iter █▂▃▇▃▂▄▇▃▃▄▂▃▄▃▂▆▆▄▃▃▄▄▃▆▁▃▃▂▃▁▃▂▂▃▃▃▄▅▁
wandb: train_loss_obj █▄▁
wandb: train_loss_rpn █▄▁
wandb: val_map_05 ▁▅█
wandb: val_map_05_95 ▁▄█
wandb:
wandb: Run summary:
wandb: train_loss_box_reg 0.29395
wandb: train_loss_cls 0.26429
wandb: train_loss_epoch 0.6994
wandb: train_loss_iter 0.34632
wandb: train_loss_obj 0.06918
wandb: train_loss_rpn 0.07198
wandb: val_map_05 0.40372
wandb: val_map_05_95 0.19856
From eval.py:
{'classes': tensor([1, 2, 3, 4, 5, 6, 7], dtype=torch.int32),
'map': tensor(0.1345),
'map_50': tensor(0.2804),
'map_75': tensor(0.1145),
'map_large': tensor(0.1694),
'map_medium': tensor(0.1385),
'map_per_class': tensor(-1.),
'map_small': tensor(0.1547),
'mar_1': tensor(0.1034),
'mar_10': tensor(0.2646),
'mar_100': tensor(0.3302),
'mar_100_per_class': tensor(-1.),
'mar_large': tensor(0.4425),
'mar_medium': tensor(0.2960),
'mar_small': tensor(0.3243)}
Let me confirm with --imgsz 160. I will also recheck with --imgsz 640 just to be sure. I'll provide an update shortly.
Hi. I tried again with --imgsz 640 and --square-training. The mAPs were relatively close to each other, differing only by a small decimal amount. However, when I adjusted the value to --imgsz 160, the mAPs became different between the WandB results and the eval.py results.
Hello. Can you open the training log file and check the best mAP results? I have a feeling that WandB is reporting the best model results while you are evaluating using the last model. I may be wrong though. Please keep me posted.
Sure. This is from the train.log that was generated when the training was resumed. For the previously observed experiment with --imgsz 320 and --square-training, the mAPs from the train.log file and the mAPs shown in the WandB output were the same. Also, in the train.log, the last (fifth) epoch had the highest mAP compared to the previous epochs.
Ok. I will take a look.
Okay, thank you very much.
Hi. I did a thorough check and it seems the issue is not with pycocotools/torchmetrics. Here are the evaluation results from both:
Torchmetrics
{'classes': tensor([1, 2, 3, 4, 5, 6, 7], dtype=torch.int32),
'map': tensor(0.1302),
'map_50': tensor(0.2933),
'map_75': tensor(0.0970),
'map_large': tensor(0.1804),
'map_medium': tensor(0.1303),
'map_per_class': tensor([0.1623, 0.2512, 0.0655, 0.0846, 0.0162, 0.1514, 0.1802]),
'map_small': tensor(0.1129),
'mar_1': tensor(0.1116),
'mar_10': tensor(0.2763),
'mar_100': tensor(0.3382),
'mar_100_per_class': tensor([0.3854, 0.4335, 0.2587, 0.2509, 0.1784, 0.4788, 0.3815]),
'mar_large': tensor(0.4500),
'mar_medium': tensor(0.3378),
'mar_small': tensor(0.3006)}
pycocotools
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.130
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.293
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.097
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.113
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.131
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.180
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.112
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.276
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.338
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.301
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.337
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.450
Somehow the last-epoch results and the evaluation script results do not match when the image size is less than 640. At this point, I will need to dig deeper into what is happening.
Hi. Thanks for your update.
May I ask how you used Torchmetrics to get the correct mAP metrics when the image size is less than 640, so I can try it myself while you're digging deeper into what is happening?
No, I said the results are not matching when the image size is less than 640. When the image size is 640 or higher they are matching.
I hope this is clear.
Let me know if you have any more questions.
Oh. Wait. Are you asking about the matching numbers between torchmetrics and pycocotools?
Apologies, let me edit my reply there. Yes, I would like to try what you did as well. Did you still use eval.py for that?
So, I just used the latest version of torchmetrics, which uses pycocotools as the backend by default. You can update yours as well; it is more reliable.
Okay, noted. Apologies, as I am not too knowledgeable about the library/framework, but may I ask whether I should still use eval.py after updating torchmetrics, or do I need to create a script that uses this module from torchmetrics: https://torchmetrics.readthedocs.io/en/stable/detection/mean_average_precision.html
You can use Torchmetrics. It's pretty good now.
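In case it helps, standalone usage of that module looks roughly like this (a minimal sketch with made-up boxes; in practice you would feed it your model's predictions and your dataset's ground truth):

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# One prediction dict and one target dict per image.
preds = [{
    "boxes": torch.tensor([[10.0, 20.0, 100.0, 120.0]]),  # xyxy format
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([1]),
}]
targets = [{
    "boxes": torch.tensor([[12.0, 22.0, 98.0, 118.0]]),
    "labels": torch.tensor([1]),
}]

metric = MeanAveragePrecision()
metric.update(preds, targets)
print(metric.compute())  # dict with 'map', 'map_50', 'map_75', 'mar_100', ...
```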
After finishing 20 epochs, the terminal displayed this:
wandb: Run summary:
wandb: train_loss_box_reg 0.01077
wandb: train_loss_cls 0.00603
wandb: train_loss_epoch 0.01768
wandb: train_loss_iter 0.01653
wandb: train_loss_obj 0.00029
wandb: train_loss_rpn 0.00059
wandb: val_map_05 0.87624
wandb: val_map_05_95 0.58491
As can be seen, the mAP@0.5 and mAP@0.5:0.95 are relatively high for the last model state. To verify this, I tried evaluating last_model.pth using the eval.py script with this command:
python eval.py --weights outputs/training/custom_training/last_model.pth --data data_configs/custom_data.yaml --model fasterrcnn_resnet50_fpn_v2 --imgsz 320 --verbose
Which gave me this result:
{'classes': tensor([1, 2], dtype=torch.int32),
'map': tensor(0.3669),
'map_50': tensor(0.6886),
'map_75': tensor(0.3405),
'map_large': tensor(0.4444),
'map_medium': tensor(0.3408),
'map_per_class': tensor([0.4621, 0.2718]),
'map_small': tensor(0.2078),
'mar_1': tensor(0.4233),
'mar_10': tensor(0.4997),
'mar_100': tensor(0.5007),
'mar_100_per_class': tensor([0.6162, 0.3851]),
'mar_large': tensor(0.5776),
'mar_medium': tensor(0.4468),
'mar_small': tensor(0.3199)}
As can be seen, the resulting mAP@0.5 and mAP@0.5:0.95 were different from the WandB values; the mAPs reported by eval.py were lower.
Also, it seems that the script automatically uses the TEST_DIR_IMAGES/TEST_DIR_LABELS paths if they exist, so I changed them to the validation set paths like this:
TRAIN_DIR_IMAGES: 'custom_data/train'
TRAIN_DIR_LABELS: 'custom_data/train'
VALID_DIR_IMAGES: 'custom_data/valid'
VALID_DIR_LABELS: 'custom_data/valid'
TEST_DIR_IMAGES: 'custom_data/valid'
TEST_DIR_LABELS: 'custom_data/valid'
I am not sure where I went wrong. Do you happen to know the probable cause?