sovit-123 / fasterrcnn-pytorch-training-pipeline

PyTorch Faster R-CNN Object Detection on Custom Dataset
MIT License
205 stars 66 forks

Number of training and validation samples get doubled at training process #65

Open huyhuyvu01 opened 1 year ago

huyhuyvu01 commented 1 year ago

As the title states, when I train on a custom dataset, the numbers of training and validation samples reported in the command prompt are doubled, which I think affects both the training speed and the accuracy. My dataset consists of 361 training samples and 42 validation samples.

Here is my training command:

python train.py --data data_configs/traffic.yaml --epochs 50 --model fasterrcnn_resnet50_fpn --name trafficSign_detection_no_bg --batch 12

Here is my data config file:

TRAIN_DIR_IMAGES: data/traffic-sign/train/images
TRAIN_DIR_LABELS: data/traffic-sign/train/annotations
VALID_DIR_IMAGES: data/traffic-sign/valid/images
VALID_DIR_LABELS: data/traffic-sign/valid/annotations

CLASSES: [
    'cam_di_nguoc_chieu', 'cam_oto', 'cam_oto_re_phai', 'cam_mo_to',
    'cam_oto_va_moto', 'cam_ng_di_bo', 'cam_re_trai', 'cam_re_phai',
    'cam_quay_dau_trai', 'max_spd_40', 'max_spd_50', 'max_spd_60',
    'max_spd_80', 'cam_dung_do', 'cam_do', 'duong_giao_nhau',
    'giao_nhau_vs_ko_uu_tien', 'giao_nhau_vs_ko_uu_tien_trai',
    'giao_nhau_vs_uu_tien', 'dg_co_ng_di_bo_cat_ngang', 'tre_em_qua_duong',
    'cong_truong', 'day_cap', 'slow', 'huong_phai_di', 'danh_cho_ng_di_bo',
    'dg_mot_chieu', 'dg_cho_oto',
    'background',
]
NC: 28
SAVE_VALID_PREDICTION_IMAGES: True

The training process takes a very long time, more than 2 hours on an RTX 3060, and the precision is very low: the mAP is around 0.23 after 50 epochs with a batch size of 12.

sovit-123 commented 1 year ago

Can you paste the output of the first few lines here?

huyhuyvu01 commented 1 year ago

> Can you paste the output of the first few lines here?

Here are the first few lines of the command-line output:

(DeepLearning) D:\Knowledge\MachineLearning\FasterRCNN>python train.py --data data_configs/traffic.yaml --epochs 50 --model fasterrcnn_resnet50_fpn --name trafficSign_detection_no_bg --batch 12
Not using distributed mode
wandb: Currently logged in as: huyhuyvu01. Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.15.0
wandb: Run data is saved locally in D:\Knowledge\MachineLearning\FasterRCNN\wandb\run-20230429_185442-lb7qygux
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run trafficSign_detection_no_bg
wandb: View project at https://wandb.ai/huyhuyvu01/uncategorized
wandb: View run at https://wandb.ai/huyhuyvu01/uncategorized/runs/lb7qygux
device cuda
Creating data loaders
Number of training samples: 722
Number of validation samples: 84

Building model from scratch...
C:\Users\huyhu\anaconda3\envs\DeepLearning\lib\site-packages\torchvision\models\_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
C:\Users\huyhu\anaconda3\envs\DeepLearning\lib\site-packages\torchvision\models\_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=FasterRCNN_ResNet50_FPN_Weights.COCO_V1. You can also use weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
C:\Users\huyhu\anaconda3\envs\DeepLearning\lib\site-packages\torchinfo\torchinfo.py:477: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  action_fn=lambda data: sys.getsizeof(data.storage()),
C:\Users\huyhu\anaconda3\envs\DeepLearning\lib\site-packages\torch\storage.py:665: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return super().__sizeof__() + self.nbytes()

Layer (type:depth-idx)              Output Shape           Param #

FasterRCNN                          [100, 4]               --
├─GeneralizedRCNNTransform: 1-1     [12, 3, 800, 800]      --
├─BackboneWithFPN: 1-2              [12, 256, 13, 13]      --
│ └─IntermediateLayerGetter: 2-1    [12, 2048, 25, 25]     --
│ │ └─Conv2d: 3-1                   [12, 64, 400, 400]     (9,408)
│ │ └─FrozenBatchNorm2d: 3-2        [12, 64, 400, 400]     --
│ │ └─ReLU: 3-3                     [12, 64, 400, 400]     --
│ │ └─MaxPool2d: 3-4                [12, 64, 200, 200]     --
│ │ └─Sequential: 3-5               [12, 256, 200, 200]    (212,992)
│ │ └─Sequential: 3-6               [12, 512, 100, 100]    1,212,416
│ │ └─Sequential: 3-7               [12, 1024, 50, 50]     7,077,888
│ │ └─Sequential: 3-8               [12, 2048, 25, 25]     14,942,208
│ └─FeaturePyramidNetwork: 2-2      [12, 256, 13, 13]      --
│ │ └─ModuleList: 3-15              --                     (recursive)
│ │ └─ModuleList: 3-16              --                     (recursive)
│ │ └─ModuleList: 3-15              --                     (recursive)
│ │ └─ModuleList: 3-16              --                     (recursive)
│ │ └─ModuleList: 3-15              --                     (recursive)
│ │ └─ModuleList: 3-16              --                     (recursive)
│ │ └─ModuleList: 3-15              --                     (recursive)
│ │ └─ModuleList: 3-16              --                     (recursive)
│ │ └─LastLevelMaxPool: 3-17        [12, 256, 200, 200]    --
├─RegionProposalNetwork: 1-3        [1000, 4]              --
│ └─RPNHead: 2-3                    [12, 3, 200, 200]      --
│ │ └─Sequential: 3-18              [12, 256, 200, 200]    590,080
│ │ └─Conv2d: 3-19                  [12, 3, 200, 200]      771
│ │ └─Conv2d: 3-20                  [12, 12, 200, 200]     3,084
│ │ └─Sequential: 3-21              [12, 256, 100, 100]    (recursive)
│ │ └─Conv2d: 3-22                  [12, 3, 100, 100]      (recursive)
│ │ └─Conv2d: 3-23                  [12, 12, 100, 100]     (recursive)
│ │ └─Sequential: 3-24              [12, 256, 50, 50]      (recursive)
│ │ └─Conv2d: 3-25                  [12, 3, 50, 50]        (recursive)
│ │ └─Conv2d: 3-26                  [12, 12, 50, 50]       (recursive)
│ │ └─Sequential: 3-27              [12, 256, 25, 25]      (recursive)
│ │ └─Conv2d: 3-28                  [12, 3, 25, 25]        (recursive)
│ │ └─Conv2d: 3-29                  [12, 12, 25, 25]       (recursive)
│ │ └─Sequential: 3-30              [12, 256, 13, 13]      (recursive)
│ │ └─Conv2d: 3-31                  [12, 3, 13, 13]        (recursive)
│ │ └─Conv2d: 3-32                  [12, 12, 13, 13]       (recursive)
│ └─AnchorGenerator: 2-4            [159882, 4]            --
├─RoIHeads: 1-4                     [100, 4]               --
│ └─MultiScaleRoIAlign: 2-5         [12000, 256, 7, 7]     --
│ └─TwoMLPHead: 2-6                 [12000, 1024]          --
│ │ └─Linear: 3-33                  [12000, 1024]          12,846,080
│ │ └─Linear: 3-34                  [12000, 1024]          1,049,600
│ └─FastRCNNPredictor: 2-7          [12000, 28]            --
│ │ └─Linear: 3-35                  [12000, 28]            28,700
│ │ └─Linear: 3-36                  [12000, 112]           114,800

Total params: 41,432,411
Trainable params: 41,210,011
Non-trainable params: 222,400
Total mult-adds (T): 1.61

Input size (MB): 58.98
Forward/backward pass size (MB): 17816.70
Params size (MB): 165.73
Estimated Total Size (MB): 18041.42

41,432,411 total parameters.
41,210,011 training parameters.
Epoch: [0] [ 0/61] eta: 0:14:41 lr: 0.000018 loss: 4.2775 (4.2775) loss_classifier: 3.5476 (3.5476) loss_box_reg: 0.0632 (0.0632) loss_objectness: 0.3510 (0.3510) loss_rpn_box_reg: 0.3157 (0.3157) time: 14.4483 data: 12.7076 max mem: 8579
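A quick arithmetic cross-check of the log above: with a batch size of 12, the `[ 0/61]` iteration counter is what you would expect from 722 samples, not from 361. This suggests the doubling is in the data loader itself and not only in the printed message. The sketch below assumes the loader keeps the final partial batch; `iterations_per_epoch` is a hypothetical helper written for illustration and is not part of the repository.

```python
import math


def iterations_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a data loader yields per epoch.

    Assumption: when drop_last is False, the final partial batch is kept.
    """
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


batch_size = 12
print(iterations_per_epoch(722, batch_size))  # 61, matches "[ 0/61]" in the log
print(iterations_per_epoch(361, batch_size))  # 31, expected for the real dataset size
```

If this reading is right, each epoch runs roughly twice as many iterations as the dataset needs, which would account for part of the long training time.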

sovit-123 commented 1 year ago

That's odd. I never faced that issue. Just to be sure, can you please recheck your image directory once more?
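One way to recheck the directory is to count how many times each file name is picked up by a list of extension patterns. The cause of the doubling is not confirmed in this thread; the sketch below only illustrates one possibility, namely that on a case-insensitive file system (such as the default on Windows) patterns like `*.jpg` and `*.JPG` both match the same file. The pattern list and the helper names are hypothetical and are not taken from this repository's code.

```python
import fnmatch
from collections import Counter


def match_patterns(names, patterns, case_sensitive):
    """Return one entry per (pattern, name) match, keeping duplicates."""
    matched = []
    for pattern in patterns:
        for name in names:
            if case_sensitive:
                ok = fnmatch.fnmatchcase(name, pattern)
            else:
                # Emulate a case-insensitive file system by lowercasing both sides.
                ok = fnmatch.fnmatchcase(name.lower(), pattern.lower())
            if ok:
                matched.append(name)
    return matched


def duplicated(matched):
    """Names that were matched more than once."""
    return sorted(name for name, count in Counter(matched).items() if count > 1)


# Hypothetical pattern list and file names, for illustration only.
patterns = ["*.jpg", "*.JPG", "*.png"]
names = ["a.jpg", "b.jpg", "c.png"]

sensitive = match_patterns(names, patterns, case_sensitive=True)
insensitive = match_patterns(names, patterns, case_sensitive=False)

print(len(sensitive), duplicated(sensitive))
print(len(insensitive), duplicated(insensitive))
```

To run the same check on a real folder, replace `names` with `os.listdir("data/traffic-sign/train/images")` (after `import os`) and compare the match count with the number of files in the folder.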

huyhuyvu01 commented 1 year ago

[image] Here is the corresponding directory. The paths are relative to the train.py file, which is in the FasterRCNN folder.

[image] The file names and the file structure of the labels are also correct.

[image] There is also another folder called test in the data directory, but I don't think it is the cause of the issue.

Here is the link to my dataset in case you need it: https://universe.roboflow.com/ictu/vietnam-traffic-signs-detection2/dataset/1

sovit-123 commented 1 year ago

Thanks. Will check it out.