I want to run my training process with the command:

!python train.py --data data_configs/data_training.yaml --epochs 40 --model fasterrcnn_mobilenetv3_large_fpn --project-dir fasterrcnn_mobilenetv3_large_fpn --seed 8

and I get an error in my program as follows:

2024-06-24 15:23:20.794655: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-24 15:23:20.794717: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-24 15:23:20.796062: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-24 15:23:20.803158: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-24 15:23:21.919523: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Not using distributed mode
wandb: Currently logged in as: pusatstudiaiunsulbar (pusatsudiaiusb). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.17.2
wandb: Run data is saved locally in /content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/wandb/run-20240624_152326-bw79izjd
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run expert-fire-4
wandb: ⭐️ View project at https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline
wandb: 🚀 View run at https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline/runs/bw79izjd
device cuda
Checking Labels and images...
100% 886/886 [00:00<00:00, 116878.55it/s]
Checking Labels and images...
0it [00:00, ?it/s]
Creating data loaders
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Number of training samples: 886
Number of validation samples: 0

Building model from scratch...

Layer (type (var_name)) Input Shape Output Shape Param #

FasterRCNN (FasterRCNN) [4, 3, 640, 640] [0, 4] --
├─GeneralizedRCNNTransform (transform) [4, 3, 640, 640] [4, 3, 640, 640] --
├─BackboneWithFPN (backbone) [4, 3, 640, 640] [4, 256, 10, 10] --
│ └─IntermediateLayerGetter (body) [4, 3, 640, 640] [4, 960, 20, 20] --
│ │ └─Conv2dNormActivation (0) [4, 3, 640, 640] [4, 16, 320, 320] (432)
│ │ └─InvertedResidual (1) [4, 16, 320, 320] [4, 16, 320, 320] (400)
│ │ └─InvertedResidual (2) [4, 16, 320, 320] [4, 24, 160, 160] (3,136)
│ │ └─InvertedResidual (3) [4, 24, 160, 160] [4, 24, 160, 160] (4,104)
│ │ └─InvertedResidual (4) [4, 24, 160, 160] [4, 40, 80, 80] (9,960)
│ │ └─InvertedResidual (5) [4, 40, 80, 80] [4, 40, 80, 80] (20,432)
│ │ └─InvertedResidual (6) [4, 40, 80, 80] [4, 40, 80, 80] (20,432)
│ │ └─InvertedResidual (7) [4, 40, 80, 80] [4, 80, 40, 40] 30,960
│ │ └─InvertedResidual (8) [4, 80, 40, 40] [4, 80, 40, 40] 33,800
│ │ └─InvertedResidual (9) [4, 80, 40, 40] [4, 80, 40, 40] 31,096
│ │ └─InvertedResidual (10) [4, 80, 40, 40] [4, 80, 40, 40] 31,096
│ │ └─InvertedResidual (11) [4, 80, 40, 40] [4, 112, 40, 40] 212,280
│ │ └─InvertedResidual (12) [4, 112, 40, 40] [4, 112, 40, 40] 383,208
│ │ └─InvertedResidual (13) [4, 112, 40, 40] [4, 160, 20, 20] 426,216
│ │ └─InvertedResidual (14) [4, 160, 20, 20] [4, 160, 20, 20] 793,200
│ │ └─InvertedResidual (15) [4, 160, 20, 20] [4, 160, 20, 20] 793,200
│ │ └─Conv2dNormActivation (16) [4, 160, 20, 20] [4, 960, 20, 20] 153,600
│ └─FeaturePyramidNetwork (fpn) [4, 160, 20, 20] [4, 256, 10, 10] --
│ │ └─ModuleList (inner_blocks) -- -- (recursive)
│ │ └─ModuleList (layer_blocks) -- -- (recursive)
│ │ └─ModuleList (inner_blocks) -- -- (recursive)
│ │ └─ModuleList (layer_blocks) -- -- (recursive)
│ │ └─LastLevelMaxPool (extra_blocks) [4, 256, 20, 20] [4, 256, 20, 20] --
├─RegionProposalNetwork (rpn) [4, 3, 640, 640] [0, 4] --
│ └─RPNHead (head) [4, 256, 20, 20] [4, 15, 20, 20] --
│ │ └─Sequential (conv) [4, 256, 20, 20] [4, 256, 20, 20] 590,080
│ │ └─Conv2d (cls_logits) [4, 256, 20, 20] [4, 15, 20, 20] 3,855
│ │ └─Conv2d (bbox_pred) [4, 256, 20, 20] [4, 60, 20, 20] 15,420
│ │ └─Sequential (conv) [4, 256, 20, 20] [4, 256, 20, 20] (recursive)
│ │ └─Conv2d (cls_logits) [4, 256, 20, 20] [4, 15, 20, 20] (recursive)
│ │ └─Conv2d (bbox_pred) [4, 256, 20, 20] [4, 60, 20, 20] (recursive)
│ │ └─Sequential (conv) [4, 256, 10, 10] [4, 256, 10, 10] (recursive)
│ │ └─Conv2d (cls_logits) [4, 256, 10, 10] [4, 15, 10, 10] (recursive)
│ │ └─Conv2d (bbox_pred) [4, 256, 10, 10] [4, 60, 10, 10] (recursive)
│ └─AnchorGenerator (anchor_generator) [4, 3, 640, 640] [13500, 4] --
├─RoIHeads (roi_heads) [4, 256, 20, 20] [0, 4] --
│ └─MultiScaleRoIAlign (box_roi_pool) [4, 256, 20, 20] [0, 256, 7, 7] --
│ └─TwoMLPHead (box_head) [0, 256, 7, 7] [0, 1024] --
│ │ └─Linear (fc6) [0, 12544] [0, 1024] 12,846,080
│ │ └─Linear (fc7) [0, 1024] [0, 1024] 1,049,600
│ └─FastRCNNPredictor (box_predictor) [0, 1024] [0, 3] --
│ │ └─Linear (cls_score) [0, 1024] [0, 3] 3,075
│ │ └─Linear (bbox_pred) [0, 1024] [0, 12] 12,300

Total params: 18,935,354
Trainable params: 18,876,458
Non-trainable params: 58,896
Total mult-adds (G): 11.49

Input size (MB): 19.66
Forward/backward pass size (MB): 1172.14
Params size (MB): 75.74
Estimated Total Size (MB): 1267.54

18,935,354 total parameters.
18,876,458 training parameters.
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
Epoch: [0] [ 0/222] eta: 0:11:02 lr: 0.000006 loss: 1.8196 (1.8196) loss_classifier: 1.4352 (1.4352) loss_box_reg: 0.3557 (0.3557) loss_objectness: 0.0227 (0.0227) loss_rpn_box_reg: 0.0060 (0.0060) time: 2.9830 data: 1.9134 max mem: 704
Epoch: [0] [100/222] eta: 0:00:22 lr: 0.000458 loss: 1.2597 (1.3672) loss_classifier: 0.5182 (0.6553) loss_box_reg: 0.7019 (0.6994) loss_objectness: 0.0014 (0.0098) loss_rpn_box_reg: 0.0025 (0.0027) time: 0.1611 data: 0.0257 max mem: 811
Epoch: [0] [200/222] eta: 0:00:03 lr: 0.000910 loss: 0.8597 (1.1901) loss_classifier: 0.2865 (0.5291) loss_box_reg: 0.5280 (0.6531) loss_objectness: 0.0006 (0.0057) loss_rpn_box_reg: 0.0013 (0.0023) time: 0.1735 data: 0.0235 max mem: 811
Epoch: [0] [221/222] eta: 0:00:00 lr: 0.001000 loss: 0.8436 (1.1645) loss_classifier: 0.3099 (0.5145) loss_box_reg: 0.5193 (0.6426) loss_objectness: 0.0005 (0.0053) loss_rpn_box_reg: 0.0012 (0.0022) time: 0.1591 data: 0.0203 max mem: 811
Epoch: [0] Total time: 0:00:34 (0.1552 s / it)
creating index...
index created!
Traceback (most recent call last):
  File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/train.py", line 571, in <module>
    main(args)
  File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/train.py", line 423, in main
    stats, val_pred_image = evaluate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/torch_utils/engine.py", line 136, in evaluate
    for images, targets in metric_logger.log_every(data_loader, 100, header):
  File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/torch_utils/utils.py", line 202, in log_every
    log(f"{header} Total time: {total_time_str} ({total_time / len(iterable):.4f} s / it)")
ZeroDivisionError: float division by zero
wandb: 🚀 View run expert-fire-4 at: https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline/runs/bw79izjd
wandb: ⭐️ View project at: https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240624_152326-bw79izjd/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
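
For reference, two parts of the log seem to go together: `Number of validation samples: 0` and the failing expression `total_time / len(iterable)` in `log_every`. If the validation loader is empty, `len(iterable)` is zero and the division fails. The snippet below is only a minimal standalone sketch of that failure and of one possible guard. `seconds_per_iteration` is a hypothetical helper, not a function from the repository, and a plain list stands in for the real DataLoader.

```python
def seconds_per_iteration(total_time, iterable):
    """Average seconds per iteration, or None when there is nothing to iterate.

    Hypothetical helper for illustration; it mirrors the expression
    total_time / len(iterable) from the traceback but guards the empty case.
    """
    n = len(iterable)
    if n == 0:
        return None
    return total_time / n


# Stand-in for a validation DataLoader that yields 0 batches (assumption).
empty_val_loader = []

# Reproduce the failure from the traceback in isolation.
try:
    0.5 / len(empty_val_loader)
    reproduced = False
except ZeroDivisionError:
    reproduced = True

print("ZeroDivisionError reproduced:", reproduced)
print("guarded result for empty loader:", seconds_per_iteration(0.5, empty_val_loader))
print("result for 222 training batches:", seconds_per_iteration(34.0, range(222)))
```

A guard like this would only hide the symptom. Since the second "Checking Labels and images..." pass found 0 items, the more likely fix is on the data side: checking that the validation image and label paths in data_configs/data_training.yaml point at existing files.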

sovit-123 / fasterrcnn-pytorch-training-pipeline

Build model using fasterrcnn_mobilenetv3_large_fpn #147
