I want to run my training process with the command:
!python train.py --data data_configs/data_training.yaml --epochs 40 --model fasterrcnn_mobilenetv3_large_fpn --project-dir fasterrcnn_mobilenetv3_large_fpn --seed 8
but I get the following error:
2024-06-24 15:23:20.794655: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-24 15:23:20.794717: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-24 15:23:20.796062: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-24 15:23:20.803158: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-24 15:23:21.919523: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Not using distributed mode
wandb: Currently logged in as: pusatstudiaiunsulbar (pusatsudiaiusb). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.2
wandb: Run data is saved locally in /content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/wandb/run-20240624_152326-bw79izjd
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run expert-fire-4
wandb: ⭐️ View project at https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline
wandb: 🚀 View run at https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline/runs/bw79izjd
device cuda
Checking Labels and images...
100% 886/886 [00:00<00:00, 116878.55it/s]
Checking Labels and images...
0it [00:00, ?it/s]
Creating data loaders
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Number of training samples: 886
Number of validation samples: 0
Building model from scratch...
Layer (type (var_name))                  Input Shape          Output Shape         Param #
FasterRCNN (FasterRCNN)                  [4, 3, 640, 640]     [0, 4]               --
├─GeneralizedRCNNTransform (transform)   [4, 3, 640, 640]     [4, 3, 640, 640]     --
├─BackboneWithFPN (backbone)             [4, 3, 640, 640]     [4, 256, 10, 10]     --
│ └─IntermediateLayerGetter (body)       [4, 3, 640, 640]     [4, 960, 20, 20]     --
│ │ └─Conv2dNormActivation (0)           [4, 3, 640, 640]     [4, 16, 320, 320]    (432)
│ │ └─InvertedResidual (1)               [4, 16, 320, 320]    [4, 16, 320, 320]    (400)
│ │ └─InvertedResidual (2)               [4, 16, 320, 320]    [4, 24, 160, 160]    (3,136)
│ │ └─InvertedResidual (3)               [4, 24, 160, 160]    [4, 24, 160, 160]    (4,104)
│ │ └─InvertedResidual (4)               [4, 24, 160, 160]    [4, 40, 80, 80]      (9,960)
│ │ └─InvertedResidual (5)               [4, 40, 80, 80]      [4, 40, 80, 80]      (20,432)
│ │ └─InvertedResidual (6)               [4, 40, 80, 80]      [4, 40, 80, 80]      (20,432)
│ │ └─InvertedResidual (7)               [4, 40, 80, 80]      [4, 80, 40, 40]      30,960
│ │ └─InvertedResidual (8)               [4, 80, 40, 40]      [4, 80, 40, 40]      33,800
│ │ └─InvertedResidual (9)               [4, 80, 40, 40]      [4, 80, 40, 40]      31,096
│ │ └─InvertedResidual (10)              [4, 80, 40, 40]      [4, 80, 40, 40]      31,096
│ │ └─InvertedResidual (11)              [4, 80, 40, 40]      [4, 112, 40, 40]     212,280
│ │ └─InvertedResidual (12)              [4, 112, 40, 40]     [4, 112, 40, 40]     383,208
│ │ └─InvertedResidual (13)              [4, 112, 40, 40]     [4, 160, 20, 20]     426,216
│ │ └─InvertedResidual (14)              [4, 160, 20, 20]     [4, 160, 20, 20]     793,200
│ │ └─InvertedResidual (15)              [4, 160, 20, 20]     [4, 160, 20, 20]     793,200
│ │ └─Conv2dNormActivation (16)          [4, 160, 20, 20]     [4, 960, 20, 20]     153,600
│ └─FeaturePyramidNetwork (fpn)          [4, 160, 20, 20]     [4, 256, 10, 10]     --
│ │ └─ModuleList (inner_blocks)          --                   --                   (recursive)
│ │ └─ModuleList (layer_blocks)          --                   --                   (recursive)
│ │ └─ModuleList (inner_blocks)          --                   --                   (recursive)
│ │ └─ModuleList (layer_blocks)          --                   --                   (recursive)
│ │ └─LastLevelMaxPool (extra_blocks)    [4, 256, 20, 20]     [4, 256, 20, 20]     --
├─RegionProposalNetwork (rpn)            [4, 3, 640, 640]     [0, 4]               --
│ └─RPNHead (head)                       [4, 256, 20, 20]     [4, 15, 20, 20]      --
│ │ └─Sequential (conv)                  [4, 256, 20, 20]     [4, 256, 20, 20]     590,080
│ │ └─Conv2d (cls_logits)                [4, 256, 20, 20]     [4, 15, 20, 20]      3,855
│ │ └─Conv2d (bbox_pred)                 [4, 256, 20, 20]     [4, 60, 20, 20]      15,420
│ │ └─Sequential (conv)                  [4, 256, 20, 20]     [4, 256, 20, 20]     (recursive)
│ │ └─Conv2d (cls_logits)                [4, 256, 20, 20]     [4, 15, 20, 20]      (recursive)
│ │ └─Conv2d (bbox_pred)                 [4, 256, 20, 20]     [4, 60, 20, 20]      (recursive)
│ │ └─Sequential (conv)                  [4, 256, 10, 10]     [4, 256, 10, 10]     (recursive)
│ │ └─Conv2d (cls_logits)                [4, 256, 10, 10]     [4, 15, 10, 10]      (recursive)
│ │ └─Conv2d (bbox_pred)                 [4, 256, 10, 10]     [4, 60, 10, 10]      (recursive)
│ └─AnchorGenerator (anchor_generator)   [4, 3, 640, 640]     [13500, 4]           --
├─RoIHeads (roi_heads)                   [4, 256, 20, 20]     [0, 4]               --
│ └─MultiScaleRoIAlign (box_roi_pool)    [4, 256, 20, 20]     [0, 256, 7, 7]       --
│ └─TwoMLPHead (box_head)                [0, 256, 7, 7]       [0, 1024]            --
│ │ └─Linear (fc6)                       [0, 12544]           [0, 1024]            12,846,080
│ │ └─Linear (fc7)                       [0, 1024]            [0, 1024]            1,049,600
│ └─FastRCNNPredictor (box_predictor)    [0, 1024]            [0, 3]               --
│ │ └─Linear (cls_score)                 [0, 1024]            [0, 3]               3,075
│ │ └─Linear (bbox_pred)                 [0, 1024]            [0, 12]              12,300
Total params: 18,935,354
Trainable params: 18,876,458
Non-trainable params: 58,896
Total mult-adds (G): 11.49
Input size (MB): 19.66
Forward/backward pass size (MB): 1172.14
Params size (MB): 75.74
Estimated Total Size (MB): 1267.54
18,935,354 total parameters.
18,876,458 training parameters.
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
Epoch: [0]  [  0/222]  eta: 0:11:02  lr: 0.000006  loss: 1.8196 (1.8196)  loss_classifier: 1.4352 (1.4352)  loss_box_reg: 0.3557 (0.3557)  loss_objectness: 0.0227 (0.0227)  loss_rpn_box_reg: 0.0060 (0.0060)  time: 2.9830  data: 1.9134  max mem: 704
Epoch: [0]  [100/222]  eta: 0:00:22  lr: 0.000458  loss: 1.2597 (1.3672)  loss_classifier: 0.5182 (0.6553)  loss_box_reg: 0.7019 (0.6994)  loss_objectness: 0.0014 (0.0098)  loss_rpn_box_reg: 0.0025 (0.0027)  time: 0.1611  data: 0.0257  max mem: 811
Epoch: [0]  [200/222]  eta: 0:00:03  lr: 0.000910  loss: 0.8597 (1.1901)  loss_classifier: 0.2865 (0.5291)  loss_box_reg: 0.5280 (0.6531)  loss_objectness: 0.0006 (0.0057)  loss_rpn_box_reg: 0.0013 (0.0023)  time: 0.1735  data: 0.0235  max mem: 811
Epoch: [0]  [221/222]  eta: 0:00:00  lr: 0.001000  loss: 0.8436 (1.1645)  loss_classifier: 0.3099 (0.5145)  loss_box_reg: 0.5193 (0.6426)  loss_objectness: 0.0005 (0.0053)  loss_rpn_box_reg: 0.0012 (0.0022)  time: 0.1591  data: 0.0203  max mem: 811
Epoch: [0] Total time: 0:00:34 (0.1552 s / it)
creating index...
index created!
Traceback (most recent call last):
  File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/train.py", line 571, in <module>
    main(args)
  File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/train.py", line 423, in main
    stats, val_pred_image = evaluate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/torch_utils/engine.py", line 136, in evaluate
    for images, targets in metric_logger.log_every(data_loader, 100, header):
  File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/torch_utils/utils.py", line 202, in log_every
    log(f"{header} Total time: {total_time_str} ({total_time / len(iterable):.4f} s / it)")
ZeroDivisionError: float division by zero
wandb: 🚀 View run expert-fire-4 at: https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline/runs/bw79izjd
wandb: ⭐️ View project at: https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240624_152326-bw79izjd/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
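For context on why the run fails: the log shows "Number of validation samples: 0" (the second "Checking Labels and images..." pass finds nothing), so the validation DataLoader handed to evaluate() is empty and the final print in log_every divides total_time by len(iterable), which is zero. The likely root cause is that the validation image/label paths in data_training.yaml point to an empty or wrong directory. Below is a minimal sketch of that failure mode with a defensive guard; the helper is illustrative and simplified, not the repository's actual log_every implementation.

import datetime
import time

def log_every(iterable, print_freq, header=""):
    # Simplified stand-in for MetricLogger.log_every in torch_utils/utils.py
    # (names and structure here are assumptions, not the pipeline's exact code).
    start_time = time.time()
    for i, obj in enumerate(iterable):
        if i % print_freq == 0:
            print(f"{header} [{i}/{len(iterable)}]")
        yield obj
    total_time = time.time() - start_time
    total_time_str = str(datetime.timedelta(seconds=int(total_time)))
    if len(iterable) > 0:
        print(f"{header} Total time: {total_time_str} "
              f"({total_time / len(iterable):.4f} s / it)")
    else:
        # Guard for an empty loader; without it the division above raises
        # ZeroDivisionError: float division by zero, as in the traceback.
        print(f"{header} Total time: {total_time_str} (0 items)")

# With zero validation samples the loop body never runs, len(iterable) == 0,
# and only the guarded branch keeps the print from dividing by zero:
for _ in log_every([], print_freq=100, header="Test:"):
    pass

Even with such a guard, evaluation on an empty dataset is meaningless, so the more direct fix is probably to make the validation paths in data_training.yaml point at a non-empty directory (or split some of the 886 training samples off for validation) so that "Number of validation samples" is greater than 0 before evaluate() runs.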