I am trying to train maskRcnn model using detectron2 on my custom LVO deteset. My dataset is a single class dataset and some of the image have no annotation in it. The architecture need to learn negative examples as well for proper training as the test data contains both positive and negative lvo cases. I have segmentation annotation in coco format and have registered it using CocoRegistration.
When I try to train the maskrcnn model the overall loss decreases but the loss_box_reg increases, and the prediction results bounding box have scores less then 0.1 for every cases (even positive cases). Why is this happening.
How to reproduce this error:
cfg = get_cfg()
# cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_101_FPN_3x.yaml"))
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("train",)
cfg.DATASETS.TEST = () # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 2
cfg.INPUT.MAX_SIZE_TRAIN = 512 # every training image have size 512
cfg.INPUT.MIN_SIZE_TRAIN = (512,)
cfg.INPUT.MAX_SIZE_TEST = 512
cfg.INPUT.MIN_SIZE_TEST = 512
cfg.INPUT.MASK_FORMAT = "bitmask"
# cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/retinanet_R_101_FPN_3x.yaml") # initialize from model zoo
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 2000
cfg.SOLVER.CHECKPOINT_PERIOD = 200
cfg.SOLVER.STEPS = [] # do not decay learning rate
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512 # faster, and good enough for this toy dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1 # only has one class (ballon)
cfg.DATALOADER.FILTER_EMPTY_ANNOTATIONS = False
cfg.OUTPUT_DIR = out_dir
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
My positive and negative dataset sample
Annotation example:
My prediction scoreexample for positive cases:
scores: tensor([0.0901, 0.0862, 0.0737, 0.0697, 0.0679, 0.0670, 0.0668, 0.0665, 0.0664, ........])
Help me solve this problem
Versions
Versions:
PyTorch version: 2.0.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Red Hat Enterprise Linux 9.4 (Plow) (x86_64)
GCC version: (GCC) 11.3.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.34
Python version: 3.9.18 (main, May 16 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (64-bit runtime)
Python platform: Linux-5.14.0-427.18.1.el9_4.x86_64-x86_64-with-glibc2.34
Is CUDA available: False
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
Stepping: 6
CPU(s) scaling MHz: 100%
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5800.00
Flags: -------some giberish-----------
Virtualization: VT-x
L1d cache: 1.5 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 40 MiB (32 instances)
L3 cache: 48 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] torch==2.0.0
[pip3] torchvision==0.15.1
[pip3] triton==2.0.0
[conda] Could not collect
Sorry, this is a bug tracker and we're unable to help with open-ended questions like this one. You might be able to find some help on https://discuss.pytorch.org/
🐛 Describe the bug
I am trying to train maskRcnn model using detectron2 on my custom LVO deteset. My dataset is a single class dataset and some of the image have no annotation in it. The architecture need to learn negative examples as well for proper training as the test data contains both positive and negative lvo cases. I have segmentation annotation in coco format and have registered it using CocoRegistration. When I try to train the maskrcnn model the overall loss decreases but the loss_box_reg increases, and the prediction results bounding box have scores less then 0.1 for every cases (even positive cases). Why is this happening.
How to reproduce this error:
My positive and negative dataset sample Annotation example:
Issue: Total loss:
Loss_box_reg:
My prediction scoreexample for positive cases: scores: tensor([0.0901, 0.0862, 0.0737, 0.0697, 0.0679, 0.0670, 0.0668, 0.0665, 0.0664, ........])
Help me solve this problem
Versions
Versions: PyTorch version: 2.0.0+cu117 Is debug build: False CUDA used to build PyTorch: 11.7 ROCM used to build PyTorch: N/A
OS: Red Hat Enterprise Linux 9.4 (Plow) (x86_64) GCC version: (GCC) 11.3.0 Clang version: Could not collect CMake version: version 3.28.3 Libc version: glibc-2.34
Python version: 3.9.18 (main, May 16 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (64-bit runtime) Python platform: Linux-5.14.0-427.18.1.el9_4.x86_64-x86_64-with-glibc2.34 Is CUDA available: False CUDA runtime version: 11.7.64 CUDA_MODULE_LOADING set to: N/A GPU models and configuration: Could not collect Nvidia driver version: Could not collect cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True
CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 46 bits physical, 57 bits virtual Byte Order: Little Endian CPU(s): 64 On-line CPU(s) list: 0-63 Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz CPU family: 6 Model: 106 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 2 Stepping: 6 CPU(s) scaling MHz: 100% CPU max MHz: 3500.0000 CPU min MHz: 800.0000 BogoMIPS: 5800.00 Flags: -------some giberish----------- Virtualization: VT-x L1d cache: 1.5 MiB (32 instances) L1i cache: 1 MiB (32 instances) L2 cache: 40 MiB (32 instances) L3 cache: 48 MiB (2 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected
Versions of relevant libraries: [pip3] mypy-extensions==1.0.0 [pip3] numpy==1.26.4 [pip3] torch==2.0.0 [pip3] torchvision==0.15.1 [pip3] triton==2.0.0 [conda] Could not collect