pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License
16.26k stars 6.96k forks source link

loss_box_reg increasing while training mask rcnn #8608

Closed ArpanGyawali closed 2 months ago

ArpanGyawali commented 2 months ago

🐛 Describe the bug

I am trying to train maskRcnn model using detectron2 on my custom LVO deteset. My dataset is a single class dataset and some of the image have no annotation in it. The architecture need to learn negative examples as well for proper training as the test data contains both positive and negative lvo cases. I have segmentation annotation in coco format and have registered it using CocoRegistration. When I try to train the maskrcnn model the overall loss decreases but the loss_box_reg increases, and the prediction results bounding box have scores less then 0.1 for every cases (even positive cases). Why is this happening.

How to reproduce this error:

cfg = get_cfg()
# cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_101_FPN_3x.yaml"))
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("train",)
cfg.DATASETS.TEST = ()   # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 2
cfg.INPUT.MAX_SIZE_TRAIN = 512         # every training image have size 512
cfg.INPUT.MIN_SIZE_TRAIN = (512,)
cfg.INPUT.MAX_SIZE_TEST = 512
cfg.INPUT.MIN_SIZE_TEST = 512
cfg.INPUT.MASK_FORMAT = "bitmask"
# cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/retinanet_R_101_FPN_3x.yaml")  # initialize from model zoo
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 2000
cfg.SOLVER.CHECKPOINT_PERIOD = 200
cfg.SOLVER.STEPS = []        # do not decay learning rate
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512   # faster, and good enough for this toy dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # only has one class (ballon)
cfg.DATALOADER.FILTER_EMPTY_ANNOTATIONS = False
cfg.OUTPUT_DIR = out_dir
trainer = DefaultTrainer(cfg) 
trainer.resume_or_load(resume=False)
trainer.train()

My positive and negative dataset sample image Annotation example:

 {
      "id": 80,
      "image_id": 180,
      "category_id": 1,
      "segmentation": {
        "counts": [
          large list
        ],
        "size": [512, 512]
      },
      "area": 247.0,
      "bbox": [302.0, 227.0, 24.0, 13.0],
      "iscrowd": 0,
      "attributes": {  "occluded": false
      }},

Issue: Total loss: image

Loss_box_reg: image

My prediction scoreexample for positive cases: scores: tensor([0.0901, 0.0862, 0.0737, 0.0697, 0.0679, 0.0670, 0.0668, 0.0665, 0.0664, ........])

Help me solve this problem

Versions

Versions: PyTorch version: 2.0.0+cu117 Is debug build: False CUDA used to build PyTorch: 11.7 ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux 9.4 (Plow) (x86_64) GCC version: (GCC) 11.3.0 Clang version: Could not collect CMake version: version 3.28.3 Libc version: glibc-2.34

Python version: 3.9.18 (main, May 16 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (64-bit runtime) Python platform: Linux-5.14.0-427.18.1.el9_4.x86_64-x86_64-with-glibc2.34 Is CUDA available: False CUDA runtime version: 11.7.64 CUDA_MODULE_LOADING set to: N/A GPU models and configuration: Could not collect Nvidia driver version: Could not collect cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True

CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 46 bits physical, 57 bits virtual Byte Order: Little Endian CPU(s): 64 On-line CPU(s) list: 0-63 Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz CPU family: 6 Model: 106 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 2 Stepping: 6 CPU(s) scaling MHz: 100% CPU max MHz: 3500.0000 CPU min MHz: 800.0000 BogoMIPS: 5800.00 Flags: -------some giberish----------- Virtualization: VT-x L1d cache: 1.5 MiB (32 instances) L1i cache: 1 MiB (32 instances) L2 cache: 40 MiB (32 instances) L3 cache: 48 MiB (2 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected

Versions of relevant libraries: [pip3] mypy-extensions==1.0.0 [pip3] numpy==1.26.4 [pip3] torch==2.0.0 [pip3] torchvision==0.15.1 [pip3] triton==2.0.0 [conda] Could not collect

NicolasHug commented 2 months ago

Hi @ArpanGyawali ,

Sorry, this is a bug tracker and we're unable to help with open-ended questions like this one. You might be able to find some help on https://discuss.pytorch.org/