hustvl / SparseInst

[CVPR 2022] SparseInst: Sparse Instance Activation for Real-Time Instance Segmentation

Getting matrix contains invalid numeric entries error #66

Open sarmientoj24 opened 2 years ago

sarmientoj24 commented 2 years ago

When trying SparseInst with a ViT-based backbone (PVT), I get this error:

  File "/home/user/anaconda3/envs/detectron2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/ard/SparseInst/sparseinst/loss.py", line 301, in forward
    indices = [linear_sum_assignment(c[i], maximize=True)
  File "/home/user/ard/SparseInst/sparseinst/loss.py", line 301, in <listcomp>
    indices = [linear_sum_assignment(c[i], maximize=True)
ValueError: matrix contains invalid numeric entries
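
A quick check like the following (hypothetical helper, not part of the repo) confirms whether the cost matrix reaching scipy contains non-finite entries; the nan_to_num call is only for inspection, not a fix:

import numpy as np
from scipy.optimize import linear_sum_assignment

def checked_assignment(cost_matrices):
    # Hypothetical diagnostic wrapper around the matcher's Hungarian step.
    indices = []
    for i, c in enumerate(cost_matrices):
        c = np.asarray(c, dtype=np.float64)
        if not np.isfinite(c).all():
            # Non-finite entries usually come from overflow/underflow upstream
            # in the class/mask scores (e.g. under FP16 autocast).
            print(f"cost matrix {i}: {np.isnan(c).sum()} NaN, {np.isinf(c).sum()} Inf entries")
            c = np.nan_to_num(c, nan=0.0, posinf=0.0, neginf=0.0)  # inspection only
        indices.append(linear_sum_assignment(c, maximize=True))
    return indices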

Here's the config printed

[07/29 18:08:15 detectron2]: Command line arguments: Namespace(config_file='configs/sparse_inst_pvt_b2_li_giam.yaml', dist_url='tcp://127.0.0.1:50153', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=['SOLVER.AMP.ENABLED', 'True'], resume=False)
[07/29 18:08:15 detectron2]: Contents of args.config_file=configs/sparse_inst_pvt_b2_li_giam.yaml:
_BASE_: "Base-SparseInst.yaml"
MODEL:
  WEIGHTS: "pretrained_models/pvt_v2_b2_li.pth"
  BACKBONE:
    NAME: "build_pyramid_vision_transformer"
  SPARSE_INST:
    ENCODER:
      IN_FEATURES: ["p2", "p3", "p4"]
  PVT:
    NAME: "b2"
    LINEAR: True
    OUT_FEATURES: ["p2", "p3", "p4"]
OUTPUT_DIR: "output/sparse_inst_pvt_b2_linear_giam"

[07/29 18:08:15 detectron2]: Running with full config:
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: true
  NUM_WORKERS: 6
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:
  - bipacolortest
  TRAIN:
  - bipacolortrain
GLOBAL:
  HACK: 1.0
INPUT:
  CROP:
    ENABLED: false
    SIZE:
    - 0.9
    - 0.9
    TYPE: relative_range
  FORMAT: RGB
  MASK_FORMAT: bitmask
  MAX_SIZE_TEST: 853
  MAX_SIZE_TRAIN: 853
  MIN_SIZE_TEST: 640
  MIN_SIZE_TRAIN:
  - 416
  - 448
  - 480
  - 512
  - 544
  - 576
  - 608
  - 640
  MIN_SIZE_TRAIN_SAMPLING: choice
  RANDOM_FLIP: horizontal
MODEL:
  ANCHOR_GENERATOR:
    ANGLES:
    - - -90
      - 0
      - 90
    ASPECT_RATIOS:
    - - 0.5
      - 1.0
      - 2.0
    NAME: DefaultAnchorGenerator
    OFFSET: 0.0
    SIZES:
    - - 32
      - 64
      - 128
      - 256
      - 512
  BACKBONE:
    FREEZE_AT: 0
    NAME: build_pyramid_vision_transformer
  CSPNET:
    NAME: darknet53
    NORM: ''
    OUT_FEATURES:
    - csp1
    - csp2
    - csp3
    - csp4
  DEVICE: cuda
  FPN:
    FUSE_TYPE: sum
    IN_FEATURES: []
    NORM: ''
    OUT_CHANNELS: 256
  KEYPOINT_ON: false
  LOAD_PROPOSALS: false
  MASK_ON: true
  META_ARCHITECTURE: SparseInst
  PANOPTIC_FPN:
    COMBINE:
      ENABLED: true
      INSTANCES_CONFIDENCE_THRESH: 0.5
      OVERLAP_THRESH: 0.5
      STUFF_AREA_LIMIT: 4096
    INSTANCE_LOSS_WEIGHT: 1.0
  PIXEL_MEAN:
  - 123.675
  - 116.28
  - 103.53
  PIXEL_STD:
  - 58.395
  - 57.12
  - 57.375
  PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
  PVT:
    LINEAR: true
    NAME: b2
    OUT_FEATURES:
    - p2
    - p3
    - p4
  RESNETS:
    DEFORM_MODULATED: false
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE:
    - false
    - false
    - false
    - false
    DEPTH: 50
    NORM: FrozenBN
    NUM_GROUPS: 1
    OUT_FEATURES:
    - res3
    - res4
    - res5
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: false
    WIDTH_PER_GROUP: 64
  RETINANET:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_WEIGHTS: &id002
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    FOCAL_LOSS_ALPHA: 0.25
    FOCAL_LOSS_GAMMA: 2.0
    IN_FEATURES:
    - p3
    - p4
    - p5
    - p6
    - p7
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.4
    - 0.5
    NMS_THRESH_TEST: 0.5
    NORM: ''
    NUM_CLASSES: 80
    NUM_CONVS: 4
    PRIOR_PROB: 0.01
    SCORE_THRESH_TEST: 0.05
    SMOOTH_L1_LOSS_BETA: 0.1
    TOPK_CANDIDATES_TEST: 1000
  ROI_BOX_CASCADE_HEAD:
    BBOX_REG_WEIGHTS:
    - &id001
      - 10.0
      - 10.0
      - 5.0
      - 5.0
    - - 20.0
      - 20.0
      - 10.0
      - 10.0
    - - 30.0
      - 30.0
      - 15.0
      - 15.0
    IOUS:
    - 0.5
    - 0.6
    - 0.7
  ROI_BOX_HEAD:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: *id001
    CLS_AGNOSTIC_BBOX_REG: false
    CONV_DIM: 256
    FC_DIM: 1024
    NAME: ''
    NORM: ''
    NUM_CONV: 0
    NUM_FC: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
    SMOOTH_L1_BETA: 0.0
    TRAIN_ON_PRED_BOXES: false
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    IN_FEATURES:
    - res4
    IOU_LABELS:
    - 0
    - 1
    IOU_THRESHOLDS:
    - 0.5
    NAME: Res5ROIHeads
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 80
    POSITIVE_FRACTION: 0.25
    PROPOSAL_APPEND_GT: true
    SCORE_THRESH_TEST: 0.05
  ROI_KEYPOINT_HEAD:
    CONV_DIMS:
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    LOSS_WEIGHT: 1.0
    MIN_KEYPOINTS_PER_IMAGE: 1
    NAME: KRCNNConvDeconvUpsampleHead
    NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
    NUM_KEYPOINTS: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  ROI_MASK_HEAD:
    CLS_AGNOSTIC_MASK: false
    CONV_DIM: 256
    NAME: MaskRCNNConvUpsampleHead
    NORM: ''
    NUM_CONV: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  RPN:
    BATCH_SIZE_PER_IMAGE: 256
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: *id002
    BOUNDARY_THRESH: -1
    CONV_DIMS:
    - -1
    HEAD_NAME: StandardRPNHead
    IN_FEATURES:
    - res4
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.3
    - 0.7
    LOSS_WEIGHT: 1.0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOPK_TEST: 1000
    POST_NMS_TOPK_TRAIN: 2000
    PRE_NMS_TOPK_TEST: 6000
    PRE_NMS_TOPK_TRAIN: 12000
    SMOOTH_L1_BETA: 0.0
  SEM_SEG_HEAD:
    COMMON_STRIDE: 4
    CONVS_DIM: 128
    IGNORE_VALUE: 255
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    LOSS_WEIGHT: 1.0
    NAME: SemSegFPNHead
    NORM: GN
    NUM_CLASSES: 54
  SPARSE_INST:
    CLS_THRESHOLD: 0.005
    DATASET_MAPPER: SparseInstDatasetMapper
    DECODER:
      GROUPS: 4
      INST:
        CONVS: 4
        DIM: 256
      KERNEL_DIM: 128
      MASK:
        CONVS: 4
        DIM: 256
      NAME: GroupIAMDecoder
      NUM_CLASSES: 10
      NUM_MASKS: 100
      OUTPUT_IAM: false
      SCALE_FACTOR: 2.0
    ENCODER:
      IN_FEATURES:
      - p2
      - p3
      - p4
      NAME: InstanceContextEncoder
      NORM: ''
      NUM_CHANNELS: 256
    LOSS:
      CLASS_WEIGHT: 2.0
      ITEMS:
      - labels
      - masks
      MASK_DICE_WEIGHT: 2.0
      MASK_PIXEL_WEIGHT: 5.0
      NAME: SparseInstCriterion
      OBJECTNESS_WEIGHT: 1.0
    MASK_THRESHOLD: 0.45
    MATCHER:
      ALPHA: 0.8
      BETA: 0.2
      NAME: SparseInstMatcher
    MAX_DETECTIONS: 100
  WEIGHTS: sparse_inst_pvt_v2_b2_li_giam_02e25d.pth
OUTPUT_DIR: output/sparse_inst_pvt_b2_linear_giam
SEED: -1
SOLVER:
  AMP:
    ENABLED: true
  AMSGRAD: false
  BACKBONE_MULTIPLIER: 1.0
  BASE_LR: 5.0e-05
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 5000
  CLIP_GRADIENTS:
    CLIP_TYPE: value
    CLIP_VALUE: 1.0
    ENABLED: false
    NORM_TYPE: 2.0
  GAMMA: 0.1
  IMS_PER_BATCH: 32
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  MAX_ITER: 1500
  MOMENTUM: 0.9
  NESTEROV: false
  OPTIMIZER: ADAMW
  REFERENCE_WORLD_SIZE: 0
  STEPS:
  - 1166
  - 1388
  WARMUP_FACTOR: 0.001
  WARMUP_ITERS: 1000
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.05
  WEIGHT_DECAY_BIAS: null
  WEIGHT_DECAY_NORM: 0.0
TEST:
  AUG:
    ENABLED: false
    FLIP: true
    MAX_SIZE: 4000
    MIN_SIZES:
    - 400
    - 500
    - 600
    - 700
    - 800
    - 900
    - 1000
    - 1100
    - 1200
  DETECTIONS_PER_IMAGE: 100
  EVAL_PERIOD: 60
  EXPECTED_RESULTS: []
  KEYPOINT_OKS_SIGMAS: []
  PRECISE_BN:
    ENABLED: false
    NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0
wondervictor commented 2 years ago

Hi @sarmientoj24, thanks for your interest in SparseInst. Have you loaded any pretrained weights?

sarmientoj24 commented 2 years ago

Yes

wondervictor commented 2 years ago

Could you provide the log of the training process?

BryantGary commented 2 years ago

Could you provide the log of the training process?

Hello, I met the same problem as well when I tried to change the optimizer from ADAMW to SGD. [screenshot]

koolvn commented 2 years ago

I ran into this problem too. The cause is SOLVER.AMP.ENABLED: True.

Set it to False (i.e., train in FP32) and the error disappears.

I've tried to debug it but wasn't able to find a fix.
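
The link to AMP is easy to reproduce in isolation: FP16 has a maximum finite value of about 65504, so values that are harmless in FP32 can overflow to Inf and then combine into NaN, which is exactly what linear_sum_assignment rejects. A standalone illustration (not SparseInst code):

import torch

# FP16 (what AMP casts many ops to) overflows far earlier than FP32:
# its largest finite value is about 65504.
a = torch.tensor([300.0], dtype=torch.float16)
print(a * a)                 # tensor([inf], dtype=torch.float16): 90000 overflows

# Inf entries then turn into NaN through ordinary arithmetic (inf - inf),
# which is the kind of entry linear_sum_assignment refuses to handle.
overflow = a * a
print(overflow - overflow)   # tensor([nan], dtype=torch.float16)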

Hellller commented 2 years ago

I haven't loaded any pre-trained weights, and the problem still occurs.

wondervictor commented 2 years ago

Hi all, I've found that the sigmoid + norm in the decoder can produce NaN errors when FP16 is enabled. In the latest update, we provide a softmax version of the decoder to avoid these numerical errors, and it supports FP16 better than sigmoid + norm. Sorry for the late reply; I hope this suggestion helps.
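
For context, here is a simplified sketch of the two normalization schemes (illustrative only, not the repo's exact decoder code); the sigmoid + norm path corresponds to the default GroupIAMDecoder shown in the config above, the softmax path to the newer decoder:

import torch

def iam_sigmoid_norm(iam_logits):
    # Sigmoid + explicit normalization: the division by the per-instance
    # spatial sum can overflow/underflow under FP16 and produce Inf/NaN.
    iam = iam_logits.sigmoid()
    return iam / (iam.sum(dim=-1, keepdim=True) + 1e-6)

def iam_softmax(iam_logits):
    # Softmax subtracts the max internally, so it stays finite under FP16
    # and the weights sum to 1 by construction.
    return iam_logits.softmax(dim=-1)

# Both take instance activation maps of shape (num_masks, H * W).
logits = torch.randn(100, 64 * 64)
print(iam_softmax(logits).sum(dim=-1)[:3])   # each row sums to 1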

kirillkoncha commented 2 years ago

Hi all, I've found that the sigmoid + norm in the decoder can produce NaN errors when FP16 is enabled. In the latest update, we provide a softmax version of the decoder to avoid these numerical errors, and it supports FP16 better than sigmoid + norm. Sorry for the late reply; I hope this suggestion helps.

I'm getting the same problem now. I am using pre-trained weights and trying to train the R-50-vd-DCN model. Are there additional steps needed to use the new softmax version?

Spritaro commented 2 years ago

It seems sigmoid + norm is used by default. Adding MODEL.SPARSE_INST.DECODER.NAME GroupIAMSoftDecoder to the command line solved the problem for me.
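
For anyone unsure where that override goes: the trailing key/value pairs on the command line (the opts shown in the log at the top of this issue) are merged into the config. A minimal standalone sketch of the effect, with the relevant keys rebuilt by hand for illustration (in a real run they come from the repo's config defaults and the YAML file):

from yacs.config import CfgNode as CN

# Only the relevant keys are rebuilt here for illustration; detectron2 configs
# are yacs-style CfgNodes, so merge_from_list behaves the same way in training.
cfg = CN()
cfg.MODEL = CN()
cfg.MODEL.SPARSE_INST = CN()
cfg.MODEL.SPARSE_INST.DECODER = CN()
cfg.MODEL.SPARSE_INST.DECODER.NAME = "GroupIAMDecoder"   # default (sigmoid + norm)
cfg.SOLVER = CN()
cfg.SOLVER.AMP = CN()
cfg.SOLVER.AMP.ENABLED = True

# The trailing command-line arguments are applied as alternating key/value pairs.
cfg.merge_from_list(["MODEL.SPARSE_INST.DECODER.NAME", "GroupIAMSoftDecoder"])
print(cfg.MODEL.SPARSE_INST.DECODER.NAME)    # GroupIAMSoftDecoder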

MrL-CV commented 1 year ago

It seems sigmoid + norm is used by default. Adding MODEL.SPARSE_INST.DECODER.NAME GroupIAMSoftDecoder to the command line solved the problem for me.

It still does not work...