thangvubk / SoftGroup

[CVPR 2022 Oral] SoftGroup for Instance Segmentation on 3D Point Clouds

Why are the training results worse after adding weights? Is there something wrong with the way I added the weights? #128

Open Fan-QY opened 2 years ago

Fan-QY commented 2 years ago

Hello, I want to use SoftGroup for instance segmentation on my own dataset, which I have converted to the S3DIS format. My PC has a Titan Xp (12 GB) GPU, and my original dataset has only 5 scenes (5 files for training). Since the individual scenes are too large and complex, I sliced them so that each slice contains 4 million points, which resulted in 44 training files. The point distribution across classes is very uneven, with large gaps between classes. The number of training points and instances per class over the 44 training files is as follows:

Total number of points per class: [10152406, 39686946, 9433358, 1552109, 92844212, 2696146, 498094, 7429789]
Number of instances per class: [139, 197, 433, 36, 483, 16, 92, 41]
class_numpoint_mean: [73038, 201456, 21786, 43114, 192224, 168509, 5414, 181214]
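
(For clarity: class_numpoint_mean is just the per-class total points divided by the per-class instance count, up to small rounding differences. A quick sketch to reproduce it:)

points = [10152406, 39686946, 9433358, 1552109, 92844212, 2696146, 498094, 7429789]
instances = [139, 197, 433, 36, 483, 16, 92, 41]
print([p // i for p, i in zip(points, instances)])
# -> [73039, 201456, 21786, 43114, 192224, 168509, 5414, 181214]
# (matches the list above except the first value, which is off by one)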

My backbone training configuration file is as follows:

model:
  channels: 32
  num_blocks: 7
  semantic_classes: 8
  instance_classes: 8
  sem2ins_classes: []
  semantic_only: True
  ignore_label: -100
  grouping_cfg:
    score_thr: 0.2
    radius: 0.05
    mean_active: 300
    class_numpoint_mean: [73038, 201456, 21786, 43114, 192224, 168509, 5414, 181214]
    npoint_thr: 0.05  # absolute if class_numpoint == -1, relative if class_numpoint != -1
    ignore_classes: []
  instance_voxel_cfg:
    scale: 25
    spatial_shape: 20
  train_cfg:
    max_proposal_num: 200
    pos_iou_thr: 0.5
  test_cfg:
    x4_split: True
    cls_score_thr: 0.001
    mask_score_thr: -0.5
    min_npoint: 100
  fixed_modules: []

data:
  train:
    type: 's3dis'
    data_root: './dataset/Argos_2step/preprocess'
    prefix: ['TRAIN']
    suffix: '_inst_nostuff.pth'
    repeat: 20
    training: True
    voxel_cfg:
      scale: 25
      spatial_shape: [128, 512]
      max_npoint: 250000
      min_npoint: 5000
  test:
    type: 's3dis'
    data_root: './dataset/Argos_2step/preprocess'
    prefix: 'TEST_Val'
    suffix: '_inst_nostuff.pth'
    training: False
    voxel_cfg:
      scale: 25
      spatial_shape: [128, 512]
      max_npoint: 250000
      min_npoint: 5000

dataloader:
  train:
    batch_size: 2
    num_workers: 0
  test:
    batch_size: 1
    num_workers: 1

optimizer:
  type: 'Adam'
  lr: 0.001

save_cfg:
  semantic: True
  offset: True
  instance: False

fp16: False
epochs: 50
step_epoch: 0
save_freq: 1
pretrain: './12Class_argos_epoch_91.pth'
work_dir: ''

(The pretrain file is the checkpoint from an earlier 12-class run on the same dataset; there, 4 classes had an IoU of 0.0% because the points were divided very unevenly across the 12 classes, so I reduced the dataset from 12 classes to 8.) The best results obtained are shown below; two of the categories were not recognized at all.

2022-08-24 08:57:02,276 - INFO - Epoch [40/50][430/440]  lr: 0.00012, eta: 14:48:39, mem: 3992, data_time: 7.29, iter_time: 7.64, semantic_loss: 0.0576, offset_loss: 0.8793, loss: 0.9369
2022-08-24 08:58:52,630 - INFO - Epoch [40/50][440/440]  lr: 0.00012, eta: 14:44:53, mem: 3992, data_time: 14.35, iter_time: 14.59, semantic_loss: 0.9506, offset_loss: 1.2817, loss: 2.2323
2022-08-24 08:58:53,096 - INFO - Validation
2022-08-24 08:59:41,476 - INFO - Evaluate semantic segmentation and offset MAE
2022-08-24 08:59:45,174 - INFO - Class-wise mIoU: 40.5 74.9 21.7 0.1 97.4 22.8 0.0 38.9
2022-08-24 08:59:45,174 - INFO - mIoU: 37.0
2022-08-24 08:59:46,006 - INFO - Acc: 96.6
2022-08-24 08:59:52,687 - INFO - Offset MAE: 3.000

To counter the class imbalance, I determined the weights following the method from the website below:
https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/

w_j = n_samples / (n_classes * n_samples_j)
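
A minimal sketch of that computation (which counts to use as n_samples_j is a choice; with the per-class point totals above, the output need not match my semantic_weight list exactly):

import numpy as np

# w_j = n_samples / (n_classes * n_samples_j): rarer classes get larger weights.
points_per_class = np.array(
    [10152406, 39686946, 9433358, 1552109, 92844212, 2696146, 498094, 7429789],
    dtype=np.float64)
weights = points_per_class.sum() / (len(points_per_class) * points_per_class)
print(weights.round(4))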

The new training run was set up from scratch with the following configuration:

model:
  channels: 32
  num_blocks: 7
  semantic_classes: 8
  instance_classes: 8
  sem2ins_classes: []
  semantic_only: True
  semantic_weight: [0.1164, 0.0210, 0.0402, 2.9403, 0.0037, 3.8085, 3.5853, 0.5393]  
  ignore_label: -100
  grouping_cfg:
    score_thr: 0.2
    radius: 0.05
    mean_active: 300
    class_numpoint_mean: [73038, 201456, 21786, 43114, 192224, 168509, 5414, 181214]
    npoint_thr: 0.05  # absolute if class_numpoint == -1, relative if class_numpoint != -1
    ignore_classes: []
  instance_voxel_cfg:
    scale: 25
    spatial_shape: 20
  train_cfg:
    max_proposal_num: 200
    pos_iou_thr: 0.5
  test_cfg:
    x4_split: True
    cls_score_thr: 0.001
    mask_score_thr: -0.5
    min_npoint: 100
  fixed_modules: []

data:
  train:
    type: 'argos2step'
    data_root: './dataset/Argos_2step/preprocess'
    prefix: ['TRAIN']
    suffix: '_inst_nostuff.pth'
    repeat: 20
    training: True
    voxel_cfg:
      scale: 25
      spatial_shape: [128, 512]
      max_npoint: 250000
      min_npoint: 5000
  test:
    type: 'argos2step'
    data_root: './dataset/Argos_2step/preprocess'
    prefix: 'TEST_Val'
    suffix: '_inst_nostuff.pth'
    training: False
    voxel_cfg:
      scale: 25
      spatial_shape: [128, 512]
      max_npoint: 250000
      min_npoint: 5000

dataloader:
  train:
    batch_size: 2
    num_workers: 0
  test:
    batch_size: 1
    num_workers: 1

optimizer:
  type: 'Adam'
  lr: 0.001

save_cfg:
  semantic: True
  offset: True
  instance: False

fp16: False
epochs: 50
step_epoch: 0
save_freq: 1
pretrain: './12Class_argos_epoch_91.pth' 
work_dir: ''

The best result is shown below; it is worse than the result of the first backbone training.

2022-08-28 05:43:21,214 - INFO - Epoch [35/50][430/440]  lr: 0.00023, eta: 23:19:59, mem: 4074, data_time: 9.09, iter_time: 9.45, semantic_loss: 0.1890, offset_loss: 0.9530, loss: 1.1420
2022-08-28 05:45:19,052 - INFO - Epoch [35/50][440/440]  lr: 0.00023, eta: 23:15:33, mem: 4074, data_time: 10.31, iter_time: 10.71, semantic_loss: 0.6147, offset_loss: 1.5111, loss: 2.1258
2022-08-28 05:45:19,471 - INFO - Validation
2022-08-28 05:46:11,173 - INFO - Evaluate semantic segmentation and offset MAE
2022-08-28 05:46:14,921 - INFO - Class-wise mIoU: 1.4 46.2 11.9 1.7 69.5 3.4 7.5 11.2
2022-08-28 05:46:14,921 - INFO - mIoU: 19.1
2022-08-28 05:46:15,757 - INFO - Acc: 68.0

I then used the best checkpoint from the first training (because of its higher mIoU) for the instance training. I think the model cannot learn anything, and training eventually crashes with the following error:

2022-09-05 05:54:26,317 - INFO - Epoch [9/20][380/440]  lr: 0.00065, eta: 4 days, 1:13:04, mem: 6371, data_time: 40.54, iter_time: 41.30, semantic_loss: 0.6305, offset_loss: 2.7140, cls_loss: 0.3335, mask_loss: 0.2898, iou_score_loss: 0.0015, loss: 3.9693
2022-09-05 06:00:55,995 - INFO - Epoch [9/20][390/440]  lr: 0.00065, eta: 3 days, 23:53:21, mem: 6371, data_time: 42.81, iter_time: 43.18, semantic_loss: 0.9054, offset_loss: 2.2620, cls_loss: 0.2450, mask_loss: 0.0000, iou_score_loss: 0.0000, loss: 3.4124
2022-09-05 06:10:57,471 - INFO - Epoch [9/20][400/440]  lr: 0.00065, eta: 3 days, 23:20:20, mem: 6371, data_time: 56.66, iter_time: 57.49, semantic_loss: 0.9331, offset_loss: 2.0132, cls_loss: 0.3452, mask_loss: 0.0316, iou_score_loss: 0.0023, loss: 3.3254
2022-09-05 06:24:10,923 - INFO - Epoch [9/20][410/440]  lr: 0.00065, eta: 3 days, 23:26:27, mem: 6371, data_time: 30.72, iter_time: 31.23, semantic_loss: 0.6740, offset_loss: 1.5058, cls_loss: 0.5276, mask_loss: 0.0994, iou_score_loss: 0.0075, loss: 2.8144
2022-09-05 06:32:06,449 - INFO - Epoch [9/20][420/440]  lr: 0.00065, eta: 3 days, 22:30:21, mem: 6371, data_time: 13.14, iter_time: 13.77, semantic_loss: 0.6104, offset_loss: 1.5439, cls_loss: 0.5656, mask_loss: 0.3663, iou_score_loss: 0.0174, loss: 3.1037
2022-09-05 06:42:45,222 - INFO - Epoch [9/20][430/440]  lr: 0.00065, eta: 3 days, 22:07:09, mem: 6371, data_time: 73.94, iter_time: 74.77, semantic_loss: 0.5687, offset_loss: 3.1296, cls_loss: 0.2770, mask_loss: 0.0943, iou_score_loss: 0.0001, loss: 4.0696
2022-09-05 06:49:19,526 - INFO - Epoch [9/20][440/440]  lr: 0.00065, eta: 3 days, 20:59:43, mem: 6371, data_time: 26.22, iter_time: 26.78, semantic_loss: 0.7751, offset_loss: 2.4242, cls_loss: 0.2886, mask_loss: 0.0000, iou_score_loss: 0.0000, loss: 3.4879
2022-09-05 06:49:19,964 - INFO - Validation
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [19:40<00:00, 73.75s/it]
2022-09-05 07:09:00,017 - INFO - Evaluate instance segmentation

################################################################
what           :      AP  AP_50%  AP_25%      AR  RC_50%  RC_25%
################################################################
Class1         :   0.001   0.007   0.026   0.006   0.050   0.225
Class2         :   0.003   0.011   0.064   0.048   0.130   0.290
Class3         :   0.000   0.000   0.001   0.000   0.000   0.009
Class4         :   0.000   0.000   0.000   0.000   0.000   0.000
Class5         :   0.003   0.013   0.055   0.026   0.055   0.170
Class6         :   0.000   0.000   0.000   0.000   0.000   0.000
Class7         :   0.000   0.000   0.000   0.000   0.000   0.000
Class8         :   0.000   0.000   0.000   0.000   0.000   0.000
----------------------------------------------------------------
average        :   0.001   0.004   0.018   0.010   0.029   0.087
################################################################

2022-09-05 07:09:10,192 - INFO - AP: 0.001. AP_50: 0.004. AP_25: 0.018
2022-09-05 07:09:10,193 - INFO - Evaluate semantic segmentation and offset MAE
2022-09-05 07:09:13,858 - INFO - Class-wise mIoU: 5.6 40.1 5.7 0.0 89.5 1.8 0.0 1.5
2022-09-05 07:09:13,858 - INFO - mIoU: 18.0
2022-09-05 07:09:14,661 - INFO - Acc: 88.7
2022-09-05 07:09:17,182 - INFO - Offset MAE: 3.396
2022-09-05 07:09:17,186 - INFO - Summary name val/Offset MAE is illegal; using val/Offset_MAE instead.
2022-09-05 07:16:44,830 - INFO - Epoch [10/20][10/440]  lr: 0.00058, eta: 2 days, 12:03:28, mem: 6371, data_time: 24.11, iter_time: 24.79, semantic_loss: 0.7134, offset_loss: 1.7887, cls_loss: 0.2242, mask_loss: 0.5697, iou_score_loss: 0.0037, loss: 3.2996
2022-09-05 07:26:15,497 - INFO - Epoch [10/20][20/440]  lr: 0.00058, eta: 2 days, 20:10:11, mem: 6371, data_time: 42.56, iter_time: 43.06, semantic_loss: 0.6615, offset_loss: 2.0141, cls_loss: 0.4600, mask_loss: 0.5370, iou_score_loss: 0.0092, loss: 3.6818
2022-09-05 07:39:17,985 - INFO - Epoch [10/20][30/440]  lr: 0.00058, eta: 3 days, 8:12:07, mem: 6371, data_time: 39.17, iter_time: 39.49, semantic_loss: 0.5394, offset_loss: 1.3165, cls_loss: 0.8099, mask_loss: 0.3493, iou_score_loss: 0.0103, loss: 3.0255
2022-09-05 07:47:41,902 - INFO - Epoch [10/20][40/440]  lr: 0.00058, eta: 3 days, 4:49:25, mem: 6371, data_time: 87.49, iter_time: 87.98, semantic_loss: 0.4849, offset_loss: 1.4455, cls_loss: 0.3663, mask_loss: 0.2253, iou_score_loss: 0.0050, loss: 2.5270
2022-09-05 07:55:46,603 - INFO - Epoch [10/20][50/440]  lr: 0.00058, eta: 3 days, 2:13:45, mem: 6371, data_time: 59.51, iter_time: 60.54, semantic_loss: 0.8387, offset_loss: 1.3850, cls_loss: 0.1845, mask_loss: 1.0306, iou_score_loss: 0.0152, loss: 3.4541
2022-09-05 08:02:54,277 - INFO - Epoch [10/20][60/440]  lr: 0.00058, eta: 2 days, 23:11:34, mem: 6371, data_time: 7.64, iter_time: 8.10, semantic_loss: 0.7565, offset_loss: 2.4156, cls_loss: 0.8639, mask_loss: 0.5654, iou_score_loss: 0.0069, loss: 4.6084
2022-09-05 08:21:04,940 - INFO - Epoch [10/20][70/440]  lr: 0.00058, eta: 3 days, 9:32:22, mem: 6371, data_time: 32.12, iter_time: 32.51, semantic_loss: 0.1514, offset_loss: 2.1460, cls_loss: 0.3221, mask_loss: 0.0720, iou_score_loss: 0.0058, loss: 2.6972
2022-09-05 08:32:50,745 - INFO - Epoch [10/20][80/440]  lr: 0.00058, eta: 3 days, 10:51:46, mem: 6371, data_time: 26.15, iter_time: 26.78, semantic_loss: 0.7846, offset_loss: 1.4754, cls_loss: 0.2752, mask_loss: 0.1674, iou_score_loss: 0.0108, loss: 2.7135
2022-09-05 08:39:18,155 - INFO - Epoch [10/20][90/440]  lr: 0.00058, eta: 3 days, 7:10:50, mem: 6371, data_time: 29.78, iter_time: 30.56, semantic_loss: 0.6737, offset_loss: 2.3874, cls_loss: 0.8284, mask_loss: 0.1901, iou_score_loss: 0.0141, loss: 4.0937
2022-09-05 08:45:44,075 - INFO - Epoch [10/20][100/440]  lr: 0.00058, eta: 3 days, 4:11:38, mem: 6371, data_time: 67.77, iter_time: 68.23, semantic_loss: 1.1888, offset_loss: 1.5374, cls_loss: 0.3739, mask_loss: 0.1771, iou_score_loss: 0.0122, loss: 3.2894
2022-09-05 08:53:37,423 - INFO - Epoch [10/20][110/440]  lr: 0.00058, eta: 3 days, 2:46:29, mem: 6371, data_time: 17.02, iter_time: 17.51, semantic_loss: 0.4167, offset_loss: 1.9679, cls_loss: 0.4411, mask_loss: 0.2792, iou_score_loss: 0.0032, loss: 3.1082
2022-09-05 09:08:19,021 - INFO - Epoch [10/20][120/440]  lr: 0.00058, eta: 3 days, 6:01:51, mem: 6371, data_time: 12.16, iter_time: 12.72, semantic_loss: 0.4704, offset_loss: 1.5466, cls_loss: 0.4734, mask_loss: 0.0270, iou_score_loss: 0.0035, loss: 2.5209
2022-09-05 09:17:40,020 - INFO - Epoch [10/20][130/440]  lr: 0.00058, eta: 3 days, 5:31:19, mem: 6371, data_time: 54.19, iter_time: 55.02, semantic_loss: 0.5907, offset_loss: 1.6633, cls_loss: 0.1932, mask_loss: 0.4982, iou_score_loss: 0.0087, loss: 2.9542
2022-09-05 09:26:31,235 - INFO - Epoch [10/20][140/440]  lr: 0.00058, eta: 3 days, 4:47:08, mem: 6371, data_time: 39.79, iter_time: 40.32, semantic_loss: 1.2004, offset_loss: 1.6347, cls_loss: 0.3721, mask_loss: 0.2379, iou_score_loss: 0.0004, loss: 3.4455
2022-09-05 09:32:43,499 - INFO - Epoch [10/20][150/440]  lr: 0.00058, eta: 3 days, 2:44:50, mem: 6371, data_time: 31.52, iter_time: 32.24, semantic_loss: 0.5547, offset_loss: 2.4628, cls_loss: 0.3031, mask_loss: 0.3489, iou_score_loss: 0.0148, loss: 3.6843
2022-09-05 09:44:09,719 - INFO - Epoch [10/20][160/440]  lr: 0.00058, eta: 3 days, 3:30:06, mem: 6371, data_time: 75.29, iter_time: 75.88, semantic_loss: 0.9622, offset_loss: 1.8271, cls_loss: 0.3554, mask_loss: 0.0000, iou_score_loss: 0.0000, loss: 3.1446
2022-09-05 10:03:45,105 - INFO - Epoch [10/20][170/440]  lr: 0.00058, eta: 3 days, 7:52:39, mem: 6371, data_time: 108.49, iter_time: 108.66, semantic_loss: 1.0215, offset_loss: 1.8004, cls_loss: 0.1988, mask_loss: 0.0000, iou_score_loss: 0.0000, loss: 3.0206
2022-09-05 10:17:19,575 - INFO - Epoch [10/20][180/440]  lr: 0.00058, eta: 3 days, 9:08:08, mem: 6371, data_time: 113.18, iter_time: 114.05, semantic_loss: 0.6797, offset_loss: 2.1404, cls_loss: 0.2037, mask_loss: 0.2855, iou_score_loss: 0.0047, loss: 3.3140
2022-09-05 10:32:53,777 - INFO - Epoch [10/20][190/440]  lr: 0.00058, eta: 3 days, 11:03:04, mem: 6371, data_time: 67.61, iter_time: 67.90, semantic_loss: 0.8200, offset_loss: 1.3363, cls_loss: 0.2389, mask_loss: 0.2526, iou_score_loss: 0.0146, loss: 2.6623
Traceback (most recent call last):
  File "./tools/train.py", line 191, in <module>
    main()
  File "./tools/train.py", line 184, in main
    train(epoch, model, optimizer, scaler, train_loader, cfg, logger, writer)
  File "./tools/train.py", line 48, in train
    loss, log_vars = model(batch, return_loss=True)
  File "/home/qinyuan/anaconda3/envs/argos_2step/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/qinyuan/anaconda3/envs/argos_2step/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/qinyuan/anaconda3/envs/argos_2step/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/qinyuan/SoftGroup_argos2step/softgroup/model/softgroup.py", line 102, in forward
    return self.forward_train(**batch)
  File "/home/qinyuan/SoftGroup_argos2step/softgroup/util/utils.py", line 171, in wrapper
    return func(*new_args, **new_kwargs)
  File "/home/qinyuan/SoftGroup_argos2step/softgroup/model/softgroup.py", line 126, in forward_train
    self.grouping_cfg)
  File "/home/qinyuan/SoftGroup_argos2step/softgroup/util/fp16.py", line 58, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/qinyuan/SoftGroup_argos2step/softgroup/model/softgroup.py", line 371, in forward_grouping
    proposals_idx = torch.cat(proposals_idx_list, dim=0)
NotImplementedError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat.  This usually means that this function requires a non-empty list of Tensors, or that you (the operator writer) forgot to register a fallback function.  Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Python, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, AutocastCPU, Autocast, Batched, VmapMode, Functionalize].

CPU: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/build/aten/src/ATen/RegisterCPU.cpp:21063 [kernel]
CUDA: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/build/aten/src/ATen/RegisterCUDA.cpp:29726 [kernel]
QuantizedCPU: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/build/aten/src/ATen/RegisterQuantizedCPU.cpp:1258 [kernel]
BackendSelect: fallthrough registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/core/PythonFallbackKernel.cpp:47 [backend fallback]
Named: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/ConjugateFallback.cpp:18 [backend fallback]
Negative: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
AutogradCPU: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
AutogradCUDA: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
AutogradXLA: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
AutogradLazy: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
AutogradXPU: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
AutogradMLC: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
AutogradHPU: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
AutogradNestedTensor: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
AutogradPrivateUse1: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
AutogradPrivateUse2: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
AutogradPrivateUse3: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/VariableType_3.cpp:11380 [autograd kernel]
Tracer: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/autograd/generated/TraceType_3.cpp:11220 [kernel]
AutocastCPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/autocast_mode.cpp:461 [backend fallback]
Autocast: fallthrough registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/autocast_mode.cpp:305 [backend fallback]
Batched: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/BatchingRegistrations.cpp:1059 [backend fallback]
VmapMode: fallthrough registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Functionalize: registered at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/FunctionalizeFallbackKernel.cpp:52 [backend fallback]

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 477685) of binary: /home/qinyuan/anaconda3/envs/argos_2step/bin/python
Traceback (most recent call last):
  File "/home/qinyuan/anaconda3/envs/argos_2step/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/qinyuan/anaconda3/envs/argos_2step/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/qinyuan/anaconda3/envs/argos_2step/lib/python3.7/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/qinyuan/anaconda3/envs/argos_2step/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/qinyuan/anaconda3/envs/argos_2step/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/qinyuan/anaconda3/envs/argos_2step/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-09-05_10:39:05
  host      : DeepM
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 477685)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
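
(From the traceback, torch.cat is being called on an empty proposals_idx_list: with the current thresholds, no class produced a single proposal for that batch, so there is nothing to concatenate. A defensive sketch of the failing spot — a hypothetical guard, not the repository's actual code:)

import torch

def cat_proposals(proposals_idx_list, proposals_offset_list):
    # Hypothetical guard: when no class yields any proposal, return empty
    # tensors instead of letting torch.cat raise on an empty list, so the
    # iteration can skip the instance branch.
    if not proposals_idx_list:
        empty_idx = torch.empty((0, 2), dtype=torch.int32)
        empty_offset = torch.empty((0,), dtype=torch.int32)
        return empty_idx, empty_offset
    return torch.cat(proposals_idx_list, dim=0), torch.cat(proposals_offset_list, dim=0)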

How can I improve this result and solve the problem? Thank you very much!

thangvubk commented 2 years ago

Hi. In my experiments, I also found that training without class weights achieves semantic segmentation performance at least comparable to training with class weights. I think your results are normal.

Fan-QY commented 2 years ago

> Hi. In my experiments, I also found that training without class weights achieves semantic segmentation performance at least comparable to training with class weights. I think your results are normal.

Thank you for your reply. But when I use the backbone results without weights for further training, the results are very poor and an error occurs in the 9th epoch. Do you have any idea what the problem could be? And am I setting the weights the right way?
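
(For reference, my assumption of how the semantic_weight list is consumed — a sketch of standard weighted cross-entropy in PyTorch, not necessarily the exact SoftGroup code:)

import torch
import torch.nn as nn

# Assumed mapping from the config: one weight per semantic class, with
# ignore_index matching ignore_label; check softgroup/model/softgroup.py
# for how semantic_weight is actually applied.
semantic_weight = [0.1164, 0.0210, 0.0402, 2.9403, 0.0037, 3.8085, 3.5853, 0.5393]
semantic_criterion = nn.CrossEntropyLoss(
    weight=torch.tensor(semantic_weight), ignore_index=-100)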

Fan-QY commented 2 years ago

> Hi. In my experiments, I also found that training without class weights achieves semantic segmentation performance at least comparable to training with class weights. I think your results are normal.

And how should I improve the training results of the backbone network now? Thanks so much!

thangvubk commented 2 years ago

Did you pretrain the backbone before training the whole network? And what are the semantic mIoU and offset MAE of the pretrained model?

Fan-QY commented 2 years ago

> Did you pretrain the backbone before training the whole network? And what are the semantic mIoU and offset MAE of the pretrained model?

Yes, I have completed the backbone training; its semantic mIoU is 37.0 and its offset MAE is 3.000:

2022-08-24 08:59:41,476 - INFO - Evaluate semantic segmentation and offset MAE
2022-08-24 08:59:45,174 - INFO - Class-wise mIoU: 40.5 74.9 21.7 0.1 97.4 22.8 0.0 38.9
2022-08-24 08:59:45,174 - INFO - mIoU: 37.0
2022-08-24 08:59:46,006 - INFO - Acc: 96.6
2022-08-24 08:59:52,687 - INFO - Offset MAE: 3.000