megvii-research / FSCE

Exception: process 0 terminated with signal SIGSEGV #44

Closed wongyufei closed 2 years ago

wongyufei commented 2 years ago

Here are the details.

(FSCE) [wangyufei@node03 FSCE]$ CUDA_VISIBLE_DEVICES=8,9 python tools/train_net.py --num-gpus 2 --config-file configs/PASCAL_VOC/base-training/R101_FPN_base_training_split1.yml
Command Line Args: Namespace(config_file='configs/PASCAL_VOC/base-training/R101_FPN_base_training_split1.yml', dist_url='tcp://127.0.0.1:50363', end_iter=-1, eval_all=False, eval_during_train=False, eval_iter=-1, eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False, start_iter=-1)
[09/13 20:20:43 fsdet]: Rank of current process: 0. World size: 2
[09/13 20:20:43 fsdet]: Command line arguments: Namespace(config_file='configs/PASCAL_VOC/base-training/R101_FPN_base_training_split1.yml', dist_url='tcp://127.0.0.1:50363', end_iter=-1, eval_all=False, eval_during_train=False, eval_iter=-1, eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False, start_iter=-1)
[09/13 20:20:43 fsdet]: Contents of args.config_file=configs/PASCAL_VOC/base-training/R101_FPN_base_training_split1.yml:
_BASE_: "../../Base-RCNN-FPN.yaml"
MODEL:
  WEIGHTS: "checkpoints/pretrained_model/R-101.pkl"
  MASK_ON: False
  RESNETS:
    DEPTH: 101
  ROI_HEADS:
    NUM_CLASSES: 15
INPUT:
  MIN_SIZE_TRAIN: (480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800)
  MIN_SIZE_TEST: 800
DATASETS:
  TRAIN: ('voc_2007_trainval_base1', 'voc_2012_trainval_base1')
  TEST: ('voc_2007_test_base1',)
SOLVER:
  STEPS: (12000, 16000)
  MAX_ITER: 18000  # 17.4 epochs
  WARMUP_ITERS: 100
OUTPUT_DIR: "checkpoints/voc/faster_rcnn/faster_rcnn_R_101_FPN_base1"

[09/13 20:20:43 fsdet]: Full config saved to /home/wangyufei/Code/FSCE/checkpoints/voc/faster_rcnn/faster_rcnn_R_101_FPN_base1/config.yaml
[09/13 20:20:43 fsdet.utils.env]: Using a generated random seed 43966721
frozen resnet backbone stage 2 (this froze ResNet but not FPN, frozen backbone in rcnn.py will overwrite this)
frozen resnet backbone stage 2 (this froze ResNet but not FPN, frozen backbone in rcnn.py will overwrite this)
-------- Using Roi Head: StandardROIHeads---------

-------- Using Roi Head: StandardROIHeads---------

[09/13 20:22:41 fsdet.data.build]: Removed 1920 images with no usable annotations. 14631 images left.
[09/13 20:22:41 fsdet.data.build]: Distribution of training instances among all 15 categories:
category     #instances   category      #instances   category      #instances
aeroplane    1285         bicycle       1208         boat          1397
bottle       2116         car           4008         cat           1616
chair        4338         diningtable   1057         dog           2079
horse        1156         person        15576        pottedplant   1724
sheep        1347         train         984          tvmonitor     1193
total        41084

[09/13 20:22:41 fsdet.data.detection_utils]: TransformGens used in training: [ResizeShortestEdge(short_edge_length=(480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[09/13 20:22:41 fsdet.data.build]: Using training sampler TrainingSampler
[09/13 20:23:14 fvcore.common.checkpoint]: Loading checkpoint from checkpoints/pretrained_model/R-101.pkl
[09/13 20:23:14 fsdet.checkpoint.c2_model_loading]: Remapping C2 weights ......
[09/13 20:23:15 fsdet.checkpoint.c2_model_loading]: Some model parameters are not in the checkpoint:
  backbone.fpn_lateral2.{bias, weight}
  backbone.fpn_lateral3.{bias, weight}
  backbone.fpn_lateral4.{bias, weight}
  backbone.fpn_lateral5.{bias, weight}
  backbone.fpn_output2.{bias, weight}
  backbone.fpn_output3.{bias, weight}
  backbone.fpn_output4.{bias, weight}
  backbone.fpn_output5.{bias, weight}
  proposal_generator.anchor_generator.cell_anchors.{0, 1, 2, 3, 4}
  proposal_generator.rpn_head.anchor_deltas.{bias, weight}
  proposal_generator.rpn_head.conv.{bias, weight}
  proposal_generator.rpn_head.objectness_logits.{bias, weight}
  roi_heads.box_head.fc1.{bias, weight}
  roi_heads.box_head.fc2.{bias, weight}
  roi_heads.box_predictor.bbox_pred.{bias, weight}
  roi_heads.box_predictor.cls_score.{bias, weight}
[09/13 20:23:15 fsdet.checkpoint.c2_model_loading]: The checkpoint contains parameters not used by the model:
  fc1000_b
  fc1000_w
[09/13 20:23:15 fsdet.engine.train_loop]: Starting training from iteration 0
/home/wangyufei/anaconda3/envs/FSCE/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "tools/train_net.py", line 130, in <module>
    args=(args,),
  File "/home/wangyufei/Code/FSCE/fsdet/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/wangyufei/anaconda3/envs/FSCE/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/wangyufei/anaconda3/envs/FSCE/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGSEGV
(FSCE) [wangyufei@node03 FSCE]$ /home/wangyufei/anaconda3/envs/FSCE/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
  len(cache))

(FSCE) [wangyufei@node03 FSCE]$
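
The traceback above comes from the parent process, which only reports that worker 0 died with SIGSEGV. It does not show where inside the worker the crash happened. One possible way to get more detail, offered as a sketch and not as a confirmed fix, is Python's standard `faulthandler` module, which dumps the Python stack when a fatal signal arrives. The helper name `enable_crash_dumps` is hypothetical and is not part of FSCE. It would have to be called at the start of each spawned worker.

```python
import faulthandler
import sys


def enable_crash_dumps(stream=sys.stderr):
    """Ask Python to dump the stack of all threads on a fatal signal.

    Hypothetical helper for illustration; not part of FSCE or fsdet.
    Returns True once the fault handler is active.
    """
    # faulthandler.enable() installs handlers for SIGSEGV, SIGFPE,
    # SIGABRT, SIGBUS and SIGILL. The stream must have a real file
    # descriptor, because the dump is written at the C level.
    if not faulthandler.is_enabled():
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    # Assumption: each worker calls this before building the model.
    enable_crash_dumps()
```

Setting the environment variable `PYTHONFAULTHANDLER=1` before launching `tools/train_net.py` has the same effect without code changes, and the spawned worker processes inherit it from the parent.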