roytseng-tw / Detectron.pytorch

A PyTorch implementation of Detectron. Both training from scratch and inference directly from pretrained Detectron weights are available.
MIT License
2.82k stars 544 forks

AssertionError: Range subprocess failed (exit code: 1) #63

Closed tunglm2203 closed 6 years ago

tunglm2203 commented 6 years ago

Hi @roytseng-tw, when I evaluate my training result, I get the error below:

INFO subprocess.py: 129: # ---------------------------------------------------------------------------- #
INFO subprocess.py: 131: stdout of subprocess 0 with range [1, 1250]
INFO subprocess.py: 133: # ---------------------------------------------------------------------------- #
Traceback (most recent call last):
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 4, in <module>
    import cv2
ImportError: No module named cv2
Traceback (most recent call last):
  File "tools/test_net.py", line 119, in <module>
    check_expected_results=True)
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
    all_results = result_getter()
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
    multi_gpu=multi_gpu_testing
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 155, in test_net_on_dataset
    args, dataset_name, proposal_file, num_images, output_dir
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 187, in multi_gpu_test_net_on_dataset
    args.load_ckpt, args.load_detectron, opts
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 109, in process_in_parallel
    log_subprocess_output(i, p, output_dir, tag, start, end)
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 147, in log_subprocess_output
    assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 1)

I have installed opencv and can import cv2 successfully, so I don't know what causes this problem. I have tried the solution in https://github.com/facebookresearch/Detectron/issues/349 but it did not help. In the config file e2e_mask_rcnn_R-50-C4_1x.yaml, I only changed NUM_GPUS and kept everything else as in the original. Can you tell me what the problem is?

The command that I ran: python3 tools/test_net.py --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-C4_1x.yaml --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth --multi-gpu-testing --output_dir Output_val

System information

roytseng-tw commented 6 years ago

https://github.com/roytseng-tw/Detectron.pytorch/blob/master/lib/utils/subprocess.py#L71 Changing python to python3 on that line may solve your problem.
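A minimal sketch of why the interpreter name matters: a child launched as `python` can resolve to a different interpreter (one without cv2 installed) than the `python3` that runs the parent. One common way to avoid the mismatch is to launch the child with `sys.executable`. This is an illustration under that assumption, not the repository's actual code, and the helper name `run_child_with_same_interpreter` is hypothetical.

```python
import subprocess
import sys


def run_child_with_same_interpreter(code):
    """Run `code` in a child process using the parent's own interpreter.

    Hypothetical helper for illustration; not part of Detectron.pytorch.
    Returns (exit_code, stripped_stdout).
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode, result.stdout.strip()


if __name__ == "__main__":
    # The child reports its major version; it matches the parent's because
    # both run under the same interpreter binary.
    ret, major = run_child_with_same_interpreter(
        "import sys; print(sys.version_info[0])"
    )
    print("exit code:", ret)
    print("child major version:", major)
    print("parent major version:", sys.version_info[0])
```

Any package importable in the parent (such as cv2) is then importable in the child as well, because both share the same site-packages.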

tunglm2203 commented 6 years ago

Thanks @roytseng-tw for the quick reply. I modified the line you linked, and the ImportError: No module named cv2 message is gone. But the subprocess problem still exists.

DEBUG: Run into test_net_data_set() INFO subprocess.py: 88: detection range command 0: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 0 1250 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth INFO subprocess.py: 88: detection range command 1: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 1250 2500 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth INFO subprocess.py: 88: detection range command 2: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 2500 3750 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth INFO subprocess.py: 88: detection range command 3: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 3750 5000 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth INFO subprocess.py: 128: # ---------------------------------------------------------------------------- # INFO subprocess.py: 130: stdout of subprocess 0 with range [1, 1250] INFO subprocess.py: 132: # ---------------------------------------------------------------------------- # INFO test_net.py: 73: Called with args: INFO test_net.py: 74: Namespace(cfg_file='Output_val/detection_range_config.yaml', dataset=None, load_ckpt='Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth', load_detectron=None, 
multi_gpu_testing=False, output_dir='Output_val', range=[0, 1250], set_cfgs=['TEST.DATASETS', '("coco_2017_val",)'], vis=False) Traceback (most recent call last): File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 76, in assert (torch.cuda.device_count() == 1) ^ bool(args.multi_gpu_testing) AssertionError Traceback (most recent call last): File "tools/test_net.py", line 117, in check_expected_results=True) File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference all_results = result_getter() File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter multi_gpu=multi_gpu_testing File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 155, in test_net_on_dataset args, dataset_name, proposal_file, num_images, output_dir File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 187, in multi_gpu_test_net_on_dataset args.load_ckpt, args.load_detectron, opts File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 108, in process_in_parallel log_subprocess_output(i, p, output_dir, tag, start, end) File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 146, in log_subprocess_output assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret) AssertionError: Range subprocess failed (exit code: 1)

tunglm2203 commented 6 years ago

I have modified the command in https://github.com/roytseng-tw/Detectron.pytorch/blob/master/lib/utils/subprocess.py#L71 by adding --multi-gpu-testing, but now there is another problem:

INFO test_engine.py: 330: loading checkpoint Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
Traceback (most recent call last):
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 118, in <module>
    check_expected_results=True)
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
    all_results = result_getter()
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 125, in result_getter
    gpu_id=gpu_id
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 253, in test_net
    cls_boxes_i, cls_segms_i, cls_keyps_i = im_detect_all(model, im, box_proposals, timers)
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test.py", line 70, in im_detect_all
    model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, box_proposals)
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test.py", line 139, in im_detect_bbox
    return_dict = model(*inputs)
  File "/mnt/hdd/tung/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(input, **kwargs)
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/nn/parallel/data_parallel.py", line 82, in forward
    mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/nn/parallel/data_parallel.py", line 82, in <listcomp>
    mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
IndexError: list index out of range
Traceback (most recent call last):
  File "tools/test_net.py", line 118, in <module>
    check_expected_results=True)
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
    all_results = result_getter()
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
    multi_gpu=multi_gpu_testing
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 154, in test_net_on_dataset
    args, dataset_name, proposal_file, num_images, output_dir
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 186, in multi_gpu_test_net_on_dataset
    args.load_ckpt, args.load_detectron, opts
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 107, in process_in_parallel
    log_subprocess_output(i, p, output_dir, tag, start, end)
  File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 145, in log_subprocess_output
    assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 1)

roytseng-tw commented 6 years ago

You should check out the inference section in the README. Specify --multi-gpu-testing if multiple GPUs are available.

tunglm2203 commented 6 years ago

Thanks @roytseng-tw. Actually, I did pass --multi-gpu-testing in my command: python3 tools/test_net.py --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-C4_1x.yaml --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth --multi-gpu-testing --output_dir Output_val

But in https://github.com/roytseng-tw/Detectron.pytorch/blob/1833c71a62e389d2b5f873f40a914c5a47bdd8a2/lib/utils/subprocess.py#L71, --multi-gpu-testing is not passed to the subprocess. I changed the command passed to the subprocess to include it, but then I hit the other problem shown above.

roytseng-tw commented 6 years ago

You should not change anything except python --> python3.

tunglm2203 commented 6 years ago

Update: @roytseng-tw yes, I kept everything as you said. I tried evaluating on only one GPU and it ran successfully, but when I pass --multi-gpu-testing in my command and select the GPU devices through CUDA_VISIBLE_DEVICES, I still get an error.

roytseng-tw commented 6 years ago

You should not pass --multi-gpu-testing to the subprocesses. And what's the error?
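The assertion that appears in the tracebacks above, `(torch.cuda.device_count() == 1) ^ bool(args.multi_gpu_testing)`, is an exclusive-or: exactly one of "a single GPU is visible" and "--multi-gpu-testing was given" must hold. A small sketch of that truth table, with plain integers standing in for `torch.cuda.device_count()` (the function name `devices_consistent_with_mode` is hypothetical):

```python
def devices_consistent_with_mode(device_count, multi_gpu_testing):
    """Mirror the XOR check seen in the traceback.

    True when exactly one condition holds:
      - a single GPU is visible (a range subprocess), or
      - --multi-gpu-testing was requested (the parent process).
    """
    return (device_count == 1) ^ bool(multi_gpu_testing)


if __name__ == "__main__":
    cases = [
        (1, False),  # range subprocess pinned to one GPU
        (1, True),   # flag given, but only one GPU visible
        (2, False),  # subprocess sees several GPUs, no flag
        (2, True),   # parent process with several GPUs
    ]
    for count, flag in cases:
        ok = devices_consistent_with_mode(count, flag)
        print("devices={} multi_gpu_testing={} -> {}".format(count, flag, ok))
```

A range subprocess is started without the flag, so it passes the check only if it sees exactly one GPU. Adding the flag to the subprocess command satisfies the assertion while several GPUs are visible, but it does not make the subprocess a valid multi-GPU parent, which is consistent with the later failure in the thread.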

tunglm2203 commented 6 years ago

Here is error, INFO test_engine.py: 330: loading checkpoint Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 626/1250 1.590s + 0.056s (eta: 0:17:06) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 636/1250 0.365s + 0.030s (eta: 0:04:02) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 646/1250 0.330s + 0.028s (eta: 0:03:36) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 656/1250 0.337s + 0.031s (eta: 0:03:38) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 666/1250 0.333s + 0.029s (eta: 0:03:31) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 676/1250 0.313s + 0.027s (eta: 0:03:15) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 686/1250 0.314s + 0.025s (eta: 0:03:11) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 696/1250 0.305s + 0.024s (eta: 0:03:02) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 706/1250 0.298s + 0.023s (eta: 0:02:54) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 716/1250 0.307s + 0.024s (eta: 0:02:56) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 726/1250 0.305s + 0.024s (eta: 0:02:52) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 736/1250 0.301s + 0.024s (eta: 0:02:46) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 746/1250 0.302s + 0.023s (eta: 0:02:44) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 756/1250 0.298s + 0.023s (eta: 0:02:38) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 766/1250 0.298s + 0.022s (eta: 0:02:35) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 776/1250 0.296s + 0.022s (eta: 0:02:30) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 786/1250 0.295s + 0.022s (eta: 0:02:27) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 796/1250 0.290s + 
0.022s (eta: 0:02:21) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 806/1250 0.293s + 0.023s (eta: 0:02:20) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 816/1250 0.292s + 0.022s (eta: 0:02:16) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 826/1250 0.292s + 0.022s (eta: 0:02:13) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 836/1250 0.293s + 0.022s (eta: 0:02:10) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 846/1250 0.296s + 0.022s (eta: 0:02:08) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 856/1250 0.297s + 0.022s (eta: 0:02:05) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 866/1250 0.296s + 0.022s (eta: 0:02:01) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 876/1250 0.295s + 0.022s (eta: 0:01:58) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 886/1250 0.294s + 0.022s (eta: 0:01:54) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 896/1250 0.292s + 0.021s (eta: 0:01:51) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 906/1250 0.292s + 0.021s (eta: 0:01:47) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 916/1250 0.291s + 0.021s (eta: 0:01:44) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 926/1250 0.292s + 0.022s (eta: 0:01:41) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 936/1250 0.291s + 0.021s (eta: 0:01:38) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 946/1250 0.289s + 0.021s (eta: 0:01:34) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 956/1250 0.288s + 0.021s (eta: 0:01:30) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 966/1250 0.287s + 0.021s (eta: 0:01:27) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 976/1250 0.287s + 0.022s (eta: 0:01:24) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 986/1250 0.287s + 0.021s (eta: 0:01:21) 
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 996/1250 0.285s + 0.021s (eta: 0:01:17) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1006/1250 0.287s + 0.021s (eta: 0:01:15) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1016/1250 0.287s + 0.021s (eta: 0:01:12) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1026/1250 0.289s + 0.022s (eta: 0:01:09) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1036/1250 0.289s + 0.022s (eta: 0:01:06) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1046/1250 0.289s + 0.022s (eta: 0:01:03) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1056/1250 0.288s + 0.021s (eta: 0:01:00) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1066/1250 0.288s + 0.021s (eta: 0:00:56) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1076/1250 0.287s + 0.021s (eta: 0:00:53) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1086/1250 0.287s + 0.021s (eta: 0:00:50) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1096/1250 0.287s + 0.021s (eta: 0:00:47) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1106/1250 0.288s + 0.021s (eta: 0:00:44) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1116/1250 0.287s + 0.021s (eta: 0:00:41) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1126/1250 0.287s + 0.021s (eta: 0:00:38) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1136/1250 0.288s + 0.021s (eta: 0:00:35) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1146/1250 0.289s + 0.021s (eta: 0:00:32) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1156/1250 0.290s + 0.021s (eta: 0:00:29) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1166/1250 0.290s + 0.021s (eta: 0:00:26) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1176/1250 0.289s + 0.021s (eta: 0:00:22) INFO 
test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1186/1250 0.289s + 0.021s (eta: 0:00:19) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1196/1250 0.289s + 0.021s (eta: 0:00:16) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1206/1250 0.289s + 0.021s (eta: 0:00:13) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1216/1250 0.292s + 0.022s (eta: 0:00:10) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1226/1250 0.291s + 0.022s (eta: 0:00:07) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1236/1250 0.291s + 0.022s (eta: 0:00:04) INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1246/1250 0.291s + 0.022s (eta: 0:00:01) INFO test_engine.py: 314: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/test_output/detection_range_625_1250.pkl

INFO test_engine.py: 211: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/test_output/detections.pkl INFO test_engine.py: 161: Total inference time: 212.193s INFO task_evaluation.py: 75: Evaluating detections Traceback (most recent call last): File "tools/test_net.py", line 118, in check_expected_results=True) File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference all_results = result_getter() File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter multi_gpu=multi_gpu_testing File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 163, in test_net_on_dataset dataset, all_boxes, all_segms, all_keyps, output_dir File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 59, in evaluate_all dataset, all_boxes, output_dir, use_matlab=use_matlab File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 79, in evaluate_boxes dataset, all_boxes, output_dir, use_salt=not_comp, cleanup=not_comp File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 135, in evaluate_boxes _write_coco_bbox_results_file(json_dataset, all_boxes, res_file) File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 160, in _write_coco_bbox_results_file json_dataset, all_boxes[cls_ind], cat_id)) File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 171, in _coco_bbox_results_one_category assert len(boxes) == len(image_ids) AssertionError

To be clear: in the config file e2e_mask_rcnn_R-50-C4_1x.yaml I set NUM_GPUS=2, and I pass the GPU IDs through CUDA_VISIBLE_DEVICES=5,6.

roytseng-tw commented 6 years ago

First, you don't need to change anything in the config file if you use CUDA_VISIBLE_DEVICES to set the available GPUs. Second, your log (range [626, 1250] of 5000) suggests that you were running the test on 8 GPUs instead of 2.

Here is my deduction: you are on a machine with 8 GPUs (5000/625 = 8), and you didn't successfully set CUDA_VISIBLE_DEVICES.
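The deduction can be checked with a little arithmetic. The sketch below assumes the 5000 validation images are split into contiguous, near-equal half-open ranges, one per visible GPU; that assumption is consistent with the `--range` values in the logs above, but `split_ranges` is a hypothetical helper, not the repository's code.

```python
def split_ranges(total_images, num_workers):
    """Split [0, total_images) into num_workers contiguous half-open ranges.

    When the division is uneven, the first ranges get one extra image.
    """
    base, extra = divmod(total_images, num_workers)
    ranges = []
    start = 0
    for i in range(num_workers):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


if __name__ == "__main__":
    # With 8 workers the second range is (625, 1250), matching the
    # detection_range_625_1250.pkl file name in the log.
    print(split_ranges(5000, 8)[1])
    # With 2 workers the ranges would be (0, 2500) and (2500, 5000).
    print(split_ranges(5000, 2))
```

A chunk size of 625 therefore points to 8 workers, while a correct 2-GPU run would show chunks of 2500.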

CUDA_VISIBLE_DEVICES is an environment variable read by the CUDA driver. To set it, you can do either of the following:
1) export CUDA_VISIBLE_DEVICES=5,6
2) CUDA_VISIBLE_DEVICES=5,6 python tools/test_net.py ...

tunglm2203 commented 6 years ago

Yes, I am on a machine with 8 GPUs, but I am only allowed to run on 2 of them, so I want to use only GPUs 5 and 6. I ran it as you said: CUDA_VISIBLE_DEVICES=5,6 && python tools/test_net.py .... Here is the tail of my log:

INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1246/1250 0.292s + 0.021s (eta: 0:00:01) INFO test_engine.py: 314: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_625_1250.pkl

INFO test_engine.py: 211: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detections.pkl INFO test_engine.py: 161: Total inference time: 212.159s INFO task_evaluation.py: 75: Evaluating detections Traceback (most recent call last): File "tools/test_net.py", line 111, in check_expected_results=True) File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference all_results = result_getter() File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter multi_gpu=multi_gpu_testing File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 163, in test_net_on_dataset dataset, all_boxes, all_segms, all_keyps, output_dir File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 59, in evaluate_all dataset, all_boxes, output_dir, use_matlab=use_matlab File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 79, in evaluate_boxes dataset, all_boxes, output_dir, use_salt=not_comp, cleanup=not_comp File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 135, in evaluate_boxes _write_coco_bbox_results_file(json_dataset, all_boxes, res_file) File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 160, in _write_coco_bbox_results_file json_dataset, all_boxes[cls_ind], cat_id)) File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 171, in _coco_bbox_results_one_category assert len(boxes) == len(image_ids) AssertionError

roytseng-tw commented 6 years ago

I found something weird in your log, range [626, 1250] of 5000: 1246/1250. The length of the dataset and the indices do not match! Are you using clean code?

tunglm2203 commented 6 years ago

I only added these lines at the top of test_net.py and kept everything else unchanged:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="2,3"

I see that the process is now divided into 2 subprocesses, and the first range is [1, 2500], but it fails on an assert, as in the log below.

Here my new full log: INFO test_net.py: 70: Called with args: INFO test_net.py: 71: Namespace(cfg_file='configs/e2e_mask_rcnn_R-50-C4_1x.yaml', dataset='coco2017', load_ckpt='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth', load_detectron=None, multi_gpu_testing=True, output_dir=None, range=None, set_cfgs=[], vis=False) INFO test_net.py: 81: Automatically set output directory to /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test INFO test_net.py: 102: Testing with config: INFO test_net.py: 103: {'BBOX_XFORM_CLIP': 4.135166556742356, 'CROP_RESIZE_WITH_MAX_POOL': True, 'CUDA': False, 'DATA_DIR': '/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/data', 'DATA_LOADER': {'NUM_THREADS': 4}, 'DEBUG': False, 'DEDUP_BOXES': 0.0625, 'EPS': 1e-14, 'EXPECTED_RESULTS': [], 'EXPECTED_RESULTS_ATOL': 0.005, 'EXPECTED_RESULTS_EMAIL': '', 'EXPECTED_RESULTS_RTOL': 0.1, 'FAST_RCNN': {'MLP_HEAD_DIM': 1024, 'ROI_BOX_HEAD': 'ResNet.ResNet_roi_conv5_head', 'ROI_XFORM_METHOD': 'RoIAlign', 'ROI_XFORM_RESOLUTION': 14, 'ROI_XFORM_SAMPLING_RATIO': 0}, 'FPN': {'COARSEST_STRIDE': 32, 'DIM': 256, 'EXTRA_CONV_LEVELS': False, 'FPN_ON': False, 'MULTILEVEL_ROIS': False, 'MULTILEVEL_RPN': False, 'ROI_CANONICAL_LEVEL': 4, 'ROI_CANONICAL_SCALE': 224, 'ROI_MAX_LEVEL': 5, 'ROI_MIN_LEVEL': 2, 'RPN_ANCHOR_START_SIZE': 32, 'RPN_ASPECT_RATIOS': (0.5, 1, 2), 'RPN_COLLECT_SCALE': 1, 'RPN_MAX_LEVEL': 6, 'RPN_MIN_LEVEL': 2, 'ZERO_INIT_LATERAL': False}, 'KRCNN': {'CONV_HEAD_DIM': 256, 'CONV_HEAD_KERNEL': 3, 'CONV_INIT': 'GaussianFill', 'DECONV_DIM': 256, 'DECONV_KERNEL': 4, 'DILATION': 1, 'HEATMAP_SIZE': -1, 'INFERENCE_MIN_SIZE': 0, 'KEYPOINT_CONFIDENCE': 'bbox', 'LOSS_WEIGHT': 1.0, 'MIN_KEYPOINT_COUNT_FOR_VALID_MINIBATCH': 20, 'NMS_OKS': False, 'NORMALIZE_BY_VISIBLE_KEYPOINTS': True, 'NUM_KEYPOINTS': -1, 'NUM_STACKED_CONVS': 8, 'ROI_KEYPOINTS_HEAD': '', 
'ROI_XFORM_METHOD': 'RoIAlign', 'ROI_XFORM_RESOLUTION': 7, 'ROI_XFORM_SAMPLING_RATIO': 0, 'UP_SCALE': -1, 'USE_DECONV': False, 'USE_DECONV_OUTPUT': False}, 'MATLAB': 'matlab', 'MODEL': {'BBOX_REG_WEIGHTS': (10.0, 10.0, 5.0, 5.0), 'CLS_AGNOSTIC_BBOX_REG': False, 'CONV_BODY': 'ResNet.ResNet50_conv4_body', 'FASTER_RCNN': True, 'KEYPOINTS_ON': False, 'LOAD_IMAGENET_PRETRAINED_WEIGHTS': True, 'MASK_ON': True, 'NUM_CLASSES': 81, 'RPN_ONLY': False, 'SHARE_RES5': True, 'TYPE': 'generalized_rcnn', 'UNSUPERVISED_POSE': False}, 'MRCNN': {'CLS_SPECIFIC_MASK': True, 'CONV_INIT': 'MSRAFill', 'DILATION': 1, 'DIM_REDUCED': 256, 'MEMORY_EFFICIENT_LOSS': True, 'RESOLUTION': 14, 'ROI_MASK_HEAD': 'mask_rcnn_heads.mask_rcnn_fcn_head_v0upshare', 'ROI_XFORM_METHOD': 'RoIAlign', 'ROI_XFORM_RESOLUTION': 14, 'ROI_XFORM_SAMPLING_RATIO': 0, 'THRESH_BINARIZE': 0.5, 'UPSAMPLE_RATIO': 1, 'USE_FC_OUTPUT': False, 'WEIGHT_LOSS_MASK': 1.0}, 'NUM_GPUS': 8, 'OUTPUT_DIR': 'Outputs', 'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]), 'POOLING_MODE': 'crop', 'POOLING_SIZE': 7, 'PYTORCH_VERSION_LESS_THAN_040': True, 'RESNETS': {'FREEZE_AT': 2, 'IMAGENET_PRETRAINED_WEIGHTS': 'data/pretrained_model/resnet50_caffe.pth', 'NUM_GROUPS': 1, 'RES5_DILATION': 1, 'STRIDE_1X1': True, 'TRANS_FUNC': 'bottleneck_transformation', 'WIDTH_PER_GROUP': 64}, 'RETINANET': {'ANCHOR_SCALE': 4, 'ASPECT_RATIOS': (0.5, 1.0, 2.0), 'BBOX_REG_BETA': 0.11, 'BBOX_REG_WEIGHT': 1.0, 'CLASS_SPECIFIC_BBOX': False, 'INFERENCE_TH': 0.05, 'LOSS_ALPHA': 0.25, 'LOSS_GAMMA': 2.0, 'NEGATIVE_OVERLAP': 0.4, 'NUM_CONVS': 4, 'POSITIVE_OVERLAP': 0.5, 'PRE_NMS_TOP_N': 1000, 'PRIOR_PROB': 0.01, 'RETINANET_ON': False, 'SCALES_PER_OCTAVE': 3, 'SHARE_CLS_BBOX_TOWER': False, 'SOFTMAX': False}, 'RFCN': {'PS_GRID_SIZE': 3}, 'RNG_SEED': 3, 'ROOT_DIR': '/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch', 'RPN': {'ASPECT_RATIOS': (0.5, 1, 2), 'CLS_ACTIVATION': 'sigmoid', 'OUT_DIM': 512, 'OUT_DIM_AS_IN_DIM': True, 'RPN_ON': True, 'SIZES': (32, 64, 128, 
256, 512), 'STRIDE': 16}, 'SOLVER': {'BASE_LR': 0.01, 'BIAS_DOUBLE_LR': True, 'BIAS_WEIGHT_DECAY': False, 'GAMMA': 0.1, 'LOG_LR_CHANGE_THRESHOLD': 1.1, 'LRS': [], 'LR_POLICY': 'steps_with_decay', 'MAX_ITER': 180000, 'MOMENTUM': 0.9, 'SCALE_MOMENTUM': True, 'SCALE_MOMENTUM_THRESHOLD': 1.1, 'STEPS': [0, 120000, 160000], 'STEP_SIZE': 30000, 'TYPE': 'SGD', 'WARM_UP_FACTOR': 0.3333333333333333, 'WARM_UP_ITERS': 500, 'WARM_UP_METHOD': 'linear', 'WEIGHT_DECAY': 0.0001}, 'TEST': {'BBOX_AUG': {'AREA_TH_HI': 32400, 'AREA_TH_LO': 2500, 'ASPECT_RATIOS': (), 'ASPECT_RATIO_H_FLIP': False, 'COORD_HEUR': 'UNION', 'ENABLED': False, 'H_FLIP': False, 'MAX_SIZE': 4000, 'SCALES': (), 'SCALE_H_FLIP': False, 'SCALE_SIZE_DEP': False, 'SCORE_HEUR': 'UNION'}, 'BBOX_REG': True, 'BBOX_VOTE': {'ENABLED': False, 'SCORING_METHOD': 'ID', 'SCORING_METHOD_BETA': 1.0, 'VOTE_TH': 0.8}, 'COMPETITION_MODE': True, 'DATASETS': ('coco_2017_val',), 'DETECTIONS_PER_IM': 100, 'FORCE_JSON_DATASET_EVAL': False, 'KPS_AUG': {'AREA_TH': 32400, 'ASPECT_RATIOS': (), 'ASPECT_RATIO_H_FLIP': False, 'ENABLED': False, 'HEUR': 'HM_AVG', 'H_FLIP': False, 'MAX_SIZE': 4000, 'SCALES': (), 'SCALE_H_FLIP': False, 'SCALE_SIZE_DEP': False}, 'MASK_AUG': {'AREA_TH': 32400, 'ASPECT_RATIOS': (), 'ASPECT_RATIO_H_FLIP': False, 'ENABLED': False, 'HEUR': 'SOFT_AVG', 'H_FLIP': False, 'MAX_SIZE': 4000, 'SCALES': (), 'SCALE_H_FLIP': False, 'SCALE_SIZE_DEP': False}, 'MAX_SIZE': 1333, 'NMS': 0.5, 'PRECOMPUTED_PROPOSALS': False, 'PROPOSAL_FILES': (), 'PROPOSAL_LIMIT': 2000, 'RPN_MIN_SIZE': 0, 'RPN_NMS_THRESH': 0.7, 'RPN_POST_NMS_TOP_N': 1000, 'RPN_PRE_NMS_TOP_N': 6000, 'SCALE': 800, 'SCORE_THRESH': 0.05, 'SOFT_NMS': {'ENABLED': False, 'METHOD': 'linear', 'SIGMA': 0.5}}, 'TRAIN': {'ASPECT_CROPPING': False, 'ASPECT_GROUPING': True, 'ASPECT_HI': 2, 'ASPECT_LO': 0.5, 'BATCH_SIZE_PER_IM': 512, 'BBOX_INSIDE_WEIGHTS': (1.0, 1.0, 1.0, 1.0), 'BBOX_NORMALIZE_MEANS': (0.0, 0.0, 0.0, 0.0), 'BBOX_NORMALIZE_STDS': (0.1, 0.1, 0.2, 0.2), 
'BBOX_NORMALIZE_TARGETS': True, 'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': False, 'BBOX_THRESH': 0.5, 'BG_THRESH_HI': 0.5, 'BG_THRESH_LO': 0.0, 'CROWD_FILTER_THRESH': 0.7, 'DATASETS': (), 'FG_FRACTION': 0.25, 'FG_THRESH': 0.5, 'FREEZE_CONV_BODY': False, 'GT_MIN_AREA': -1, 'IMS_PER_BATCH': 1, 'MAX_SIZE': 1333, 'PROPOSAL_FILES': (), 'RPN_BATCH_SIZE_PER_IM': 256, 'RPN_FG_FRACTION': 0.5, 'RPN_MIN_SIZE': 0, 'RPN_NEGATIVE_OVERLAP': 0.3, 'RPN_NMS_THRESH': 0.7, 'RPN_POSITIVE_OVERLAP': 0.7, 'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 12000, 'RPN_STRADDLE_THRESH': 0, 'SCALES': (800,), 'SNAPSHOT_ITERS': 20000, 'USE_FLIPPED': True}, 'VIS': False, 'VIS_TH': 0.9} loading annotations into memory... Done (t=0.72s) creating index... index created! INFO subprocess.py: 87: detection range command 0: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 0 2500 --cfg /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test --load_ckpt /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth INFO subprocess.py: 87: detection range command 1: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 2500 5000 --cfg /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test --load_ckpt /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth INFO subprocess.py: 
127: # ---------------------------------------------------------------------------- # INFO subprocess.py: 129: stdout of subprocess 0 with range [1, 2500] INFO subprocess.py: 131: # ---------------------------------------------------------------------------- # INFO test_net.py: 70: Called with args: INFO test_net.py: 71: Namespace(cfg_file='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_config.yaml', dataset=None, load_ckpt='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth', load_detectron=None, multi_gpu_testing=False, output_dir='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test', range=[0, 2500], set_cfgs=['TEST.DATASETS', '("coco_2017_val",)'], vis=False) Traceback (most recent call last): File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 73, in assert (torch.cuda.device_count() == 1) ^ bool(args.multi_gpu_testing) AssertionError Traceback (most recent call last): File "tools/test_net.py", line 114, in check_expected_results=True) File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference all_results = result_getter() File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter multi_gpu=multi_gpu_testing File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 154, in test_net_on_dataset args, dataset_name, proposal_file, num_images, output_dir File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 186, in multi_gpu_test_net_on_dataset args.load_ckpt, args.load_detectron, opts File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 107, in process_in_parallel log_subprocess_output(i, p, 
output_dir, tag, start, end) File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 145, in log_subprocess_output assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret) AssertionError: Range subprocess failed (exit code: 1)

roytseng-tw commented 6 years ago

You should not add os.environ["CUDA_VISIBLE_DEVICES"]="2,3" to test_net.py
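A likely reason, sketched under an assumption: upstream Detectron pins each range subprocess to one GPU by setting CUDA_VISIBLE_DEVICES in the child's environment, and this port is assumed to do the same. Because every subprocess runs test_net.py again, an unconditional assignment at the top of that file overwrites whatever the parent passed in. The example below uses no GPU and only inspects the variable; the names `CHILD_OVERWRITES`, `CHILD_RESPECTS`, and `visible_devices_in_child` are hypothetical.

```python
import os
import subprocess
import sys

# Child that unconditionally overwrites the variable, like the two lines
# added at the top of test_net.py.
CHILD_OVERWRITES = (
    "import os; "
    "os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'; "
    "print(os.environ['CUDA_VISIBLE_DEVICES'])"
)

# Child that leaves the inherited value alone.
CHILD_RESPECTS = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def visible_devices_in_child(child_code, parent_value):
    """Start a child with CUDA_VISIBLE_DEVICES=parent_value, return what it sees."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = parent_value
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # The parent asks for a single GPU, but the overwriting child ends up with two.
    print(visible_devices_in_child(CHILD_OVERWRITES, "2"))
    # Without the overwrite, the child keeps the single GPU it was given.
    print(visible_devices_in_child(CHILD_RESPECTS, "2"))
```

With two GPUs visible and no --multi-gpu-testing flag, the subprocess fails the device-count assertion, which matches the traceback in the log above.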

tunglm2203 commented 6 years ago

@roytseng-tw if I don't add this, the command CUDA_VISIBLE_DEVICES=5,6 && python tools/test_net.py... does not work: it still detects 8 GPUs.

roytseng-tw commented 6 years ago

What's the output of this for you?

CUDA_VISIBLE_DEVICES=5,6 python -c "import torch; print(torch.cuda.device_count())"

tunglm2203 commented 6 years ago

@roytseng-tw Output is 2

tunglm2203 commented 6 years ago

I tried the command CUDA_VISIBLE_DEVICES=5,6 python tools/test_net.py ... instead of CUDA_VISIBLE_DEVICES=5,6 && python tools/test_net.py ... and it seems to help: the process is now divided into 2 subprocesses, and the first range is [1, 2500]. I am waiting for it to reach the next range ...

tunglm2203 commented 6 years ago

@roytseng-tw It ran successfully, thank you. I don't know why it does not pick up the GPU device IDs when I use && to chain the commands. Once again, thank you so much!
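The difference comes from POSIX shell semantics: `VAR=value command` places the variable in that command's environment, whereas `VAR=value && command` only sets an unexported shell variable and then runs the command without it. A minimal sketch that shows this from Python by running each form through a POSIX shell (it assumes `sh` is available; `child_sees` and `PRINT_VAR` are hypothetical names):

```python
import os
import shlex
import subprocess
import sys

# Shell fragment that prints the variable as seen by a child Python process.
PRINT_VAR = (
    shlex.quote(sys.executable)
    + " -c \"import os; print(os.environ.get('CUDA_VISIBLE_DEVICES'))\""
)


def child_sees(shell_command):
    """Run shell_command in a POSIX shell, starting with the variable unset."""
    env = {k: v for k, v in os.environ.items() if k != "CUDA_VISIBLE_DEVICES"}
    result = subprocess.run(
        shell_command,
        shell=True,
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Assignment chained with &&: a shell variable only, not exported.
    print(child_sees("CUDA_VISIBLE_DEVICES=5,6 && " + PRINT_VAR))
    # Assignment as a command prefix: placed in the child's environment.
    print(child_sees("CUDA_VISIBLE_DEVICES=5,6 " + PRINT_VAR))
    # Explicit export followed by &&: also visible to the child.
    print(child_sees("export CUDA_VISIBLE_DEVICES=5,6 && " + PRINT_VAR))
```

In the `&&` form the child never receives the variable, so all 8 GPUs stay visible. The prefix form and the `export` form both restrict the run to GPUs 5 and 6.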