training error, cannot start

song6299 commented 5 years ago

Hi! I tried to train on the coco_train2017 data following the steps you showed, but it raises the following error:

2019-05-15 10:23:24,814 maskrcnn_benchmark.trainer INFO: Start training
Traceback (most recent call last):
  File "/home/work/songping/anaconda3/envs/FCOS/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/work/songping/anaconda3/envs/FCOS/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/work/songping/anaconda3/envs/FCOS/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/home/work/songping/anaconda3/envs/FCOS/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/work/songping/anaconda3/envs/FCOS/bin/python', '-u', 'tools/train_net.py', '--local_rank=0', '--skip-test', '--config-file', 'configs/fcos/fcos_R_50_FPN_1x.yaml', 'DATALOADER.NUM_WORKERS', '2', 'OUTPUT_DIR', 'training_dir/fcos_R_50_FPN_1x']' died with <Signals.SIGSEGV: 11>.

Could you help me solve the problem? Thank you.

tianzhi0549 commented 5 years ago

@song6299 I suggest that you try training with a single GPU to see what happens, using the following command line:

python tools/train_net.py \
    --skip-test \
    --config-file configs/fcos/fcos_R_50_FPN_1x.yaml \
    DATALOADER.NUM_WORKERS 2 \
    OUTPUT_DIR training_dir/fcos_R_50_FPN_1x \
    SOLVER.IMS_PER_BATCH 1
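
If the single-GPU run also dies with a bare "Segmentation fault", Python's standard-library `faulthandler` module can print the Python stack that was active when the signal arrived, which helps tell which import or call triggers the crash. The following is only a minimal sketch: the helper name `enable_crash_traceback` is hypothetical, and it assumes you call it yourself near the top of `tools/train_net.py`.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=sys.stderr):
    """Dump the Python stack of every thread on fatal signals such as SIGSEGV.

    `stream` must be a real file object with a file descriptor, because
    faulthandler writes to the descriptor directly from the signal handler.
    Returns True once the handler is installed.
    """
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

The same handler can be switched on without editing any file by starting the interpreter with `python -X faulthandler tools/train_net.py ...` or by setting the `PYTHONFAULTHANDLER=1` environment variable.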

song6299 commented 5 years ago

Thank you for your reply. I ran that command, and it now fails as follows:

2019-05-15 15:47:24,208 maskrcnn_benchmark.trainer INFO: Start training
Segmentation fault

What causes the segmentation fault? Another question: I installed the environment following INSTALL.md, so why is the PyTorch version 1.1.0? Will it affect the experiment?

tianzhi0549 commented 5 years ago

@song6299 Please check https://github.com/tianzhi0549/FCOS/blob/master/TROUBLESHOOTING.md. The segmentation fault might be caused by an old GCC version. PyTorch 1.1.0 should not be the cause.
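
A quick way to see which GCC is on the PATH is to parse the output of `gcc -dumpversion`. The sketch below is illustrative only: the function names are hypothetical, and the `MIN_GCC` threshold of 4.9 is an assumption made for the example, so confirm the actual requirement in TROUBLESHOOTING.md. It also reports only the compiler currently on the PATH, which may differ from the one that built the extensions.

```python
import re
import subprocess

# Assumed threshold for illustration only; check TROUBLESHOOTING.md.
MIN_GCC = (4, 9)


def parse_gcc_version(text):
    """Turn the output of `gcc -dumpversion` ('4.8.5' or '7') into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"unrecognised GCC version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_new_enough(version, minimum=MIN_GCC):
    """Compare version tuples element by element, e.g. (4, 8, 5) < (4, 9)."""
    return version >= minimum


def installed_gcc_version(gcc="gcc"):
    """Run the compiler found on PATH and return its parsed version."""
    result = subprocess.run(
        [gcc, "-dumpversion"], check=True, capture_output=True, text=True
    )
    return parse_gcc_version(result.stdout)


if __name__ == "__main__":
    version = installed_gcc_version()
    status = "ok" if is_new_enough(version) else "older than the assumed minimum"
    print("gcc", ".".join(map(str, version)), status)
```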

song6299 commented 5 years ago

Thanks, I will install a newer GCC version.

training error, cannot start #34