reboot during training faster-rcnn with ubuntu 14.04 cuda-8.0 nvidia-driver:367.48

Today I tried training on VOC2007 for object detection with faster-rcnn on my Dell server. The server information is:

    uname -a
    Linux sem-PowerEdge-T630 4.2.0-27-generic #32~14.04.1-Ubuntu SMP Fri Jan 22 15:32:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

And the GPU information is:

    04:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1)

The VOC2007 dataset contains 9966 pictures, and each picture's size is about 300x500 pixels. When I train on this data with the faster-rcnn framework, the server reboots after 200~400 iterations. I recorded the nvidia-smi output every 0.01 seconds, and the last record before the reboot was:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 970     Off  | 0000:04:00.0     Off |                  N/A |
    | 43%   53C    P2    94W / 151W |   2042MiB /  4036MiB |     11%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0     15911    C   python                                        2040MiB |
    +-----------------------------------------------------------------------------+

I tried several times and nothing changed. Can anyone help? Could the problem be the large dataset combined with a small amount of GPU memory? The last record before the reboot shows only 2042MiB of 4036MiB in use, though, so I am confused.
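For reference, here is a minimal sketch of how the nvidia-smi polling described above could be made to survive a sudden reboot. It is not the script used for the record above. It assumes the `--query-gpu` and `--format=csv` options of nvidia-smi are available with this driver, and the field selection, the function names (`query_gpu`, `parse_sample`, `log_samples`) and the log path are hypothetical choices for illustration. Each sample is forced to disk with `os.fsync`, since otherwise the last few records may still be in the page cache when the machine resets.

```python
import os
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu`. The selection is an assumption
# about what is useful when looking into an unexpected reboot.
FIELDS = ["timestamp", "temperature.gpu", "power.draw",
          "memory.used", "memory.total", "utilization.gpu"]


def query_gpu():
    """Return one CSV line per GPU from nvidia-smi (needs the NVIDIA driver)."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    return subprocess.check_output(cmd).decode().strip()


def parse_sample(line):
    """Split one CSV line into a dict keyed by field name (values stay strings)."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def log_samples(path, query=query_gpu, count=10, interval=0.5):
    """Append `count` samples to `path`, syncing each one to disk.

    `query` is injectable so the logger can be exercised without a GPU.
    """
    with open(path, "a") as f:
        for _ in range(count):
            f.write(query() + "\n")
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to write it to disk now
            time.sleep(interval)
```

Run in a separate process next to the training job, for example `log_samples("gpu.log", count=100000, interval=0.1)`, this should leave the last synced sample in `gpu.log` after a reset, and `parse_sample` can then be used to read temperature, power draw and memory use from that final line.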

rbgirshick / py-faster-rcnn
