rykov8 / ssd_keras

Port of Single Shot MultiBox Detector to Keras
MIT License

Training halts after some iteration in first epoch #115

Open deepeshlekhak opened 6 years ago

deepeshlekhak commented 6 years ago

After some iterations in the first epoch of training, the process halts with the following output.

Epoch 1/10
/home/deepesh/Documents/ssd_traffic/ssd_utils.py:119: RuntimeWarning: divide by zero encountered in log
  assigned_priors_wh)
2017-10-14 15:53:09.118394: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-14 15:53:17.147597: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-14 15:53:18.300263: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-14 15:53:18.300311: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.10GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-14 15:53:19.177374: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-14 15:53:20.211245: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.46GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-14 15:53:20.211321: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.94GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-14 15:53:23.378445: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-14 15:53:23.378500: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.15GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-14 15:53:25.365196: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
13/14 [==========================>...] - ETA: 8s - loss: 2.9593
There is no further output or error after that line. What could be causing this problem?
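For reference, the RuntimeWarning points at the log of the encoded box widths/heights, which blows up when a ground-truth or prior box has zero width or height. Below is a minimal sketch of one possible guard, assuming an encoding of the form np.log(box_wh / assigned_priors_wh); the function and variable names are illustrative, not the actual ssd_utils.py code.

import numpy as np

EPS = 1e-8  # small floor so np.log never receives zero

def encode_wh(box_wh, assigned_priors_wh):
    # Clip degenerate (zero-sized) boxes and priors before the log encoding;
    # this avoids "divide by zero encountered in log" without changing
    # well-formed boxes.
    box_wh = np.maximum(box_wh, EPS)
    assigned_priors_wh = np.maximum(assigned_priors_wh, EPS)
    return np.log(box_wh / assigned_priors_wh)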

zhongyi-zhou commented 6 years ago

You could lower your batch size.
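A minimal sketch of that, assuming a Keras 1.x/2.x setup with the TensorFlow 1.x backend: lower the batch size you pass to the data generator / fit_generator call, and optionally let TensorFlow allocate GPU memory on demand instead of grabbing it all up front. The batch_size value is just a starting point, not a recommendation specific to this repo.

import tensorflow as tf
from keras import backend as K

# Allocate GPU memory on demand rather than reserving the whole card at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))

# Halve the batch size until the bfc_allocator warnings disappear;
# pass this value to the data generator and the fit_generator call.
batch_size = 8

If the process still hangs rather than raising an out-of-memory error, reducing the batch size is usually the more effective of the two changes.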