Open anubhayal opened 7 years ago
Try reducing the minibatch size.
Thanks for your suggestion. I will try reducing the minibatch size and report back. One thing I want to bring to your notice, though: I previously trained the model with this same configuration, and it succeeded.
So why is it giving a CUDA error this time?
@anubhayal can we get some nvidia-smi outputs captured over the course of training?
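One way to collect those readings is to poll nvidia-smi from a small side script while training runs in another console. The following is only a minimal sketch: the helper names (`parse_memory_line`, `log_gpu_memory`), the log file name, and the polling interval are hypothetical, and it assumes `nvidia-smi` is on the PATH and reports numeric memory values for the GPU.

```python
import subprocess
import time

# Query flags supported by nvidia-smi; "nounits" drops the "MiB" suffix.
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Parse one CSV line such as '2018/01/01 12:00:00.000, 3050, 4096'."""
    timestamp, used, total = [field.strip() for field in line.split(",")]
    return timestamp, int(used), int(total)


def log_gpu_memory(interval_s=30.0, samples=10, logfile="gpu_memory.log"):
    """Append GPU memory readings to a log file (hypothetical helper)."""
    with open(logfile, "a") as out:
        for _ in range(samples):
            output = subprocess.check_output(QUERY, universal_newlines=True)
            # nvidia-smi prints one line per GPU, in device order.
            for gpu_index, line in enumerate(output.strip().splitlines()):
                ts, used, total = parse_memory_line(line)
                out.write("{} GPU{} {}/{} MiB\n".format(ts, gpu_index, used, total))
            out.flush()
            time.sleep(interval_s)
```

Calling `log_gpu_memory()` in a second console for the duration of an epoch would show whether memory use climbs steadily during training or only jumps when the eval model is created after the checkpoint.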
I was trying to train a model using the FasterRCNN script, with AlexNet as the base model, on a larger dataset of images. After completing 16 epochs of training, it failed with a "CUDA failure 2: out of memory" error.
Please have a look at the following console output and suggest how I can avoid this error.
Finished Epoch[16 of 20]: [Training] loss = 0.348722 18295, metric = 2.15% 18295 14661.193s ( 1.2 samples/s);
creating checkpoint file 16 as mModel_checkpointed_16.dnn.ckp at E:/retrain/PretrainedModels
creating eval model
CUDA failure 2: out of memory ; GPU=0 ; hostname=L-156141402 ; expr=cudaMalloc((void) &deviceBufferPtr, sizeof(AllocatedElemType) AsMultipleOf(numElements, 2))
Traceback (most recent call last):
File "run_faster_rcnn.py", line 35, in
trained_model = train_faster_rcnn(cfg)
File "E:\retrain\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 302, in train_faster_rcnn
eval_model = train_faster_rcnn_e2e(cfg)
File "E:\retrain\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 341, in train_faster_rcnn_e2e
e2e_lr_per_sample_scaled, mm_schedule, cfg["CNTK"].L2_REG_WEIGHT, cfg["CNTK"].E2E_MAX_EPOCHS, cfg)
File "E:\retrain\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 591, in train_model
i_model = create_faster_rcnn_eval_model(loss, image_input, dims_input, cfg)
File "E:\retrain\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 219, in create_faster_rcnn_eval_model
roi_fc_layers = clone_model(model, [last_conv_node_name, "rpn_target_rois"], ["cls_score", "bbox_regr"], CloneMethod.freeze)
File "E:\retrain\Examples\Image\Detection\FasterRCNN..\FastRCNN\FastRCNN_train.py", line 151, in clone_model
cloned_net = combine(to_nodes).clone(clone_method, input_placeholders)
File "C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py35\lib\site-packages\cntk\internal\swig_helper.py", line 69, in wrapper
result = f(*args, **kwds)
File "C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py35\lib\site-packages\cntk\ops\functions.py", line 562, in clone
return super(Function, self).clone(method, substitutions)
File "C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py35\lib\site-packages\cntk\cntk_py.py", line 1905, in clone
return _cntk_py.Function_clone(self, *args)
RuntimeError: CUDA failure 2: out of memory ; GPU=0 ; hostname=L-156141402 ; expr=cudaMalloc((void*) &deviceBufferPtr, sizeof(AllocatedElemType) AsMultipleOf(numElements, 2))
[CALL STACK]