microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Cuda failure 2: out of memory #2607

Open anubhayal opened 7 years ago

anubhayal commented 7 years ago

I was training a model with the FasterRCNN script, using AlexNet as the base model, on a larger image dataset. After completing 16 epochs of training, it failed with a "CUDA failure 2: out of memory" error.

Please have a look at the console output below and suggest how I can avoid this.

Finished Epoch[16 of 20]: [Training] loss = 0.348722 * 18295, metric = 2.15% * 18295 14661.193s ( 1.2 samples/s);
creating checkpoint file 16 as mModel_checkpointed_16.dnn.ckp at E:/retrain/PretrainedModels
creating eval model
CUDA failure 2: out of memory ; GPU=0 ; hostname=L-156141402 ; expr=cudaMalloc((void**) &deviceBufferPtr, sizeof(AllocatedElemType) * AsMultipleOf(numElements, 2))
Traceback (most recent call last):
  File "run_faster_rcnn.py", line 35, in <module>
    trained_model = train_faster_rcnn(cfg)
  File "E:\retrain\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 302, in train_faster_rcnn
    eval_model = train_faster_rcnn_e2e(cfg)
  File "E:\retrain\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 341, in train_faster_rcnn_e2e
    e2e_lr_per_sample_scaled, mm_schedule, cfg["CNTK"].L2_REG_WEIGHT, cfg["CNTK"].E2E_MAX_EPOCHS, cfg)
  File "E:\retrain\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 591, in train_model
    i_model = create_faster_rcnn_eval_model(loss, image_input, dims_input, cfg)
  File "E:\retrain\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 219, in create_faster_rcnn_eval_model
    roi_fc_layers = clone_model(model, [last_conv_node_name, "rpn_target_rois"], ["cls_score", "bbox_regr"], CloneMethod.freeze)
  File "E:\retrain\Examples\Image\Detection\FasterRCNN\..\FastRCNN\FastRCNN_train.py", line 151, in clone_model
    cloned_net = combine(to_nodes).clone(clone_method, input_placeholders)
  File "C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py35\lib\site-packages\cntk\internal\swig_helper.py", line 69, in wrapper
    result = f(*args, **kwds)
  File "C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py35\lib\site-packages\cntk\ops\functions.py", line 562, in clone
    return super(Function, self).clone(method, substitutions)
  File "C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py35\lib\site-packages\cntk\cntk_py.py", line 1905, in clone
    return _cntk_py.Function_clone(self, *args)
RuntimeError: CUDA failure 2: out of memory ; GPU=0 ; hostname=L-156141402 ; expr=cudaMalloc((void**) &deviceBufferPtr, sizeof(AllocatedElemType) * AsMultipleOf(numElements, 2))

[CALL STACK]

  • Microsoft::MSR::CNTK::CudaTimer:: Stop
  • Microsoft::MSR::CNTK::CudaTimer:: Stop (x2)
  • Microsoft::MSR::CNTK::GPUMatrix::GPUMatrix
  • Microsoft::MSR::CNTK::MatrixComputeStreamEvent:: Create
  • Microsoft::MSR::CNTK::Matrix::Matrix
  • CNTK::NDArrayView::WritableDataBuffer
  • CNTK::NDArrayView::NDArrayView
  • CNTK::NDArrayView:: NDArrayView
  • CNTK:: MPICommunicator
  • CNTK::NDArrayView:: DeepClone
  • CNTK:: Clip (x5)

frankibem commented 7 years ago

Try reducing the minibatch size.
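
One way to apply this suggestion without guessing a size up front is to retry with progressively smaller minibatches whenever the out-of-memory RuntimeError appears. The following is only a rough sketch: `run_with_smaller_minibatches` and `fake_train` are hypothetical names, not part of CNTK or the FasterRCNN example, and how the minibatch size is actually passed to your training code is an assumption you would need to adapt. It relies only on the fact, visible in the traceback above, that the failure surfaces in Python as a RuntimeError whose message contains "out of memory".

```python
def run_with_smaller_minibatches(train_fn, minibatch_size, min_size=1):
    """Call train_fn(size), halving size after every out-of-memory RuntimeError.

    train_fn is a hypothetical callable that runs training with the given
    minibatch size. Returns (size_that_worked, result_of_train_fn).
    """
    size = minibatch_size
    while size >= min_size:
        try:
            return size, train_fn(size)
        except RuntimeError as err:
            # Only swallow out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err):
                raise
            print("Out of memory at minibatch size {}; halving".format(size))
            size //= 2
    raise RuntimeError(
        "out of memory even at minibatch size {}".format(min_size))


def fake_train(size):
    """Stand-in for real training: pretends sizes above 2 do not fit."""
    if size > 2:
        raise RuntimeError("CUDA failure 2: out of memory")
    return "trained with {}".format(size)


print(run_with_smaller_minibatches(fake_train, 8))
```

In a real run it may be safer to restart the Python process between attempts, since GPU memory held by the failed attempt is not necessarily released.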

anubhayal commented 6 years ago

Thanks for your suggestion. I will try reducing the minibatch size and report back. One thing I want to point out, though: I previously trained a model with this same configuration, and it completed successfully.

So why is it giving a CUDA error this time?

FDecaYed commented 6 years ago

@anubhayal can we get some nvidia-smi output captured at intervals over the course of training?
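
A minimal sketch of one way to collect that output: poll nvidia-smi at a fixed interval from a second console while training runs, and append the memory readings to a log file. The function names, the log file name, and the 60-second interval are illustrative assumptions. The script uses nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options and assumes `nvidia-smi` is on the PATH and reports numeric memory values.

```python
import subprocess
import time

# Ask nvidia-smi for one CSV row per GPU: timestamp, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_row(row):
    """Turn 'timestamp, used, total' into (str, int, int)."""
    timestamp, used, total = [field.strip() for field in row.split(",")]
    return timestamp, int(used), int(total)


def sample_gpu_memory():
    """Run nvidia-smi once and return one parsed tuple per GPU."""
    out = subprocess.check_output(QUERY, universal_newlines=True)
    return [parse_row(line) for line in out.splitlines() if line.strip()]


def log_gpu_memory(path, interval_s=60, max_samples=None):
    """Append readings to path every interval_s seconds.

    Runs until interrupted unless max_samples is given.
    """
    taken = 0
    with open(path, "a") as log:
        while max_samples is None or taken < max_samples:
            # index is the row order in which nvidia-smi lists the GPUs.
            for index, (ts, used, total) in enumerate(sample_gpu_memory()):
                log.write("{},{},{},{}\n".format(ts, index, used, total))
            log.flush()
            taken += 1
            time.sleep(interval_s)


# Example invocation (start this before launching training):
# log_gpu_memory("gpu_memory_log.csv", interval_s=60)
```

A log like this shows whether used memory climbs steadily across epochs or jumps only at the point where the eval model is created.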