microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Fixing "CUDA failure 2:Out of memory..." issue without rebooting the computer. #1769

Closed min6434 closed 7 years ago

min6434 commented 7 years ago

I tried to increase the size of the convolution kernel, and CNTK reported a CUDA failure error. After I resized the kernel, the result is still an out-of-memory error, even though nvidia-smi shows no memory usage. How do I fix this problem?
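To compare what nvidia-smi reports before and after a failed run, the check can be scripted. The following is a minimal Python sketch, assuming `nvidia-smi` is on the PATH and supports the `--query-gpu` and `--format=csv` options; the function names `parse_gpu_memory` and `query_gpu_memory` are hypothetical helpers, not part of CNTK.

```python
import subprocess

# Command line for a machine-readable memory report (values in MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' CSV rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {"index": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return one dict per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` once before training and once after the failure gives two snapshots to compare. The parser assumes every field is an integer; some configurations may report a non-numeric value, which this sketch does not handle.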


CUDA failure 2: out of memory ; GPU=0 ; hostname=MSI ; expr=cudaMalloc((void) &deviceBufferPtr, sizeof(AllocatedElemType) * numElements)
Traceback (most recent call last):
  File "D:\Source\Repos\CNTK_sources\CNTK_Breast\CNNbreast\CNNbreast\CNNbreast.py", line 38, in
    model_func=create_basic_model_with_batch_normalization)
  File "D:\Source\Repos\CNTK_sources\CNTK_Breast\CNNbreast\CNNbreast\CNNFunctions.py", line 213, in train_and_evaluate
    trainer.train_minibatch(data) # update model with it
  File "H:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py35\lib\site-packages\cntk\train\trainer.py", line 167, in train_minibatch
    arguments, device)
  File "H:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py35\lib\site-packages\cntk\cntk_py.py", line 2248, in train_minibatch_overload_for_minibatchdata
    return _cntk_py.Trainer_train_minibatch_overload_for_minibatchdata(self, args)
RuntimeError: CUDA failure 2: out of memory ; GPU=0 ; hostname=MSI ; expr=cudaMalloc((void) &deviceBufferPtr, sizeof(AllocatedElemType) * numElements)
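The failing expression requests `sizeof(AllocatedElemType) * numElements` bytes from `cudaMalloc`. As a rough illustration of how that product grows with the kernel size, here is a small Python sketch. The shapes (3x3 and 11x11 kernels, 64 input and 128 output channels) and the 4-byte element size are assumptions chosen for the example, not values taken from the model in this issue, and weight buffers are only one part of what a training run allocates.

```python
def buffer_bytes(num_elements, elem_size=4):
    """Bytes for a dense buffer: elem_size * num_elements (4 bytes = float32)."""
    return elem_size * num_elements


def conv_weight_elements(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in a 2D convolution kernel tensor."""
    return kernel_h * kernel_w * in_channels * out_channels


# Hypothetical layer shapes, for illustration only.
small = buffer_bytes(conv_weight_elements(3, 3, 64, 128))
large = buffer_bytes(conv_weight_elements(11, 11, 64, 128))

print("3x3 kernel weights:  ", small, "bytes")
print("11x11 kernel weights:", large, "bytes")
print("growth factor:        %.2f" % (large / small))
```

The weight count scales with the kernel area, so going from 3x3 to 11x11 multiplies this one buffer by 121/9.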

ghost commented 7 years ago

You can find some solutions on Stack Overflow ("Resetting GPU after CUDA error"). To reset the graphics stack in Windows, press Win+Ctrl+Shift+B.

In general, try searching for the solution first. This is not CNTK-specific, although resetting GPUs via CUDA, where supported, could be a nice CNTK feature. Related links: "Manually resetting GFX drivers in Windows" and "Resetting GPU without reboot".

Another possibility is to write a small C/CUDA application that calls the cudaDeviceReset API. Or download and experiment (at your own risk) with DevCon, the Windows Device Console, which could let you disable and re-enable the graphics driver (thus resetting the driver/GPU).
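As a rough sketch of the cudaDeviceReset idea, the same two runtime calls can be made from Python through `ctypes` instead of a compiled C program. This is a swapped-in approach, not the C application described above. The library names tried in `load_cudart` are guesses that depend on the installed CUDA version, and both function names are hypothetical helpers. Note that the CUDA runtime documents `cudaDeviceReset` as cleaning up the device resources of the calling process, so whether it clears the state described in this issue is not confirmed here.

```python
import ctypes
import ctypes.util


def load_cudart(candidates=("cudart", "cudart64_80", "cudart64_90")):
    """Try to locate the CUDA runtime library; the candidate names are assumptions."""
    for name in candidates:
        path = ctypes.util.find_library(name)
        if path:
            return ctypes.CDLL(path)
    raise OSError("CUDA runtime library not found; load it by full path instead")


def reset_device(cudart, device_id=0):
    """Select a device, then reset it. Returns the CUDA error code (0 = cudaSuccess)."""
    status = cudart.cudaSetDevice(ctypes.c_int(device_id))
    if status != 0:
        # Selecting the device failed, so do not attempt the reset.
        return status
    return cudart.cudaDeviceReset()
```

A typical call would be `reset_device(load_cudart(), 0)`, checking that the returned code is 0.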

min6434 commented 7 years ago

I tried to find the answer but maybe my keywords were wrong. Sorry for the dumb question and thank you very much for the nice answer.

JanKrivanek commented 6 years ago

Is there any other reason for this error to occur, apart from low GPU memory or low available physical RAM (or too small a contiguous block)? We have started seeing this recently despite having plenty of free RAM and low GPU memory utilization. Any ideas on how to root-cause this?

evo11x commented 3 years ago

I have the same problem with an RTX 3060 12GB if the input data is larger than 512. With an RTX 2060 6GB I don't get this error, using the same driver on both cards. If I reduce the input layer to below 512, I no longer get the error on the RTX 3060.

Loading data...
Using device: GPU[0] GeForce RTX 3060

About to throw exception 'CUBLAS failure 13: CUBLAS_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=PC1; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)'
CUBLAS failure 13: CUBLAS_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=PC1; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)

Unhandled Exception: System.ApplicationException: CUBLAS failure 13: CUBLAS_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=PC1; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)

[CALL STACK]

Microsoft::MSR::CNTK::TensorView:: Reshaped

  • Microsoft::MSR::CNTK::CudaTimer:: Stop
  • Microsoft::MSR::CNTK::GPUMatrix:: MultiplyAndWeightedAdd
  • Microsoft::MSR::CNTK::Matrix:: MultiplyAndWeightedAdd
  • Microsoft::MSR::CNTK::TensorView:: DoMatrixProductOf
  • Microsoft::MSR::CNTK::TensorView:: AssignMatrixProductOf
  • std::enable_shared_from_this:: shared_from_this (x3)
  • CNTK::Internal:: UseSparseGradientAggregationInDataParallelSGD
  • CNTK:: CreateTrainer
  • CNTK::Trainer:: TotalNumberOfUnitsSeen
  • CNTK::Trainer:: TrainMinibatch (x2)
  • CSharp_CNTK_TrainerTrainMinibatchSWIG_2
  • 00007FFF157C5E45 (SymFromAddr() error: The specified module could not be found.)