microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

CUDA failure 2: out of memory when training model with keras #3040

Open bmigette opened 6 years ago

bmigette commented 6 years ago

I am trying to train a model using Keras and CNTK 2.4. Every time I call the fit function, I get a CUDA out-of-memory error.

My GPU has 11 GB of RAM, and when it crashes, not even 1.5 GB is in use.

My network is a simple single-layer LSTM with 5000 units:

from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

def LSTM_5000(shape1, shape2, shapeout):
    # shape1 = timesteps, shape2 = features per timestep
    model = Sequential()
    model.add(LSTM(5000, dropout=0.2, recurrent_dropout=0.2,
                   input_shape=(shape1, shape2)))
    model.add(Dense(shapeout))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model
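For scale: a single Keras LSTM layer holds four gates, each with an `(input_dim + units) × units` kernel plus bias, so 5000 units alone produce on the order of 100 million parameters. A back-of-the-envelope sketch (the issue never states `shape2`, so the 100-feature input below is an assumption):

```python
def lstm_param_count(units, input_dim):
    """Trainable parameters of one Keras LSTM layer:
    4 gates, each with an (input_dim + units) x units kernel plus a bias of size units."""
    return 4 * (units * (input_dim + units) + units)

# 100 input features is a hypothetical value; shape2 is not given in the issue.
params = lstm_param_count(5000, 100)
weights_mb = params * 4 / 1024**2  # float32 weights alone, in MB
```

With gradients plus Adam's two moment buffers, the training footprint is roughly four times the weight size before activations are counted, which is why a large allocation can fail even when reported usage looks low.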

Full error:


Using CNTK backend
Selected GPU[0] GeForce GTX 1080 Ti as the process wide default device.
...
C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\cntk\core.py:361: UserWarning: your data is of type "float64", but your input variable (uid "Input664") expects "<class 'numpy.float32'>". Please convert your data beforehand to speed up training.
  (sample.dtype, var.uid, str(var.dtype)))
C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\cntk\core.py:361: UserWarning: your data is of type "float64", but your input variable (uid "Input4") expects "<class 'numpy.float32'>". Please convert your data beforehand to speed up training.
    ...

CUDA failure 2: out of memory ; GPU=0 ; hostname=XXX-PC ; expr=cudaMalloc((void**) &deviceBufferPtr, sizeof(AllocatedElemType) * AsMultipleOf(numElements, 2))
    Traceback (most recent call last):
      File ".\run_tests.py", line 181, in <module>
        runtest(settings,period+'_p'+str(j)+'_'+models_name[i])
      File ".\run_tests.py", line 138, in runtest
        verbose=1, shuffle=False)
      File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\models.py", line 963, in fit
        validation_steps=validation_steps)
      File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\engine\training.py", line 1712, in fit
        validation_steps=validation_steps)
      File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\engine\training.py", line 1235, in _fit_loop
        outs = f(ins_batch)
      File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\backend\cntk_backend.py", line 1823, in __call__
        input_dict, self.trainer_output)
      File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\cntk\train\trainer.py", line 171, in train_minibatch
        output_map, device)
      File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\cntk\cntk_py.py", line 2978, in train_minibatch
        return _cntk_py.Trainer_train_minibatch(self, *args)
    RuntimeError: CUDA failure 2: out of memory ; GPU=0 ; hostname=XXX-PC ; expr=cudaMalloc((void**) &deviceBufferPtr, sizeof(AllocatedElemType) * AsMultipleOf(numElements, 2))

    [CALL STACK]
        > Microsoft::MSR::CNTK::CudaTimer::  Stop
        - Microsoft::MSR::CNTK::CudaTimer::  Stop (x2)
        - Microsoft::MSR::CNTK::GPUMatrix<float>::  Resize
        - Microsoft::MSR::CNTK::Matrix<float>::  Resize
        - std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>
        - std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::  shared_from_this (x3)
        - CNTK::Internal::  UseSparseGradientAggregationInDataParallelSGD
        - std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::  shared_from_this
        - CNTK::Internal::  UseSparseGradientAggregationInDataParallelSGD
        - CNTK::Function::  Forward
        - CNTK::  UniversalLearner
        - CNTK::TrainingParameterSchedule<unsigned __int64>::  Transform
        - CNTK::Trainer::  TotalNumberOfUnitsSeen
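Separately from the OOM, the UserWarnings in the log flag a float64-to-float32 mismatch. A minimal sketch of casting the data once before calling fit (array names here are illustrative, not from the issue):

```python
import numpy as np

def to_float32(*arrays):
    """Cast inputs to float32 up front, silencing the CNTK warning
    and halving host-side memory for the copies."""
    return [np.asarray(a, dtype=np.float32) for a in arrays]

x = np.random.rand(4, 10, 3)                          # float64 by default
y = np.random.randint(0, 2, (4, 2)).astype(np.float64)
x32, y32 = to_float32(x, y)
```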
ke1337 commented 6 years ago

CUDA reports OOM when it tries to allocate a buffer larger than the remaining free memory. Have you tried the same model with other backends? If they don't hit OOM, how much memory do they use?
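One way to answer the memory question is to poll nvidia-smi while training runs in another process. A minimal sketch, assuming `nvidia-smi` is on PATH (the parsing helper is illustrative, not part of CNTK):

```python
import subprocess

def parse_used_mib(smi_output):
    """Parse lines like '1432 MiB' from the CSV output of
    nvidia-smi --query-gpu=memory.used --format=csv,noheader"""
    return [int(line.split()[0]) for line in smi_output.strip().splitlines()]

def gpu_memory_used_mib():
    """Query the driver for per-GPU used memory, in MiB."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        universal_newlines=True)
    return parse_used_mib(out)
```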

bmigette commented 6 years ago

I've tried with TensorFlow; it used approximately 9 GB out of 11. Maybe the model I use is too big for GPU/CNTK then?

ke1337 commented 6 years ago

There is some overhead in the CNTK Keras backend. Please try saving the model in CNTK format and evaluating it with CNTK functions directly.

bmigette commented 6 years ago

I'm not sure how to do that... In any case, I am fine using TensorFlow for fitting, and either CNTK or TensorFlow for predicting later. Note: I tried to save the trained model from Keras using C.combine(model.model.outputs).save(name + ".cntkmodel"), but it gave me an error that it could not convert a list to CNTK::Variable, if I recall correctly.

ke1337 commented 6 years ago

Can you save it to Keras format first and then load it, like this?

bmigette commented 6 years ago

Nope, I'm getting the same error... Should I change to the CNTK backend before combining?

>>> keras_model = load_model('60_p0_LSTM_256_128_128_128_128.h5')
>>> C.combine(keras_model.model.outputs).save('my_cntk_model')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\cntk\internal\swig_helper.py", line 69, in wrapper
    result = f(*args, **kwds)
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\cntk\ops\__init__.py", line 82, in combine
    return combine(operands_unfold, name)
TypeError: cannot convert list element to CNTK::Variable
ke1337 commented 6 years ago

Yes, please load the Keras model with the CNTK backend.
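The key detail is that Keras reads `KERAS_BACKEND` once, before its first import, so the backend must be selected up front. A minimal sketch of the workflow (the `cntk_model_path` helper is hypothetical; the conversion lines are commented out because they need keras and cntk installed):

```python
import os

# Select the CNTK backend *before* keras is imported anywhere;
# keras reads this environment variable once at import time.
os.environ["KERAS_BACKEND"] = "cntk"

def cntk_model_path(h5_path):
    """Hypothetical helper: derive a .cntkmodel filename from a .h5 one."""
    base, _ = os.path.splitext(h5_path)
    return base + ".cntkmodel"

# The conversion itself (requires keras with the CNTK backend, and cntk):
#   from keras.models import load_model
#   import cntk as C
#   keras_model = load_model("60_p0_LSTM_256_128_128_128_128.h5")
#   C.combine(keras_model.model.outputs).save(
#       cntk_model_path("60_p0_LSTM_256_128_128_128_128.h5"))
```

Loading under the TensorFlow backend and then calling `C.combine` on the outputs is what produces the "cannot convert list element to CNTK::Variable" error above, since the outputs are not CNTK variables in that case.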

ahmadalli commented 5 years ago

Any update on this? I'm getting the same error with an LSTM model, while the TensorFlow backend works fine.

ahmadalli commented 5 years ago

What I've found is that TensorFlow can use shared GPU memory, while CNTK only uses dedicated GPU memory.