Out of memory when training for k-fold Cross Validation #455

Open pikaplan opened 7 years ago

pikaplan commented 7 years ago


As advised in #187, I tried calling fit() multiple times to train a different model in each fold, but this leads to an out-of-GPU-memory error, apparently the same as in #248. I am using the AlexNet provided in your examples with a small dataset of 120 test and 378 training images for 8 classes in each fold.

# Trains 10 models in 10 folds
for fold_index in range(0,10):
    X = fold.TrainInputs
    Y = fold.TrainOutputs
    TX = fold.TestInputs
    TY = fold.TestOutputs

    # Prepares the names of the active experiment's fold
    sRunID = 'run_' + sExperimentName + '_%02i' % (fold_index + 1)
    sModelFileName = 'Model_' + sExperimentName + '_model_%02i.nn' % (fold_index + 1)

    # Trains the fold and saves the model
    with tf.Graph().as_default(), tf.Session() as sess:

        print('\r\nInitializing model')
        model = tflearn.DNN(network, checkpoint_path='checkpoint_' + sExperimentName,max_checkpoints=1, tensorboard_verbose=2)

        print('\r\nTraining model for fold %02i' % (fold_index + 1))
        model.fit(X, Y, n_epoch=3, validation_set=(TX,TY), shuffle=False,
                  show_metric=True, batch_size=60, snapshot_step=3,
                  snapshot_epoch=False, run_id=sRunID)

        print('\r\nSaving model for fold %02i' % (fold_index + 1))
        model.save(... + sModelFileName)  # the leading path argument is missing in the original post

When I move the model initialization (network=AlexNet() and model=tflearn.DNN) before the start of the for loop, presumably the solution from #248, the resources are not exhausted. However, the model weights then persist into the next fold, so each fold fine-tunes the previous model instead of training independently, as 10-fold cross-validation requires. Is there a way to reset the model weights before each fit(), or to flush the old model/trainer from GPU memory?

Thank you in advance for your reply; if I come up with a solution, I will gladly contribute it.

PS: The error at the second iteration is the following:

tensorflow/core/common_runtime/] Ran out of memory trying to allocate 3.75MiB. See logs for memory state.
tensorflow/core/framework/] Resource exhausted: OOM when allocating tensor with shape[60,256,8,8]
tensorflow/core/client/] OOM when allocating tensor with shape[4096,4096]
[[Node: Momentum/gradients/nn_fc2/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Dropout/cond/Merge, Momentum/gradients/nn_fc2/Tanh_grad/TanhGrad)]]
[[Node: Momentum/clip_by_global_norm/Momentum/clip_by_global_norm/_13/_70 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_504_Momentum/clip_by_global_norm/Momentum/clip_by_global_norm/_13", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

aymericdamien commented 7 years ago

You can load/save model at every loop to keep weights. Also note that DNN has its own session, so you do not need to run

Also this is not needed, because you already encapsulate your graph:
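For readers following the thread, here is a rough sketch of one way to train an independent model per fold while relying on the fact that DNN owns its session. It is not the snippet from the comment above. It assumes TensorFlow 1.x and tflearn, a hypothetical `build_network` callable (for example the AlexNet() builder from the examples), and fold objects with the TrainInputs/TrainOutputs/TestInputs/TestOutputs attributes used in the original post. Whether this is enough to avoid the OOM on a given GPU is not confirmed here.

```python
def train_one_fold(fold, build_network, run_id, n_epoch=3, batch_size=60):
    """Build, train and evaluate a fresh model for a single fold."""
    # Local imports so this module can be loaded without TensorFlow installed.
    import tensorflow as tf  # assumption: TensorFlow 1.x graph API
    import tflearn

    # A new graph per fold means new, freshly initialised weights.
    with tf.Graph().as_default():
        network = build_network()      # hypothetical builder, e.g. AlexNet()
        model = tflearn.DNN(network)   # DNN creates its own session
        model.fit(fold.TrainInputs, fold.TrainOutputs, n_epoch=n_epoch,
                  validation_set=(fold.TestInputs, fold.TestOutputs),
                  show_metric=True, batch_size=batch_size, run_id=run_id)
        score = model.evaluate(fold.TestInputs, fold.TestOutputs)
        # Close the session owned by this fold's model before moving on.
        model.session.close()
    return score


def cross_validate(folds, train_fold):
    """Call train_fold(fold, run_id) once per fold and collect the results."""
    return [train_fold(fold, 'run_%02i' % (index + 1))
            for index, fold in enumerate(folds)]
```

A call could look like `cross_validate(folds, lambda fold, run_id: train_one_fold(fold, AlexNet, run_id))`, where `folds` and `AlexNet` come from your own code. Keeping the network construction inside the per-fold graph is what prevents weights from carrying over between folds.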

shumailaahmed commented 3 years ago

I was running into the same problem. What worked for me was keeping the code clean, with less reallocation of data subsets into different variables inside the loop, and deleting variables and cleaning up memory at the end of each k-fold iteration, something like this:

# in imports
import gc
# for loop
    del fold
    del X
    del Y
    del TX
    del TY
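A minimal, self-contained sketch of the cleanup pattern described above is given here. `FoldData`, `run_folds` and `on_fold` are hypothetical names used only for illustration. Note that `del` plus `gc.collect()` releases Python objects in host memory; on its own it is not guaranteed to release GPU memory held by a TensorFlow session.

```python
import gc
import weakref


class FoldData:
    """Hypothetical stand-in for the arrays belonging to one fold."""

    def __init__(self, size):
        self.train_inputs = [0.0] * size


def run_folds(n_folds, on_fold, size=1000):
    """Create each fold, hand it to on_fold, then drop it and collect garbage."""
    for fold_index in range(n_folds):
        fold = FoldData(size)
        on_fold(fold_index, fold)
        # Drop the loop's reference, then force a collection so that objects
        # kept alive only by reference cycles are reclaimed as well.
        del fold
        gc.collect()


if __name__ == '__main__':
    refs = []
    run_folds(3, lambda index, fold: refs.append(weakref.ref(fold)))
    # Each weak reference is dead once its fold has been collected.
    print([ref() is None for ref in refs])
```

The weak references are only there to observe that each fold object is actually gone after its iteration; real training code would not need them.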