
Out of memory when training for k-fold Cross Validation #455

pikaplan commented 7 years ago

Hi,

As advised in #187, I tried calling fit() multiple times to train a different model in each fold, but this leads to an out-of-GPU-memory error, apparently the same as in #248. I am using the AlexNet provided in your examples with a small dataset of 378 training and 120 test images across 8 classes in each fold.

import tensorflow as tf
import tflearn

# Trains 10 models in 10 folds
for fold_index in range(10):
    fold = Folds[fold_index]
    X = fold.TrainInputs
    Y = fold.TrainOutputs
    TX = fold.TestInputs
    TY = fold.TestOutputs

    # Prepares the names of the active experiment's fold
    sRunID = 'run_' + sExperimentName + '_%02i' % (fold_index + 1)
    sModelFileName = 'Model_' + sExperimentName + '_model_%02i.nn' % (fold_index + 1)

    # Trains the fold and saves the model
    tf.reset_default_graph()
    with tf.Graph().as_default(), tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        tflearn.config.init_training_mode()

        print('\r\nInitializing model')
        network = AlexNet()
        model = tflearn.DNN(network, checkpoint_path='checkpoint_' + sExperimentName,
                            max_checkpoints=1, tensorboard_verbose=2)

        print('\r\nTraining model for fold %02i' % (fold_index + 1))
        model.fit(X, Y, n_epoch=3, validation_set=(TX, TY), shuffle=False,
                  show_metric=True, batch_size=60, snapshot_step=3,
                  snapshot_epoch=False, run_id=sRunID)

        print('\r\nSaving model for fold %02i' % (fold_index + 1))
        model.save(sModelFolder + sModelFileName)

When I place the model initialization (network = AlexNet() and model = tflearn.DNN(...)) before the start of the for loop, presumably the solution of #248, the resources are not exhausted. However, the model weights then persist into the next fold, so we get fine-tuning instead of the independent training that 10-fold validation requires. Is there a way to reset the model weights before each fit(), or to flush the old model/trainer from GPU memory?
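For example, something along these lines is what I have in mind (a rough, untested sketch; I do not know whether tflearn actually supports reinitializing through the model's internal session):

# Hypothetical: with the model built once before the loop, rerun the
# variable initializer on the DNN's own session at the start of each fold,
# so the weights return to fresh random values without rebuilding the graph.
model.session.run(tf.initialize_all_variables())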

Thank you in advance for your reply; if I come up with a solution, I would gladly contribute it.

P.S.: The error at the second iteration is the following:

tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 3.75MiB. See logs for memory state.
tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[60,256,8,8]
tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[4096,4096]
     [[Node: Momentum/gradients/nn_fc2/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Dropout/cond/Merge, Momentum/gradients/nn_fc2/Tanh_grad/TanhGrad)]]
     [[Node: Momentum/clip_by_global_norm/Momentum/clip_by_global_norm/_13/_70 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_504_Momentum/clip_by_global_norm/Momentum/clip_by_global_norm/_13", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op u'Momentum/gradients/nn_fc2/MatMul_grad/MatMul_1', defined at:
  File "/home/researcher/tens/MyAlexNet.py", line 140, in <module>
    model = tflearn.DNN(network, checkpoint_path='checkpoint_' + sExperimentName, max_checkpoints=1, tensorboard_verbose=2)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tflearn/models/dnn.py", line 57, in __init__
    session=session)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tflearn/helpers/trainer.py", line 111, in __init__
    clip_gradients)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tflearn/helpers/trainer.py", line 566, in initialize_training_ops
    self.grad = tf.gradients(total_loss, self.train_vars)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/ops/gradients.py", line 478, in gradients
    in_grads = _AsList(grad_fn(op, *out_grads))
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/ops/math_grad.py", line 637, in _MatMulGrad
    math_ops.matmul(op.inputs[0], grad, transpose_a=True))
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 1346, in matmul
    name=name)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1271, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'nn_fc2/MatMul', defined at:
  File "/home/researcher/tens/MyAlexNet.py", line 139, in <module>
    network=AlexNet()
  File "/home/researcher/tens/MyAlexNet.py", line 55, in AlexNet
    layer_fc2 = fully_connected(dropout1, 4096, activation='tanh', name='nn_fc2')
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tflearn/layers/core.py", line 147, in fully_connected
    inference = tf.matmul(inference, W)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 1346, in matmul
    name=name)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1271, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/researcher/tens/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()

aymericdamien commented 7 years ago

You can save/load the model at every loop iteration to keep the weights. Also note that DNN has its own session, so you do not need to run:

        sess.run(tf.initialize_all_variables())

Also, this is not needed, because you already encapsulate your graph:

tf.reset_default_graph()
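Applied to your loop, that amounts to something like this (a rough sketch; tflearn.config.init_training_mode() and your naming variables are omitted for brevity, and the save path is illustrative):

for fold_index in range(10):
    fold = Folds[fold_index]
    # Each fold gets its own explicitly scoped graph; DNN creates and
    # manages its own session inside it, so no manual initialization
    # and no reset_default_graph() are needed.
    with tf.Graph().as_default():
        network = AlexNet()
        model = tflearn.DNN(network, checkpoint_path='checkpoint_' + sExperimentName,
                            max_checkpoints=1, tensorboard_verbose=2)
        model.fit(fold.TrainInputs, fold.TrainOutputs, n_epoch=3,
                  validation_set=(fold.TestInputs, fold.TestOutputs),
                  show_metric=True, batch_size=60,
                  run_id='run_%02i' % (fold_index + 1))
        model.save('Model_fold_%02i.nn' % (fold_index + 1))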
shumailaahmed commented 3 years ago

I was running into the same problem. What worked for me was keeping the code clean, avoiding re-allocating data subsets into different variables inside the loop, and deleting variables and collecting garbage at the end of each k-fold iteration, something like this:

# in imports
import gc

# at the end of each fold's loop body
model.save(..)
del fold
del X
del Y
del TX
del TY
gc.collect()
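(For what it's worth, the del statements only drop the Python references; the gc.collect() call is what actually reclaims the large numpy arrays and any dead, cycle-laden graph objects between folds. Memory that TensorFlow's GPU allocator has already grabbed generally stays pooled inside the process, so the savings here are mostly host-side.)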