liyaguang / DCRNN

Implementation of Diffusion Convolutional Recurrent Neural Network in Tensorflow
MIT License

Memory leak #33

Open tanwimallick opened 5 years ago

tanwimallick commented 5 years ago

def run_epoch_generator(self, sess, model, data_generator, return_output=False,
                        training=False, writer=None):
    output_dim = self._model_kwargs.get('output_dim')
    preds = model.outputs
    labels = model.labels[..., :output_dim]
    loss = self._loss_fn(preds=preds, labels=labels)

This part of the code has a memory leak; I get an OOM error after several epochs.

liyaguang commented 5 years ago

Thanks for the information. I will investigate this issue. It would also be appreciated if you could provide more details, e.g., the error message, log, and parameters.

tanwimallick commented 5 years ago

The error message is:

2019-06-06 20:04:31.386792: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 43.75MiB. Current allocation summary follows.
2019-06-06 20:04:31.386936: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256): Total Chunks: 664, Chunks in use: 664. 166.0KiB allocated for chunks. 166.0KiB in use in bin. 8.9KiB client-requested in use in bin.

2019-06-06 20:04:31.396827: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[44800,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I plotted the memory consumption after each epoch and got the following plot (image: OOM).
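
A minimal way to record such a per-epoch memory curve, assuming the psutil package is available (this tracks the host process RSS; the GPU allocator usage shown in the log above would need a separate probe):

import os
import psutil

process = psutil.Process(os.getpid())
num_epochs = 62  # matches the 'epochs' setting below
memory_mib = []
for epoch in range(num_epochs):
    # ... run one training epoch here ...
    memory_mib.append(process.memory_info().rss / 1024 ** 2)  # resident set size in MiB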

The hyperparameter configuration was:

batch_size: 256, cl_decay_steps: 2000, filter_type: 'laplacian', horizon: 12, input_dim: 2, l1_decay: 0, max_diffusion_step: 1, num_nodes: 175, num_rnn_layers: 2, output_dim: 1, rnn_units: 64, seq_len: 12, use_curriculum_learning: True, base_lr: 0.01, epochs: 62, epsilon: 0.001, global_step: 0, lr_decay_ratio: 0.05, max_grad_norm: 9, max_to_keep: 100, min_learning_rate: 2e-06, optimizer: adagrad, patience: 50, steps: [20, 30, 40, 50], test_every_n_epochs: 10

I got the error after 30 epochs.

ivechan commented 5 years ago

Is there any solution or suggestion? :)

ivechan commented 5 years ago

It seems that the following code adds new nodes to the computation graph every epoch; because each epoch creates new nodes, the graph keeps growing.

labels = model.labels[..., :output_dim]
loss = self._loss_fn(preds=preds, labels=labels)
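
A quick way to confirm this kind of growth (a hedged check using standard TF 1.x graph APIs) is to count the graph's operations each epoch, or to finalize the graph once the model is built so that any later attempt to add nodes raises an error:

import tensorflow as tf

graph = tf.get_default_graph()
ops_before = len(graph.get_operations())
# ... run one epoch of run_epoch_generator ...
ops_after = len(graph.get_operations())
print('ops added during this epoch:', ops_after - ops_before)  # should be 0 for a static graph

# Alternatively: freeze the graph after model construction; a RuntimeError is
# raised as soon as anything (e.g. a new loss node) tries to modify it.
graph.finalize()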

A possible solution is to create the loss node in the graph during DCRNNModel initialization instead of inside run_epoch_generator.
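
A rough sketch of that restructuring (attribute names are illustrative, not the actual DCRNNModel code): build the loss once inside __init__ and expose it as a property.

class DCRNNModel(object):
    def __init__(self, loss_fn, output_dim, **model_kwargs):
        # ... existing graph construction that builds self._outputs and self._labels ...
        # Create the loss node exactly once, at graph-construction time.
        self._loss = loss_fn(preds=self._outputs,
                             labels=self._labels[..., :output_dim])

    @property
    def loss(self):
        # run_epoch_generator can then fetch model.loss instead of rebuilding it.
        return self._loss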

parkitny commented 5 years ago

Any further updates on when this fix will be added?

tanwimallick commented 5 years ago

It is better to define the loss node in the graph during DCRNNModel initialization; then, inside run_epoch_generator, model.loss and model.mae can be used.

As a quick fix, I initialized the training and testing losses separately during the initialization of DCRNNSupervisor:

preds = self._train_model.outputs
labels = self._train_model.labels[..., :output_dim]

self.preds_test = self._test_model.outputs
self.labels_test = self._test_model.labels[..., :output_dim]

self._train_loss = self._loss_fn(preds=preds, labels=labels)
self._test_loss = self._loss_fn(preds=self.preds_test, labels=self.labels_test)

Inside run_epoch_generator:

if training:
    fetches = {
        'loss': self._train_loss,
        'mae': self._train_loss,
        'global_step': tf.train.get_or_create_global_step()
    }
else:
    fetches = {
        'loss': self._test_loss,
        'mae': self._test_loss,
        'global_step': tf.train.get_or_create_global_step()
    }
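
With the loss nodes created once, the fetches can be evaluated per batch as usual. A rough sketch of the surrounding loop (the feed names model.inputs and model.labels are assumptions about this codebase):

losses = []
for x, y in data_generator:
    vals = sess.run(fetches, feed_dict={model.inputs: x, model.labels: y})
    losses.append(vals['loss'])
mean_loss = sum(losses) / len(losses)  # average loss over the epoch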

How did you plot the learned localized filters centered at different nodes (Figure 7 in the paper)? Is that code available?