Open tanwimallick opened 5 years ago
Thanks for your kind information. I will investigate this issue. Besides, it is appreciated if you can provide more information, e.g., the error message, log, parameters, etc.
The error message is:
2019-06-06 20:04:31.386792: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 43.75MiB. Current allocation summary follows.
2019-06-06 20:04:31.386936: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256): Total Chunks: 664, Chunks in use: 664. 166.0KiB allocated for chunks. 166.0KiB in use in bin. 8.9KiB client-requested in use in bin.
2019-06-06 20:04:31.396827: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[44800,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
I was trying to plot the memory consumption after each epoch. I got the following plot
The hyperparameter configuration was:
batch_size: 256, cl_decay_steps: 2000, filter_type: 'laplacian', horizon: 12, input_dim: 2, l1_decay: 0, max_diffusion_step: 1, num_nodes: 175, num_rnn_layers: 2, output_dim: 1, rnn_units: 64, seq_len: 12, use_curriculum_learning: True, base_lr: 0.01, epochs: 62, epsilon: 0.001, global_step: 0, lr_decay_ratio: 0.05, max_grad_norm: 9, max_to_keep: 100, min_learning_rate: 2e-06, optimizer: adagrad, patience: 50, steps: [20, 30, 40, 50], test_every_n_epochs: 10
I got the error after 30 epochs.
Is there any solution or suggestion? :)
It seems that the following code adds nodes to the computation graph on every epoch. Each epoch creates new nodes, so the graph keeps growing.
labels = model.labels[..., :output_dim]
loss = self._loss_fn(preds=preds, labels=labels)
A possible solution is to create the loss node in the graph during initialization of the class DCRNNModel instead of in the function run_epoch_generator.
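The growth described above can be reproduced in isolation. The following is a minimal sketch, not the DCRNN code itself: `toy_loss` is a hypothetical stand-in for `self._loss_fn`, the placeholders stand in for the model outputs and labels, and it assumes a TensorFlow version that provides `tf.compat.v1`. It counts the operations in the graph to show that rebuilding the loss per epoch adds nodes, while building it once and only fetching it does not.

```python
import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style graph API


def toy_loss(preds, labels):
    # Hypothetical stand-in for self._loss_fn: plain mean absolute error.
    return tf.reduce_mean(tf.abs(preds - labels))


graph = tf1.Graph()
with graph.as_default():
    preds = tf1.placeholder(tf.float32, shape=[None, 1], name='preds')
    labels = tf1.placeholder(tf.float32, shape=[None, 1], name='labels')

    # Leaky pattern: the loss is rebuilt on every "epoch",
    # so every call adds new operations to the graph.
    op_counts = []
    for _ in range(3):
        toy_loss(preds, labels)
        op_counts.append(len(graph.get_operations()))

    # Fixed pattern: build the loss once, then only fetch it.
    loss = toy_loss(preds, labels)
    ops_before = len(graph.get_operations())
    with tf1.Session(graph=graph) as sess:
        for _ in range(3):
            value = sess.run(loss, feed_dict={preds: [[1.0]], labels: [[3.0]]})
    ops_after = len(graph.get_operations())

print('op counts while rebuilding the loss:', op_counts)
print('op count before/after reusing the loss:', ops_before, ops_after)
```

In the leaky loop the operation count rises with each iteration; in the second loop it stays constant because `sess.run` only evaluates nodes that already exist.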
Any further updates on when this fix will be added?
It is better to define the loss node in the graph during DCRNNModel initialization. Then model.loss and model.mae can be used inside run_epoch_generator.
For a quick fix, I initialized the training and testing loss separately during the initialization of DCRNNSupervisor.
preds = self._train_model.outputs
labels = self._train_model.labels[..., :output_dim]
self.preds_test = self._test_model.outputs
self.labels_test = self._test_model.labels[..., :output_dim]
self._train_loss = self._loss_fn(preds=preds, labels=labels)
self._test_loss = self._loss_fn(preds=self.preds_test, labels=self.labels_test)
Inside run_epoch_generator:
if training:
    fetches = {
        'loss': self._train_loss,
        'mae': self._train_loss,
        'global_step': tf.train.get_or_create_global_step()
    }
else:
    fetches = {
        'loss': self._test_loss,
        'mae': self._test_loss,
        'global_step': tf.train.get_or_create_global_step()
    }
In the paper, how did you plot the learned localized filters centered at different nodes (Figure 7 in the paper)? Is that code available?
def run_epoch_generator(self, sess, model, data_generator, return_output=False, training=False, writer=None):
    output_dim = self._model_kwargs.get('output_dim')
    preds = model.outputs
    labels = model.labels[..., :output_dim]
    loss = self._loss_fn(preds=preds, labels=labels)
This part of the code has a memory leak. I get an OOM error after several epochs.
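One way to confirm where nodes are being added is to finalize the graph once all models, losses, and savers have been built. The following is a minimal sketch under that assumption, using a toy graph rather than the DCRNN supervisor: after `tf.Graph.finalize()`, any attempt to add an operation raises a `RuntimeError`, so a leak of this kind fails immediately instead of running out of memory many epochs later.

```python
import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style graph API

graph = tf1.Graph()
with graph.as_default():
    # Toy stand-ins for the model input and loss (not the DCRNN model).
    x = tf1.placeholder(tf.float32, shape=[None], name='x')
    loss = tf.reduce_sum(x)

    # Make the graph read-only once everything has been built.
    graph.finalize()

    # Creating another node now fails loudly instead of growing the graph.
    leak_detected = False
    try:
        tf.reduce_sum(x)
    except RuntimeError:
        leak_detected = True

    # Evaluating existing nodes still works on a finalized graph.
    with tf1.Session(graph=graph) as sess:
        total = sess.run(loss, feed_dict={x: [1.0, 2.0, 3.0]})

print('leak detected:', leak_detected)
print('loss value:', total)
```

In a training script, the call to `finalize()` would have to come after every node the run needs has been created, including the training op and any saver or summary ops.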