lmjohns3 / theanets

Neural network toolkit for Python
http://theanets.rtfd.org
MIT License

Layerwise GPU memory use #98

Open abramhindle opened 8 years ago

abramhindle commented 8 years ago

Hi, I have a feeling that the layerwise optimizer, by creating numerous networks, is not freeing past networks and is using more GPU memory than it should. I'm having a heck of a time doing layerwise training.

With this network:

import theanets

inputs = 4096 * 2              # 8192 input units
win_size = 2048
swin_size = win_size / 2 + 1   # 1025 (Python 2 integer division)
output_size = swin_size
hidlayersize = win_size        # defined but unused below
exp = theanets.Experiment(
    theanets.Regressor,
    layers=[inputs, inputs, inputs / 2, inputs / 3, inputs / 4, output_size, output_size])
net = exp.network              # presumably how `net` in the calls below was obtained
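
Under Python 2 integer division those sizes come out to 8192, 8192, 4096, 2730, 2048, 1025 and 1025 units, which lines up with the 2730 -> 1025, 2799275-parameter "lwout" layer in the log further down (2730 * 1025 weights + 1025 biases = 2799275). A quick check:

# sanity check of the layer sizes (// makes the integer division explicit)
print([inputs, inputs, inputs // 2, inputs // 3, inputs // 4, output_size, output_size])
# -> [8192, 8192, 4096, 2730, 2048, 1025, 1025]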

With the following pretraining:

logging.info("Pretraining")
net.train([ttrain[0:1*trains/4], toutputs[0:1*trains/4]],
          [vtrain[0:1*trains/4], voutputs[0:1*trains/4]],
          algo='layerwise',
          learning_rate=1e-3,
          save_every=25,
          batch_size=32,  # this is small!
          patience=6,
          min_improvement=0.1,
          save_progress="current_pre_brain.pkl",
          momentum=0.9)

I get the following error: after training on layers hid1 and hid2, once it tries to train on hid3 it borks at validation.

I 2015-09-08 12:26:42 downhill.base:402 patience elapsed!
I 2015-09-08 12:26:42 theanets.layers.base:303 layer Feedforward "lwout": (hid3:out)2730 -> 1025, linear, 2799275 parameters
I 2015-09-08 12:26:42 theanets.trainer:250 layerwise: training in -> hid1 -> hid2 -> hid3 -> lwout
I 2015-09-08 12:26:43 downhill.base:378 -- patience = 6
I 2015-09-08 12:26:43 downhill.base:379 -- validate_every = 10
I 2015-09-08 12:26:43 downhill.base:380 -- min_improvement = 0.1
I 2015-09-08 12:26:43 downhill.base:381 -- max_gradient_norm = 0
I 2015-09-08 12:26:43 downhill.base:382 -- max_gradient_elem = 0
I 2015-09-08 12:26:43 downhill.base:383 -- learning_rate = 0.001
I 2015-09-08 12:26:43 downhill.base:384 -- momentum = 0.9
I 2015-09-08 12:26:43 downhill.base:385 -- nesterov = False
I 2015-09-08 12:26:43 downhill.adaptive:220 -- rms_halflife = 14
I 2015-09-08 12:26:43 downhill.adaptive:221 -- rms_regularizer = 1e-08
I 2015-09-08 12:26:43 downhill.base:112 compiling evaluation function
I 2015-09-08 12:26:43 downhill.base:118 compiling RMSProp function
Error allocating 11193000 bytes of device memory (out of memory). Driver report 966656 bytes free and 4294246400 bytes total
Traceback (most recent call last):
  File "stft-theanet.py", line 62, in <module>
    momentum=0.9)
  File "build/bdist.linux-x86_64/egg/theanets/graph.py", line 400, in train
  File "build/bdist.linux-x86_64/egg/theanets/graph.py", line 376, in itertrain
  File "build/bdist.linux-x86_64/egg/theanets/trainer.py", line 253, in itertrain
  File "build/bdist.linux-x86_64/egg/theanets/trainer.py", line 66, in itertrain
  File "/usr/local/lib/python2.7/dist-packages/downhill/base.py", line 388, in iterate
    self._compile()
  File "/usr/local/lib/python2.7/dist-packages/downhill/base.py", line 119, in _compile
    updates = list(self._updates) + list(self._get_updates())
  File "/usr/local/lib/python2.7/dist-packages/downhill/base.py", line 134, in _get_updates
    for var, expr in self._get_updates_for(param, grad):
  File "/usr/local/lib/python2.7/dist-packages/downhill/adaptive.py", line 226, in _get_upda
tes_for
    g2_tm1 = shared_like(param, 'g2_ewma')
  File "/usr/local/lib/python2.7/dist-packages/downhill/util.py", line 45, in shared_like
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/sharedvalue.py", line 208, in
shared
    allow_downcast=allow_downcast, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/var.py", line 203, in flo
at32_shared_constructor
    deviceval = type_support_filter(value, type.broadcastable, False, None)
MemoryError: ('Error allocating 11193000 bytes of device memory (out of memory).', "you might consider using 'theano.shared(..., borrow=True)'")

Yet if I just do regular training it works fine. It does use a lot of GPU memory; it's a big network and I have a lot of training examples.

batch_size = 4096  # way bigger!
logging.info("Finetune Training")
net.train([ttrain, toutputs],
          [vtrain, voutputs],
          algo='rmsprop',
          learning_rate=1e-4,
          save_every=25,
          batch_size=batch_size,
          patience=100,
          min_improvement=0.001,
          save_progress="current_brain.pkl",
          momentum=0.9)

My theory is that shared variables and the like are not being freed appropriately. Looking at the code, I can see that new layers are being created, but I cannot tell how much sharing or copying is being done.
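
One way to check whether each layerwise stage really leaves its shared variables behind would be to log free device memory around the training calls. Below is a minimal sketch with a toy Regressor in place of the real network; it assumes the old Theano CUDA backend, whose mem_info() (reachable through theano.sandbox.cuda.basic_ops) reports free and total device memory in bytes:

import gc
import logging

import numpy as np
import theanets
import theano.sandbox.cuda.basic_ops as sbcuda

logging.basicConfig(level=logging.INFO)

def log_gpu_free(tag):
    # assumption: mem_info() is available on the old CUDA backend and returns (free, total)
    free, total = sbcuda.cuda_ndarray.cuda_ndarray.mem_info()
    logging.info('%s: %d of %d bytes free', tag, free, total)

# toy stand-in for the real network, just to exercise the layerwise trainer
net = theanets.Regressor([64, 48, 32, 16])
data = [np.random.randn(256, 64).astype('float32'),
        np.random.randn(256, 16).astype('float32')]

log_gpu_free('before layerwise')
net.train(data, algo='layerwise', learning_rate=1e-3, patience=1)
gc.collect()  # give Python a chance to drop dead shared variables
log_gpu_free('after layerwise')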

lmjohns3 commented 8 years ago

Yes, I wouldn't be surprised; theanets doesn't try to do any memory management at all, so it's up to Python/Theano to clean up things that have disappeared from the active set. There's probably a bunch that could be done within theanets to help with this, though.
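
For what it's worth, the Theano side of this is that a shared variable's device buffer stays allocated for as long as anything (an old trainer, a compiled function, a stray reference) still points at the variable; it is only released once the last reference is gone and the garbage collector runs. A small illustration, independent of theanets:

import gc

import numpy as np
import theano

# A shared variable keeps its (GPU) buffer alive until the last Python
# reference to it disappears and the garbage collector runs.
w = theano.shared(np.zeros((4096, 4096), dtype='float32'), name='w')
f = theano.function([], w.sum())  # the compiled function also holds a reference to w

del w         # not enough on its own: f still references the buffer
del f         # now nothing references it...
gc.collect()  # ...and collection lets Theano release the memory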