My network is:
a few dense conv blocks (conv with padding, output concatenated back to the input),
a 2-layer LSTM, and
2 Linear layers at the end (a rough sketch is below).
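In code the model looks roughly like this (a simplified sketch with placeholder layer sizes, not my exact model):

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    """Sketch: dense conv blocks -> 2-layer LSTM -> 2 Linear layers."""
    def __init__(self, in_channels=8, growth=8, hidden=32, out_features=1):
        super().__init__()
        # "Dense" blocks: padding keeps the sequence length, and each conv's
        # output is concatenated to its input, so channels grow by `growth`.
        self.conv1 = nn.Conv1d(in_channels, growth, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(in_channels + growth, growth, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(in_channels + 2 * growth, hidden,
                            num_layers=2, batch_first=True)
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, out_features)

    def forward(self, x):                      # x: (batch, channels, seq_len)
        x = torch.cat([x, self.conv1(x)], dim=1)
        x = torch.cat([x, self.conv2(x)], dim=1)
        x = x.transpose(1, 2)                  # (batch, seq_len, channels) for the LSTM
        out, _ = self.lstm(x)
        out = self.fc1(out[:, -1])             # take the last time step
        return self.fc2(out)
```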
Even after I made the network laughably small, all GPU memory (8 GB) was consumed within a few epochs.
I understand that the Apollo optimizer is quasi-Newton and tries to approximate the second derivative, but still - why does memory consumption grow with every epoch?
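To check whether the quasi-Newton state itself is what grows, I can dump the total size of the tensors the optimizer holds (a sketch; it assumes Apollo subclasses torch.optim.Optimizer and keeps its per-parameter buffers in optimizer.state like the built-in optimizers do):

```python
import torch

def optimizer_state_bytes(optimizer):
    """Total bytes of all tensors stored in the optimizer's per-parameter state."""
    total = 0
    for state in optimizer.state.values():   # one state dict per parameter
        for value in state.values():
            if torch.is_tensor(value):
                total += value.element_size() * value.nelement()
    return total

# e.g. after each epoch:
# print(f"optimizer state: {optimizer_state_bytes(optimizer) / 2**20:.1f} MiB")
```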
I tried adding torch.cuda.empty_cache(), torch.clear_autocast_cache() (I don't really understand this one, but who knows) and gc.collect() - after each call consumption dropped a bit, but not as fast as Apollo took it :)
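For context, those cleanup calls sit at the end of each epoch, roughly like this (simplified; model, optimizer, loader and criterion stand in for my real objects, and optimizer is the Apollo instance):

```python
import gc
import torch

def train(model, optimizer, loader, criterion, device, epochs=10):
    model.to(device)
    for epoch in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        # try to reclaim memory after each epoch; usage drops a little here,
        # then keeps climbing during the next epoch
        gc.collect()
        torch.cuda.empty_cache()
        torch.clear_autocast_cache()
        print(f"epoch {epoch}: "
              f"{torch.cuda.memory_allocated(device) / 2**20:.1f} MiB allocated")
```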