My network is:
a few dense conv blocks (conv with padding, output concatenated back to the input),
a 2-layer LSTM, and
2 Linear layers at the end (a rough sketch is below).
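In code the model looks roughly like this (a simplified sketch with placeholder layer sizes, not my exact model):

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    """Sketch: dense conv blocks -> 2-layer LSTM -> 2 Linear layers."""
    def __init__(self, in_channels=8, growth=8, hidden=32, out_features=1):
        super().__init__()
        # "Dense" blocks: padding keeps the sequence length, and each conv's
        # output is concatenated to its input, so channels grow by `growth`.
        self.conv1 = nn.Conv1d(in_channels, growth, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(in_channels + growth, growth, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(in_channels + 2 * growth, hidden,
                            num_layers=2, batch_first=True)
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, out_features)

    def forward(self, x):                      # x: (batch, channels, seq_len)
        x = torch.cat([x, self.conv1(x)], dim=1)
        x = torch.cat([x, self.conv2(x)], dim=1)
        x = x.transpose(1, 2)                  # (batch, seq_len, channels) for the LSTM
        out, _ = self.lstm(x)
        out = self.fc1(out[:, -1])             # take the last time step
        return self.fc2(out)
```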
Even after I made the network laughably small, all GPU memory (8 GB) was consumed within a few epochs.
I understand that the Apollo optimizer is quasi-Newton and tries to approximate the second derivative, but still - why does memory consumption grow with every epoch?
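To check whether the quasi-Newton state itself is what grows, I can dump the total size of the tensors the optimizer holds (a sketch; it assumes Apollo subclasses torch.optim.Optimizer and keeps its per-parameter buffers in optimizer.state like the built-in optimizers do):

```python
import torch

def optimizer_state_bytes(optimizer):
    """Total bytes of all tensors stored in the optimizer's per-parameter state."""
    total = 0
    for state in optimizer.state.values():   # one state dict per parameter
        for value in state.values():
            if torch.is_tensor(value):
                total += value.element_size() * value.nelement()
    return total

# e.g. after each epoch:
# print(f"optimizer state: {optimizer_state_bytes(optimizer) / 2**20:.1f} MiB")
```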
I tried adding torch.cuda.empty_cache(), torch.clear_autocast_cache() (I don't really understand this one, but who knows) and gc.collect() - after each call consumption dropped a bit, but not as fast as Apollo took it :)
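For context, those cleanup calls sit at the end of each epoch, roughly like this (simplified; model, optimizer, loader and criterion stand in for my real objects, and optimizer is the Apollo instance):

```python
import gc
import torch

def train(model, optimizer, loader, criterion, device, epochs=10):
    model.to(device)
    for epoch in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        # try to reclaim memory after each epoch; usage drops a little here,
        # then keeps climbing during the next epoch
        gc.collect()
        torch.cuda.empty_cache()
        torch.clear_autocast_cache()
        print(f"epoch {epoch}: "
              f"{torch.cuda.memory_allocated(device) / 2**20:.1f} MiB allocated")
```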