Out of Memory during sample generation in validation_step.
Error
# print(i)
# print(torch.cuda.memory_allocated())
3487
13751406080
3488
13755338240
3489
13759270400
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.73 GiB total capacity; 12.81 GiB already allocated; 11.88 MiB free; 13.67 GiB reserved in total by PyTorch)
Memory is consumed gradually (the paired numbers above are the loop index i and torch.cuda.memory_allocated() in bytes), then the OoM error is raised.
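For reference, a minimal sketch of the kind of decoder loop and logging that produces numbers like the above; every module name, type, and size here is an assumption, and the sketch is not guaranteed to reproduce the leak:
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real decoder (names/sizes assumed)
device = "cuda" if torch.cuda.is_available() else "cpu"
emb = nn.Embedding(256, 128).to(device)
rnn = nn.GRU(128, 512, batch_first=True).to(device)
fc1 = nn.Linear(512, 512).to(device)
fc2 = nn.Linear(512, 256).to(device)

token = torch.zeros(1, 1, dtype=torch.long, device=device)
hidden = None
with torch.no_grad():
    for i in range(10):                         # thousands of steps in the real run
        o, hidden = rnn(emb(token), hidden)
        logits = fc2(torch.relu(fc1(o)))
        token = logits.argmax(dim=-1)           # autoregressive feedback
        print(i)
        print(torch.cuda.memory_allocated())    # allocated bytes on the current GPU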
Condition
version: at least commit bdf0c92cd6167ad47b810f560aa7d95bf9ff49f2
execution env: Google Colab, NVIDIA T4 GPU
Analysis
Error
Memory increase
Memory increases by 3.93216 MB per loop iteration.
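Derived from the printed values above (a quick check, nothing assumed beyond the log):
delta = 13755338240 - 13751406080    # iteration 3487 -> 3488
print(delta)                         # 3932160 bytes
print(delta / 1e6)                   # 3.93216 MB per loop
print(delta // 4)                    # 983040 elements, if the leaked tensors are float32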
Debugging
gradient accumulation in RNN
torch.no_grad() is used, so gradient accumulation is probably not the problem.
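One way to double-check this inside the loop (a sketch, continuing the hypothetical emb/rnn/fc1/fc2 setup from the Error section):
with torch.no_grad():
    o, hidden = rnn(emb(token), hidden)
    logits = fc2(torch.relu(fc1(o)))
    # Under no_grad the outputs carry no autograd graph at all:
    assert logits.grad_fn is None and not logits.requires_grad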
Generated sample
Even when sample_series = torch.cat ... is temporarily disabled, OoM still occurs (the error above was captured under this no-sample condition).
So the sample stacking is probably not the problem (in theory, a LongTensor series is very small).
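For reference, the disabled accumulation looks roughly like this (the exact tensor names and dim are assumptions); a (1, T) int64 tensor is only a few tens of KB even after thousands of steps:
# sample_series = torch.zeros(1, 0, dtype=torch.long, device=device)     # hypothetical init
# sample_series = torch.cat([sample_series, token], dim=1)               # the stored series grows by one int64 (8 bytes) per step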
AR
Even with the autoregressive output-to-input feedback and the hidden-state carry-over disabled, the same OoM occurs.
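A sketch of this test, continuing the hypothetical setup above: feed a fixed token and drop the hidden state, so nothing flows from one step to the next:
fixed_token = torch.zeros(1, 1, dtype=torch.long, device=device)
with torch.no_grad():
    for i in range(10):
        o, _ = rnn(emb(fixed_token), None)    # no hidden carry-over
        logits = fc2(torch.relu(fc1(o)))      # output is not fed back in
        print(torch.cuda.memory_allocated())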
steps
# print(f"Decoder start: {torch.cuda.memory_allocated()}")
Decoder start: 28429824 # == 28 MB
# print(f"before AR loop: {torch.cuda.memory_allocated()}")
before AR loop: 28433920 # == 28 MB
increase in "output" is equal to increase per loop "3.93216MB" !
If fc part is separated as fc1 -> hidden -> fc2, increase memory step by step (total increase is totally same).
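A sketch of how the growth can be attributed per sub-layer inside one iteration (hypothetical names, continuing the setup above):
base = torch.cuda.memory_allocated()
h1 = fc1(o)
print(torch.cuda.memory_allocated() - base)    # growth after fc1
h2 = torch.relu(h1)
print(torch.cuda.memory_allocated() - base)    # growth after ReLU
logits = fc2(h2)
print(torch.cuda.memory_allocated() - base)    # total growth for the FC part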
Empty loop
An empty loop does not increase memory (it stays at 28433920 bytes == 28 MB).
Only embedding & RNN
Running only the embedding & RNN works without OoM.
Emb + RNN + hidden AR
Works well.
With FC eval mode
Even with fc1.eval() and fc2.eval(), the behaviour is the same: +3932160 bytes per loop, identical to the run without an explicit eval().
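This is expected: eval() only switches the training flag (it changes the behaviour of modules like Dropout and BatchNorm) and neither releases memory nor disables autograd, so the test is simply:
fc1.eval()    # flips self.training; no effect on a plain nn.Linear
fc2.eval()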
With only ReLU
Works well.
With only fc1
fc1 enabled, fc2 disabled.
+1835008 bytes per loop, roughly half (about 47%) of the fc1 + fc2 growth.
Independent fc input
Even when the FC input is generated independently of the RNN output, memory still increases.
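A sketch of this test, continuing the hypothetical setup above: the FC layers get a fresh random tensor that has nothing to do with the RNN output:
with torch.no_grad():
    for i in range(10):
        dummy = torch.randn(1, 1, 512, device=device)    # independent of the RNN
        logits = fc2(torch.relu(fc1(dummy)))
        print(torch.cuda.memory_allocated())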
Summary
Memory grows by about 3.93 MB per iteration of the AR generation loop, even though torch.no_grad() is used and sample stacking, autoregressive feedback, and eval() have all been ruled out as the cause. An empty loop and embedding + RNN (with or without hidden-state AR) run cleanly; the growth appears only when the fully connected layers are executed (fc1 alone accounts for roughly half of it), and it persists even when the FC input is generated independently of the RNN output.
So what...?