Out of Memory during sample generation in validation_step.
Error
# print(i)
# print(torch.cuda.memory_allocated())
3487
13751406080
3488
13755338240
3489
13759270400
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.73 GiB total capacity; 12.81 GiB already allocated; 11.88 MiB free; 13.67 GiB reserved in total by PyTorch)
Memory is consumed gradually (the paired numbers above are the loop index i and torch.cuda.memory_allocated() in bytes), then the OoM error is raised.
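For reference, a minimal sketch of the kind of decoder loop and logging that produces numbers like the above; every module name, type, and size here is an assumption, and the sketch is not guaranteed to reproduce the leak:
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real decoder (names/sizes assumed)
device = "cuda" if torch.cuda.is_available() else "cpu"
emb = nn.Embedding(256, 128).to(device)
rnn = nn.GRU(128, 512, batch_first=True).to(device)
fc1 = nn.Linear(512, 512).to(device)
fc2 = nn.Linear(512, 256).to(device)

token = torch.zeros(1, 1, dtype=torch.long, device=device)
hidden = None
with torch.no_grad():
    for i in range(10):                         # thousands of steps in the real run
        o, hidden = rnn(emb(token), hidden)
        logits = fc2(torch.relu(fc1(o)))
        token = logits.argmax(dim=-1)           # autoregressive feedback
        print(i)
        print(torch.cuda.memory_allocated())    # allocated bytes on the current GPU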
Condition
version: at least commit bdf0c92cd6167ad47b810f560aa7d95bf9ff49f2
execution env: Google Colab, NVIDIA T4 GPU
Analysis
Error
Memory increase
Memory increases by 3.93216 MB per loop iteration.
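Derived from the printed values above (a quick check, nothing assumed beyond the log):
delta = 13755338240 - 13751406080    # iteration 3487 -> 3488
print(delta)                         # 3932160 bytes
print(delta / 1e6)                   # 3.93216 MB per loop
print(delta // 4)                    # 983040 elements, if the leaked tensors are float32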
Debugging
gradient accumulation in RNN
torch.no_grad() is used, so gradient accumulation is probably not the problem.
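One way to double-check this inside the loop (a sketch, continuing the hypothetical emb/rnn/fc1/fc2 setup from the Error section):
with torch.no_grad():
    o, hidden = rnn(emb(token), hidden)
    logits = fc2(torch.relu(fc1(o)))
    # Under no_grad the outputs carry no autograd graph at all:
    assert logits.grad_fn is None and not logits.requires_grad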
Generated sample
Even when sample_series = torch.cat ... is temporarily disabled, OoM still occurs (the error above was captured under this no-sample condition).
So the sample stacking is probably not the problem (in theory, a LongTensor series is very small).
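For reference, the disabled accumulation looks roughly like this (the exact tensor names and dim are assumptions); a (1, T) int64 tensor is only a few tens of KB even after thousands of steps:
# sample_series = torch.zeros(1, 0, dtype=torch.long, device=device)     # hypothetical init
# sample_series = torch.cat([sample_series, token], dim=1)               # the stored series grows by one int64 (8 bytes) per step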
AR
Even with the autoregressive output-to-input feedback and the hidden-state carry-over disabled, the same OoM occurs.
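A sketch of this test, continuing the hypothetical setup above: feed a fixed token and drop the hidden state, so nothing flows from one step to the next:
fixed_token = torch.zeros(1, 1, dtype=torch.long, device=device)
with torch.no_grad():
    for i in range(10):
        o, _ = rnn(emb(fixed_token), None)    # no hidden carry-over
        logits = fc2(torch.relu(fc1(o)))      # output is not fed back in
        print(torch.cuda.memory_allocated())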
steps
# print(f"Decoder start: {torch.cuda.memory_allocated()}")
Decoder start: 28429824 # == 28 MB
# print(f"before AR loop: {torch.cuda.memory_allocated()}")
before AR loop: 28433920 # == 28 MB
increase in "output" is equal to increase per loop "3.93216MB" !
If fc part is separated as fc1 -> hidden -> fc2, increase memory step by step (total increase is totally same).
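A sketch of how the growth can be attributed per sub-layer inside one iteration (hypothetical names, continuing the setup above):
base = torch.cuda.memory_allocated()
h1 = fc1(o)
print(torch.cuda.memory_allocated() - base)    # growth after fc1
h2 = torch.relu(h1)
print(torch.cuda.memory_allocated() - base)    # growth after ReLU
logits = fc2(h2)
print(torch.cuda.memory_allocated() - base)    # total growth for the FC part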
Empty loop
An empty loop does not increase memory (it stays at 28433920 bytes == 28 MB).
Only embedding & RNN
Running only the embedding & RNN works without OoM.
Emb + RNN + hidden AR
Works well.
With FC eval mode
Even with fc1.eval() and fc2.eval(), the behaviour is the same: +3932160 bytes per loop, identical to the run without an explicit eval().
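This is expected: eval() only switches the training flag (it changes the behaviour of modules like Dropout and BatchNorm) and neither releases memory nor disables autograd, so the test is simply:
fc1.eval()    # flips self.training; no effect on a plain nn.Linear
fc2.eval()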
With only ReLU
Works well.
With only fc1
fc1 enabled, fc2 disabled.
+1835008 bytes per loop, roughly half (about 47%) of the fc1 + fc2 growth.
Independent fc input
Even when the FC input is generated independently of the RNN output, memory still increases.
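A sketch of this test, continuing the hypothetical setup above: the FC layers get a fresh random tensor that has nothing to do with the RNN output:
with torch.no_grad():
    for i in range(10):
        dummy = torch.randn(1, 1, 512, device=device)    # independent of the RNN
        logits = fc2(torch.relu(fc1(dummy)))
        print(torch.cuda.memory_allocated())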
Summary
Memory grows by about 3.93 MB per iteration of the AR generation loop, even though torch.no_grad() is used and sample stacking, autoregressive feedback, and eval() have all been ruled out as the cause. An empty loop and embedding + RNN (with or without hidden-state AR) run cleanly; the growth appears only when the fully connected layers are executed (fc1 alone accounts for roughly half of it), and it persists even when the FC input is generated independently of the RNN output.
So what...?