tarepan / UniversalVocoding

A PyTorch implementation of "Robust Universal Neural Vocoding"
https://tarepan.github.io/UniversalVocoding
MIT License

OoM in sample generation #3

Closed · tarepan closed 3 years ago

tarepan commented 3 years ago

Summary

Out of memory (OoM) occurs during sample generation in validation_step.

Error

# print(i)
# print(torch.cuda.memory_allocated())

3487
13751406080
3488
13755338240
3489
13759270400

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.73 GiB total capacity; 12.81 GiB already allocated; 11.88 MiB free; 13.67 GiB reserved in total by PyTorch)  

Memory is consumed gradually, iteration by iteration, until the OoM error occurs.

Condition

Analysis

Error

Memory increase

Memory increases by 3.93216 MB per loop iteration.
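As a rough cross-check: 3.93216 MB/loop × roughly 3,500 loops ≈ 13.76 GB, which is consistent with the ~13.75 GB shown as allocated around iteration 3487 in the log above.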

Debugging

Gradient accumulation in the RNN

torch.no_grad() is used, so gradient accumulation is probably not the problem.
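As a tiny self-contained check (illustrative only, not the actual RNN_MS code), repeated forward passes under torch.no_grad() keep allocated memory roughly flat, since no autograd graph is retained across iterations:

import torch
import torch.nn as nn

# Illustrative check, not the RNN_MS code: under no_grad, repeated forward
# passes do not retain an autograd graph, so allocated memory stays roughly flat.
fc = nn.Linear(512, 512).cuda()
x = torch.randn(1, 512, device="cuda")
with torch.no_grad():
    for i in range(1000):
        x = fc(x)
        if i % 250 == 0:
            print(i, torch.cuda.memory_allocated())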

Generated sample

Even with sample_series = torch.cat ... temporarily disabled, OoM still occurs (the error above was produced under this no-sample-stacking condition).
So the sample stacking is probably not the problem (in theory, the LongTensor series is very small).
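For scale, a quick size check (illustrative length) of why the stacked LongTensor is negligible:

import torch

# Illustrative size check: even 100,000 int64 samples occupy well under 1 MB,
# nowhere near the GB-scale growth observed above.
series = torch.zeros(100_000, dtype=torch.long)
print(series.element_size() * series.nelement())  # 800000 bytes ≈ 0.8 MB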

AR

Even with the autoregressive output-to-input feedback and the hidden-state carry disabled, the same OoM occurs.

Steps

# print(f"Decoder start: {torch.cuda.memory_allocated()}")
Decoder start: 28429824 # == 28 MB
# print(f"before AR loop: {torch.cuda.memory_allocated()}")
before AR loop: 28433920 # == 28 MB

Apparently the problem is inside the AR generation loop. Per-step torch.cuda.memory_allocated() values (in bytes) across one iteration:

loop end:        13751402496

loop start:      13751402496 (±0)
embedded:        13751402496 (±0)
cell executed:   13751404544 (+2048)
output:          13755336704 (+3932160)
softmaxed:       13755336704 (±0)
categoricalized: 13755336704 (±0)
sampled:         13755337216 (+512)
sample t -> t-1: 13755336704 (-512) (note: back to the "categoricalized" level)
loop end:        13755334656 (-2048)

The increase at the "output" step equals the per-loop increase of 3.93216 MB!
If the FC part is split into fc1 -> hidden -> fc2, the memory increases step by step across them (the total increase is exactly the same).
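For reference, a sketch of the kind of per-step instrumentation that produces a trace like the one above. Module names and sizes are illustrative assumptions, not the actual RNN_MS decoder, and this sketch only shows where to place the probes; it does not by itself reproduce the leak.

import torch
import torch.nn as nn

# Per-step memory probes inside an AR loop (assumed decoder structure).
def log(tag):
    print(f"{tag}: {torch.cuda.memory_allocated()}")

emb = nn.Embedding(256, 64).cuda()
cell = nn.GRUCell(64, 896).cuda()
fc1 = nn.Linear(896, 1024).cuda()
fc2 = nn.Linear(1024, 256).cuda()
sample = torch.zeros(1, dtype=torch.long, device="cuda")
hidden = torch.zeros(1, 896, device="cuda")

with torch.no_grad():
    for i in range(5):
        log("loop start")
        x = emb(sample)
        log("embedded")
        hidden = cell(x, hidden)
        log("cell executed")
        o = fc2(torch.relu(fc1(hidden)))
        log("output")
        p = torch.softmax(o, dim=-1)
        log("softmaxed")
        dist = torch.distributions.Categorical(probs=p)
        log("categoricalized")
        sample = dist.sample()
        log("sampled")
        log("loop end")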

Empty loop

An empty loop does not increase memory (it stays at 28433920 bytes, i.e. about 28 MB).

Only embedding & RNN

Running only the embedding & RNN works without OoM.

Emb + RNN + hidden AR

Works well.

With FC eval mode

Even with fc1.eval() and fc2.eval(), the behaviour is unchanged: +3932160 bytes per loop, the same as without an explicit eval(). (This is expected, since .eval() only switches layer behaviour such as dropout and batch-norm and does not change how memory is retained.)

With only ReLU

Works well.

With only fc1

fc1 is enabled, fc2 is disabled.

Independent fc input

Even when the input to the FC layers is generated independently of the RNN output, memory still increases. A sketch of this check follows below.
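A sketch of this isolation check (hypothetical names and sizes; the point is only that the FC input is created fresh each iteration, independent of the RNN):

import torch
import torch.nn as nn

# Hypothetical isolation check: feed the FC stack an input created fresh each
# iteration, independent of any RNN output, and watch allocated memory.
fc1 = nn.Linear(896, 1024).cuda()
fc2 = nn.Linear(1024, 256).cuda()

with torch.no_grad():
    for i in range(100):
        dummy = torch.randn(1, 896, device="cuda")  # not derived from the RNN
        out = fc2(torch.relu(fc1(dummy)))
        print(i, torch.cuda.memory_allocated())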

So what...?

tarepan commented 3 years ago

AMP causes the error.
Filed an issue in PyTorch Lightning (PyTorch-Lightning#5559).
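A possible workaround sketch, assuming the per-loop growth is tied to running the long AR loop inside the autocast region that AMP validation provides: locally disable autocast around the generation loop with torch.cuda.amp.autocast(enabled=False). Module names and sizes below are illustrative; in the real model this would wrap the decoder's generation loop.

import torch
import torch.nn as nn

# Workaround sketch (assumption: the growth is tied to the AR loop running
# inside AMP autocast during validation). Disabling autocast locally keeps
# the generation loop in full precision.
fc1 = nn.Linear(896, 1024).cuda()
fc2 = nn.Linear(1024, 256).cuda()
h = torch.randn(1, 896, device="cuda")

with torch.no_grad(), torch.cuda.amp.autocast(enabled=False):
    for i in range(1000):
        out = fc2(torch.relu(fc1(h)))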