jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License

Different languages use different amounts of GPU memory #58

Open ohanoch opened 3 years ago

ohanoch commented 3 years ago

Hi all,

Problem description

I am running into strange behavior that I can't explain: I am training GlowTTS on roughly the same amount of audio (in terms of seconds, though split across different numbers of files) in three different languages, and I get very different amounts of GPU memory utilization. The GlowTTS code stays the same, except that I replace the "text_to_sequence" function in `__init__.py` in the text directory with my own version, which still ultimately calls the "phoneme_to_sequence" function.
I normalize all the DBs' wavs to be single channel with the same sample rate (22050) and the same normalization scale. The dictionaries for each language are obviously different, but I double-checked that the phonemes for each language are generated properly and only use ordinary characters (a-z, A-Z) plus "#" for emphasis (as in cmudict). The only difference I can think of between the dictionaries is the number of distinct phonemes in them: some languages have more phonemes than others.
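To quantify that last difference, a quick sanity check is to count the distinct phoneme symbols each language's transcripts actually produce. A minimal sketch, assuming each language's training text has already been converted to space-separated phoneme strings (the sample data here is made up for illustration):

```python
from collections import Counter


def phoneme_inventory(phoneme_lines):
    """Count occurrences of each distinct phoneme symbol across a
    list of space-separated phoneme strings."""
    counts = Counter()
    for line in phoneme_lines:
        counts.update(line.split())
    return counts


# Hypothetical per-language transcripts, already converted to phonemes.
lang_a = ["HH AH0 L OW1", "W ER1 L D"]
lang_b = ["a b a #", "c a b"]

inv_a = phoneme_inventory(lang_a)
inv_b = phoneme_inventory(lang_b)
print(len(inv_a), len(inv_b))  # distinct phoneme counts per language
```

The inventory size changes the embedding table's first dimension, which is tiny compared to activations, so by itself it should not explain multi-gigabyte differences.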

For the same batch size, one of the languages takes ~7-8 GB of memory, the second takes ~15 GB, and the third starts at ~4 GB but slowly climbs until, by epoch 40, it uses over 32 GB and crashes (as that is all I have available).

With the third language I also notice far more "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to XXXX" messages than with the other languages. I know it is normal for this to happen a bit, mostly at the beginning of training, and it happens in all languages, but with the third language the scale drops to 512 right at the start of training and to 64 within a few epochs. In my experience this usually means there is some inconsistency or flaw in my DB, but after checking the DB it seems fine.
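For reference, those loss-scale numbers can be read as a count of overflow events: dynamic loss scaling (as in apex amp, which glow-tts uses) halves the scale each time a gradient overflows and skips that optimizer step. A minimal sketch of the arithmetic, assuming apex's default starting scale of 2**15 = 32768 (the overflow pattern below is made up; the periodic scale growth after long overflow-free stretches is omitted for brevity):

```python
def run_loss_scaler(overflow_steps, init_scale=2 ** 15):
    """Simulate dynamic loss scaling: halve the scale on every
    overflowing step, as an apex-style loss scaler does.  Returns
    the final scale and the scale after each overflow."""
    scale = init_scale
    history = []
    for overflowed in overflow_steps:
        if overflowed:
            # "Gradient overflow. Skipping step, loss scaler 0
            #  reducing loss scale to {scale}"
            scale /= 2
            history.append(scale)
    return scale, history


# Six overflows bring the default scale down to 512; nine bring it to 64.
final, _ = run_loss_scaler([True] * 6)
print(final)  # 512.0
```

So a scale of 64 early in training means the scaler has already halved nine times, i.e. gradients with very large magnitudes are showing up repeatedly, which is consistent with something in that language's data (e.g. outlier utterances) stressing the model.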

What I expect

I expect batch size to be the main contributor to GPU memory utilization. I do not understand what else affects the amount of GPU memory used, or why different dictionaries, different numbers of distinct phonemes, or anything of that nature should have any effect. Obviously I am misunderstanding something.

What I did

Until now I managed to get results by lowering the batch size for each language. This is not an ideal solution, but it sort of worked. My problem is that for the third language I have already brought the batch size down to the minimum I feel comfortable with, and it still crashes from lack of memory.

I was thinking of freeing GPU memory after each batch (so only a single batch is stored at a time), e.g. with torch.cuda.empty_cache(), but I was not sure how to do that properly or whether it would have other side effects, since the GPU is clearly storing more than a single batch's worth of data.

What I suspect

I suspect that either the different number of phonemes per dictionary causes the problem, or some difference between the DBs that I can't think of (the number of files in a DB, maybe?) does.
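One DB difference that directly drives memory is utterance length: batches are padded to the longest utterance they contain, so a DB with a few very long files inflates every batch that includes one of them. A minimal sketch of that effect, counting padded frames per batch (the lengths below are made up; actual memory is this padded size times whatever per-frame activations the model keeps):

```python
def padded_batch_cells(lengths, batch_size):
    """Total padded frames per batch: each batch is padded to its
    longest member, so one long utterance inflates the whole batch."""
    batches = [lengths[i:i + batch_size]
               for i in range(0, len(lengths), batch_size)]
    return [len(b) * max(b) for b in batches]


# Two hypothetical DBs with the same total number of frames (1600).
even_db = [400, 410, 390, 400]     # uniform utterance lengths
skewed_db = [200, 200, 200, 1000]  # one long outlier

print(sum(padded_batch_cells(even_db, 4)))    # 1640
print(sum(padded_batch_cells(skewed_db, 4)))  # 4000
```

Both DBs have the same total audio, but the skewed one needs roughly 2.4x the padded space per batch, which could account for different languages using very different amounts of memory at the same batch size.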

Other things I have noticed in the past

I noticed in the past that a small DB causes little to no GPU usage (see issue #37), which doesn't make sense to me either: I would think the GPU stores one batch at a time, so DB size shouldn't matter, only batch size should. Since that is not the case, I clearly have a misunderstanding about what gets stored and run on the GPU at a given time, and that misunderstanding might also be the root of this problem.

Summary of Questions

What might cause different languages to use different amounts of GPU memory?
What gets stored on the GPU at any given time? Is there a way to tell the GPU to reset its memory after each batch? Would this have side effects I can't think of?
Does the shape of the DB (number of files, length of files, etc.) affect GPU memory usage?

Thank you in advance for your time and any help you may be able to provide.