jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License

GPU not being utilized? #37

Open ohanoch opened 3 years ago

ohanoch commented 3 years ago

Hi all,

This might just be me misunderstanding things, but I wanted to ask about GPU utilization.

What I use
I am using Ubuntu 18.04 and nvidia-smi to monitor my GPU (a small polling sketch is at the end of this comment). I have an NVIDIA RTX 2080 Ti in my machine.

What I noticed
Volatile GPU-Util sits at 0% most of the time, occasionally jumping to 50-80% for a second and then dropping back to 0%. The power usage and temperature spike during those seconds but otherwise stay low. That said, power draw during training is around 60W versus 10W when idle, and the temperature is around 47C versus 32C when idle, so the GPU is definitely being used. When I run the training I also see the GPU memory being used as I would expect, and I can see the python process running on it.

What I expect
When I ran other models like Tacotron in the past, Volatile GPU-Util would stay at 70-98% the whole time. I expected the same to happen here.

The Question
Is there a reason my GPU utilization stays so low? Is there something I am doing wrong? Does this mean I am leaving performance on the table that I could leverage somehow? If so, how? Is this happening to anyone else?

Please let me know if you need any more information from me. Thank you in advance for your time and any assistance available!
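
For reference, this is roughly how I watch utilization over time: a minimal polling sketch around nvidia-smi's standard --query-gpu CSV interface (the fields and one-second interval are just what I happen to use; adjust to taste):

```python
import subprocess
import time

# Poll nvidia-smi once per second and print GPU utilization, power draw,
# temperature and memory use. Uses the standard --query-gpu CSV interface.
# Assumes a single GPU (only the first output line is read). Stop with Ctrl+C.
QUERY = "utilization.gpu,power.draw,temperature.gpu,memory.used"

while True:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    util, power, temp, mem = [v.strip() for v in out.splitlines()[0].split(",")]
    print(f"util={util}%  power={power}W  temp={temp}C  mem={mem}MiB")
    time.sleep(1)
```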

ohanoch commented 3 years ago

I just started training on a larger dataset and it seems the GPU is now being used properly (~68% with dips and spikes).

Still, this raises the question of whether there is a way to improve performance for smaller datasets. Right now I am still testing out how to preprocess my datasets and which ones to use, but once I get to the point of optimizing runtime I will post if I come to any conclusions. I will be happy to hear if anyone has found anything or has ideas!

Thanks again for your time!

dechubby commented 3 years ago

I'm just getting used to this, so this may be more of a discussion than an answer, but how about the GPU memory utilization during training? And have you tried tuning the batch size?

ohanoch commented 3 years ago

Thanks for your reply! Indeed this has become more of a discussion topic than a problem. As I mentioned, the GPU memory usage is as expected, around 5-6GB, which is about half of my GPU's memory. That is large enough that I don't want to double the batch size, since usage could spike past what I have and crash the run.
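
In case it helps anyone making the same call, here is a minimal sketch of checking GPU memory headroom with PyTorch's built-in counters before pushing the batch size higher (the rule of thumb in the comments is only illustrative, not something from this repo):

```python
import torch

# Rough check of GPU memory headroom. Run this after a few training steps so
# the peak (max_memory_allocated) reflects real activation memory.
device = torch.device("cuda:0")
props = torch.cuda.get_device_properties(device)

total = props.total_memory
allocated = torch.cuda.memory_allocated(device)    # tensors currently held
peak = torch.cuda.max_memory_allocated(device)     # high-water mark so far

print(f"total     : {total / 2**30:.1f} GiB")
print(f"allocated : {allocated / 2**30:.1f} GiB")
print(f"peak      : {peak / 2**30:.1f} GiB")

# Very rough rule of thumb: if the peak is well under half of total memory,
# a larger batch *might* fit, but activation memory does not always scale
# linearly, so increase the batch size gradually rather than doubling blindly.
```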

v-nhandt21 commented 3 years ago

Have you found the answer? I have the same problem, 0% most of the time.

I think the answer is from this issue: https://github.com/jaywalnut310/glow-tts/issues/16

He said that: "The monotonic alignment search always operates on CPU cores, which means if the number of CPU cores does not increase with the number of gpus proportionally, CPU could take much time to search alignments for the 4 times larger batch than base setting."

Can you share what you tried to improve this, or did you just train it anyway?
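
For anyone who wants to check whether the CPU-side alignment search really is the bottleneck on their own machine, a rough way is to time it against the GPU work inside one training step. A minimal sketch, assuming the search is exposed as monotonic_align.maximum_path(value, mask) as it appears to be in this repo (the exact call site and argument names in models.py may differ):

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Run fn and return (result, seconds). CUDA is synchronized before and
    after so asynchronous GPU kernels are actually included in the timing."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start

# Illustration only: inside models.py, wrap the alignment-search call (and,
# separately, the GPU-side forward pass) with timed(...), e.g.
#
#   attn, t_align = timed(monotonic_align.maximum_path, value, mask)
#   print(f"alignment search on CPU: {t_align * 1000:.0f} ms")
#
# If the alignment time dominates the step time, the CPU search is the bottleneck.
```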

ohanoch commented 3 years ago

I am sorry for my late response, I honestly just don't remember. I think what solved it was that I was using a dataset that was too small; once I used a larger one it worked fine. I am still not sure why the GPU isn't utilized with a small dataset, because as I understand it only a single batch should be running at a given time, so the batch size should determine the GPU usage (which it does) and the dataset size should be irrelevant, since the dataset itself is never fully loaded into GPU memory (which I am apparently wrong about?).
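
One general thing worth checking when utilization sits near 0% and only spikes briefly is whether the GPU is waiting on data loading rather than compute. In plain PyTorch that usually comes down to the DataLoader's worker count and pinned memory. Below is a generic, self-contained sketch (the toy tensors are only a stand-in, not this repo's actual data pipeline, and this is not necessarily how this repo's train.py is configured):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for a real speech dataset, just to make the snippet runnable:
# 1024 fake mel spectrograms and token sequences.
train_dataset = TensorDataset(
    torch.randn(1024, 80, 100),
    torch.randint(0, 40, (1024, 50)),
)

# Generic PyTorch pattern: several worker processes prepare batches in parallel
# so the GPU does not sit idle between steps.
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,     # raise this if CPU cores are idle while GPU util is low
    pin_memory=True,   # page-locked host memory -> faster copies to the GPU
    drop_last=True,
)

for mels, tokens in train_loader:
    # The training step would go here; this loop just shows the loader works.
    pass
```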