lucidrains / imagen-pytorch

Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch

GPU memory usage vs. GPU utilization #348

Open kirilllzaitsev opened 1 year ago

kirilllzaitsev commented 1 year ago

Hi, I observe the following system metrics:

[screenshot: system metrics showing GPU memory near 100% of the available 24 GB while GPU utilization stays around 25%]

I would expect the GPU to be highly utilized given how much memory is in use, so what is the correct intuition for this metric being so low? Does the ImagenTrainer that I use put lots of objects on the GPU that go unused during training?

TheFusion21 commented 1 year ago

Possible causes of the problem:

  1. Loading of batches is slow (see the timing sketch after this list)
  2. The model is too small to utilize the entire GPU (increase the batch size)
  3. Some other bottleneck (CPU, PCIe link, etc.)
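
To tell these apart, you can time batch loading separately from the GPU step. Below is a minimal sketch in plain PyTorch; `dataloader`, `model`, `optimizer`, and `criterion` are placeholder names for whatever you actually train with, not the ImagenTrainer API:

```python
import time
import torch

def profile_batches(dataloader, model, optimizer, criterion, n_batches=50):
    """Rough split of wall time between data loading and GPU compute."""
    load_time = compute_time = 0.0
    it = iter(dataloader)
    for _ in range(n_batches):
        t0 = time.perf_counter()
        images, targets = next(it)   # time spent here = waiting on the CPU data pipeline
        t1 = time.perf_counter()

        images, targets = images.cuda(), targets.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()     # wait for the GPU so the timing is accurate
        t2 = time.perf_counter()

        load_time += t1 - t0
        compute_time += t2 - t1
    print(f"data loading: {load_time:.2f}s | GPU compute: {compute_time:.2f}s")
```

If loading dominates, cause 1 is your problem; if compute dominates while utilization is still low, look at causes 2 and 3.
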
kirilllzaitsev commented 1 year ago

@TheFusion21 , thank you for the suggestions. But I still can't explain why GPU usage stays at ~25% while GPU memory (which is the blocker for using a larger batch size, model, etc., due to out-of-memory errors) sits at almost 100% of the available 24 GB.

FriedRonaldo commented 1 year ago

In most cases, the major bottleneck is the data loader. If your input images are too large, or require complex processing during the training phase, the GPU has to wait for the CPU work to finish before each step.
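
Before touching the dataset itself, it is worth checking the DataLoader configuration, since these are the knobs that usually matter. A minimal sketch, where `dataset` stands in for whatever Dataset you feed the trainer and the values are starting points rather than tuned settings:

```python
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,                  # placeholder for your actual Dataset
    batch_size=32,
    shuffle=True,
    num_workers=8,            # parallelize decoding/augmentation across CPU processes
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=2,        # batches each worker pre-loads ahead of the GPU
)
```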

One way to resolve this is to pre-process all the images before training, e.g. saving a smaller 64x64 copy of each training image before you start, as sketched below.
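
A minimal sketch of such a one-off preprocessing pass, assuming a flat folder of JPEGs (`data/raw` and `data/64px` are placeholder paths):

```python
from pathlib import Path
from PIL import Image

raw_dir, small_dir = Path("data/raw"), Path("data/64px")
small_dir.mkdir(parents=True, exist_ok=True)

# Resize every image once, so the DataLoader only decodes small files at train time.
for path in raw_dir.glob("*.jpg"):
    with Image.open(path) as img:
        img.convert("RGB").resize((64, 64), Image.LANCZOS).save(small_dir / path.name)
```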

Or, if you train the model across multiple nodes, inter-node communication might cause this issue (e.g., a slow network link between the nodes).