kirilllzaitsev opened this issue 1 year ago
The cause of the problem could be:
@TheFusion21, thank you for the suggestions. But I still can't explain why GPU usage stays at ~25% while GPU memory, which is the blocker for using a larger batch size, model, etc. due to out-of-memory errors, sits at almost 100% of the available 24 GB.
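One thing worth separating here: the memory figure reported by `nvidia-smi` is mostly what PyTorch's caching allocator has reserved, which can be much larger than what live tensors actually need, so full memory does not by itself imply high compute utilization. A minimal sketch (not specific to `ImagenTrainer`) for comparing the two:

```python
import torch

device = torch.device("cuda:0")

allocated = torch.cuda.memory_allocated(device)   # bytes held by live tensors
reserved = torch.cuda.memory_reserved(device)     # bytes held by PyTorch's caching allocator
total = torch.cuda.get_device_properties(device).total_memory

print(f"allocated: {allocated / 1e9:.2f} GB")
print(f"reserved:  {reserved / 1e9:.2f} GB  (roughly what nvidia-smi reports)")
print(f"total:     {total / 1e9:.2f} GB")
```

If `allocated` is far below `reserved`, most of the "used" memory is just cache, and the ~25% utilization is more likely an input-pipeline issue than the model itself.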
In most cases, the major bottleneck is the data loader. If your input images are too large or need heavy processing during training, the GPU has to wait for the CPU-side pipeline to finish before it can do any work.
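For reference, a minimal sketch of the `DataLoader` settings that usually relieve a CPU-side bottleneck (the dataset here is a dummy placeholder, not taken from this thread):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real image dataset (hypothetical placeholder).
dataset = TensorDataset(torch.randn(1024, 3, 64, 64))

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=8,            # decode/augment images in parallel worker processes
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs (needs num_workers > 0)
)
```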
To resolve this issue, you can pre-process all the images before training, e.g. save a smaller 64x64 copy of each training image before you start training; see the sketch below.
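A rough sketch of that one-off preprocessing step, assuming a flat directory of JPEGs (the paths are placeholders, not from this thread):

```python
from pathlib import Path
from PIL import Image

src_dir = Path("data/images_full")  # hypothetical source directory
dst_dir = Path("data/images_64")    # hypothetical output directory
dst_dir.mkdir(parents=True, exist_ok=True)

for path in src_dir.glob("*.jpg"):
    with Image.open(path) as img:
        # Resize once ahead of time so the training loader only reads small files.
        img.convert("RGB").resize((64, 64)).save(dst_dir / path.name)
```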
Alternatively, if you train the model across multiple nodes, the communication between the nodes might cause this issue (for example, a slow inter-node network).
Hi, I observe the following system metrics:
Since I would expect the GPU to be highly utilized given how much memory is used, what is the correct intuition for this metric being that low? Does the ImagenTrainer that I use put lots of objects on the GPU that are unused during training?
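One way to sanity-check this, assuming a standard PyTorch training loop (`step_fn` below is a hypothetical stand-in for one trainer update, not part of `ImagenTrainer`'s API): time how long each iteration waits on the data loader versus on the GPU step.

```python
import time
import torch

def profile_loop(loader, step_fn, device, num_iters=50):
    # step_fn(batch) is a hypothetical callable running one forward/backward pass.
    wait, compute = 0.0, 0.0
    it = iter(loader)
    for _ in range(num_iters):
        t0 = time.perf_counter()
        try:
            batch = next(it)              # time spent waiting on the CPU-side loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)
        torch.cuda.synchronize(device)    # make sure the GPU work has actually finished
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    print(f"loader wait: {wait:.2f}s  |  gpu step: {compute:.2f}s")
```

If the loader wait dominates, the low utilization points at the input pipeline rather than at unused objects sitting on the GPU.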