crazyboy9103 opened this issue 1 year ago
I haven't checked in detail, but from skimming your patch, it seems you have replaced batch processing with a loop. Meaning you are trading performance for memory.
As for the graphs, I don't know how they were created. It is odd to me that in both, the memory changes quite heavily over time. For example, what is happening in the lower one, i.e. the one with your patch, at minute 10? And again at minute 20?
> I haven't checked in detail, but from skimming your patch, it seems you have replaced batch processing with a loop. Meaning you are trading performance for memory.
Yes, by replacing the batch processing with a loop I was able to reduce the peak memory and avoid OOM.
> As for the graphs, I don't know how they were created. It is odd to me that in both, the memory changes quite heavily over time. For example, what is happening in the lower one, i.e. the one with your patch, at minute 10? And again at minute 20?
The graphs were automatically generated by wandb. I also thought it was odd, but I haven't found a reason for it. I'm using PyTorch Lightning to train the model and WandbLogger to log metrics, images, etc. It seems like something is causing a memory leak. I've reviewed my code for a couple of days now but haven't found any part that could cause the odd behaviour, as it is no different from the torchvision implementation.
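One way I could narrow down where the leak happens is to log the CUDA allocator counters every step instead of relying on wandb's coarse system metrics. A minimal sketch with a Lightning callback (illustrative, not from my actual training code):

```python
import torch
import pytorch_lightning as pl

class MemoryDebugCallback(pl.Callback):
    """Log allocated and peak CUDA memory every batch to localise a leak."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if torch.cuda.is_available():
            # memory_allocated: tensors currently alive on the GPU
            pl_module.log("gpu/allocated_mb", torch.cuda.memory_allocated() / 2**20)
            # max_memory_allocated: peak since the last reset, i.e. this batch
            pl_module.log("gpu/peak_mb", torch.cuda.max_memory_allocated() / 2**20)
            torch.cuda.reset_peak_memory_stats()
```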
Aside from the gradual increase, the _box_inter_union function has to be modified somehow, as it sharply raises peak memory when the number of boxes is large and can cause more frequent OOMs. A rough estimate is sketched below.
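A back-of-the-envelope estimate (my own illustrative numbers, assuming float32 boxes at RPN-scale box counts):

```python
# The broadcasted intermediates lt, rb and wh in _box_inter_union are
# each [N, M, 2] float32 tensors, so peak memory grows with N * M.
N, M = 200_000, 4_000            # illustrative: RPN anchors vs. candidate boxes
per_tensor = N * M * 2 * 4       # elements * 4 bytes (float32)
print(f"{3 * per_tensor / 2**30:.1f} GiB")  # ~17.9 GiB for lt/rb/wh alone
```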
I'll try to figure out why the GPU memory usage increases on my own, and will leave the issue open.
🐛 Describe the bug
Training Faster R-CNN on a large dataset (~1M images at 512x512 resolution) fails due to CUDA OOM in the RPN. These are the hyperparameters for the experiment:
The _box_inter_union function in torchvision/ops/boxes.py seems to consume a large amount of memory in its tensor operations when len(boxes1) and len(boxes2) are large. I have altered the code as follows to resolve the issue:
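For reference, the broadcasting version in torchvision 0.15 looks roughly like this (simplified; the real code also upcasts to avoid overflow):

```python
import torch
from torch import Tensor
from torchvision.ops.boxes import box_area

def _box_inter_union(boxes1: Tensor, boxes2: Tensor):
    # Materialises [N, M, 2] intermediates (lt, rb, wh) before reducing to [N, M].
    area1 = box_area(boxes1)
    area2 = box_area(boxes2)
    lt = torch.max(boxes1[:, None, :2], boxes2[:, :2])  # [N, M, 2]
    rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])  # [N, M, 2]
    wh = (rb - lt).clamp(min=0)                         # [N, M, 2]
    inter = wh[:, :, 0] * wh[:, :, 1]                   # [N, M]
    union = area1[:, None] + area2 - inter
    return inter, union
```

A minimal sketch of the loop-based replacement (the exact patch may differ in details):

```python
def _box_inter_union_looped(boxes1: Tensor, boxes2: Tensor):
    # Compute one row of the [N, M] intersection matrix at a time,
    # so only [M, 2] intermediates exist at once: less speed, less peak memory.
    area1 = box_area(boxes1)
    area2 = box_area(boxes2)
    inter = boxes1.new_empty((boxes1.shape[0], boxes2.shape[0]))
    for i in range(boxes1.shape[0]):
        lt = torch.max(boxes1[i, :2], boxes2[:, :2])  # [M, 2]
        rb = torch.min(boxes1[i, 2:], boxes2[:, 2:])  # [M, 2]
        wh = (rb - lt).clamp(min=0)                   # [M, 2]
        inter[i] = wh[:, 0] * wh[:, 1]                # [M]
    union = area1[:, None] + area2 - inter
    return inter, union
```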
Below is the GPU memory usage before and after the modification.
It is very odd that the memory usage still increases over time even after the modification; before the modification, GPU memory usage keeps climbing until OOM. This is clearly not expected behaviour. Can anyone help me figure out what is going on?
Versions
[pip3] numpy==1.25.2
[pip3] pytorch-lightning==2.0.8
[pip3] torch==2.0.1
[pip3] torchinfo==1.8.0
[pip3] torchmetrics==1.0.2
[pip3] torchvision==0.15.2
[pip3] triton==2.0.0