tensorflow / models

Models and examples built with TensorFlow

OOM when testing custom model #8317

Open etatbak opened 4 years ago

etatbak commented 4 years ago

I trained a faster_rcnn_nas model on my custom dataset (images resized to 1280x1080). My GPU is an Nvidia Quadro P5000, and I can test the model on that computer. When I test on a GTX 1060, it crashes with a memory error. But when I test the pre-trained faster_rcnn_nas, it works fine. Full error:

ResourceExhaustedError: 2 root error(s) found.

(0) Resource exhausted: OOM when allocating tensor with shape[500,4032,17,17] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

[[{{node MaxPool2D/MaxPool-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/Sum/_275]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[500,4032,17,17] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

[[{{node MaxPool2D/MaxPool-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.
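To put the reported shape in perspective, here is a rough sketch of the arithmetic for the single tensor named in the error. It assumes "type float" means 4-byte float32 and that the tensor is stored densely. The helper names `tensor_bytes` and `to_gib` are made up for this illustration, and the smaller leading dimensions in the loop are hypothetical values for comparison, not settings taken from the thread. The leading 500 is presumably a proposal count rather than an image batch, but the log alone does not confirm that.

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the log is float32


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in bytes of one dense tensor with the given shape."""
    return prod(shape) * bytes_per_element


def to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Shape copied from the OOM message.
oom_shape = (500, 4032, 17, 17)
oom_bytes = tensor_bytes(oom_shape)
print(f"{oom_shape}: {oom_bytes:,} bytes = {to_gib(oom_bytes):.2f} GiB")

# Hypothetical smaller leading dimensions, to show that the size
# scales linearly with the first axis.
for n in (300, 100):
    shape = (n,) + oom_shape[1:]
    print(f"{shape}: {to_gib(tensor_bytes(shape)):.2f} GiB")
```

This one allocation is already a bit over 2 GiB, on top of the model weights and every other live activation. The failing node name suggests a layout transpose, which would need its output buffer in addition to the source tensor, so the peak is likely higher still. That is a plausible reason why a card with far less memory than the Quadro P5000 runs out, though it does not by itself explain why the pre-trained model fits.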

proxip commented 4 years ago

Try lowering your batch_size to 6 or lower.

etatbak commented 4 years ago

> Try lowering your batch_size to 6 or lower.

But is there a batch_size for prediction? I am not training, I am testing.

etatbak commented 4 years ago

> Try lowering your batch_size to 6 or lower.

> But is there a batch_size for prediction? I am not training, I am testing.

@proxip ??

proxip commented 4 years ago

Try restarting your system. I was able to run it again after a restart.

pkulzc commented 4 years ago

Did you train your own model? If you have more than 90 classes (used in pretrained ckpt) or if you have changed any configs, your mem footprint may increase.

etatbak commented 4 years ago

> Did you train your own model? If you have more than 90 classes (used in pretrained ckpt) or if you have changed any configs, your mem footprint may increase.

@pkulzc Yes, I trained my own model with only 2 classes. To be honest, I reduced the image size to 1080x1200 (the original size in the config was 1200x1200). I don't think this change affects performance, or does it?

ColinHsiao commented 4 years ago

I have the same problem; my batch_size is 5.