rishizek / tensorflow-deeplab-v3-plus

DeepLabv3+ built in TensorFlow
MIT License

Getting OOM error when using evaluate script. #21

Open deep-unlearn opened 6 years ago

deep-unlearn commented 6 years ago

Hello,

I am getting an out-of-memory (OOM) error when using the evaluate script. I have included all the details in my Stack Overflow question here.

Any help will be appreciated.

rishizek commented 6 years ago

Hi @deep-unlearn , thank you for your interest in the repo.

I'm not sure what caused the out-of-memory error. Did you try running the evaluate script with the GPU disabled? In theory, the evaluate script should run without a GPU, though it takes longer. Also, I noticed you are using Python 2.7, which I have never tested; I only tested the repo with Python 3, and this difference might cause the out-of-memory error.
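In case it helps, here is a minimal sketch of one generic way to hide the GPU from TensorFlow for a single run (nothing repo-specific; you can also just launch the script with CUDA_VISIBLE_DEVICES set to an empty string):

```python
import os

# Hide all CUDA devices before TensorFlow is imported so every op is
# placed on the CPU. "-1" (or an empty string) means no visible GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

print(tf.test.gpu_device_name() or "No GPU visible - running on CPU")
```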

I hope this helps.

deep-unlearn commented 6 years ago

Hello. Indeed, I tried with the GPU disabled (CPU mode); however, the inference result is wrong. I guess this is because the training is done on the GPU, so the model cannot work correctly on the CPU. Probably training and inference have to be done in the same mode. Training on CPU is not an option!

I have also tried Python 3.6, but the same error occurs. Do you think it may be due to a memory leak somehow? When I use a single instance (1 label and 1 image) the code works fine and the result is correct. Obviously the problem is somehow with larger input data.

I can help you fix/improve the code, but I do not know where to start. Any suggestions?

rishizek commented 6 years ago

Hi @deep-unlearn ,

> Indeed, I tried with the GPU disabled (CPU mode); however, the inference result is wrong. I guess this is because the training is done on the GPU, so the model cannot work correctly on the CPU. Probably training and inference have to be done in the same mode. Training on CPU is not an option!

That's strange, because I can run the inference script correctly without a GPU, even though the model was trained on a GPU. Training and inference do not have to be done in the same mode. I'm curious what kind of error occurred when you ran inference in CPU mode.

Also which OS and TensorFlow version did you use to run the code?
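One generic TF 1.x knob that is sometimes worth trying for GPU OOMs is not letting the process grab all GPU memory up front. A minimal sketch, assuming a plain tf.Session (if the script goes through tf.estimator, the same ConfigProto can be passed via tf.estimator.RunConfig(session_config=...)):

```python
import tensorflow as tf

# By default TF 1.x reserves nearly all GPU memory at startup. allow_growth
# makes it allocate incrementally instead, which avoids some spurious OOMs.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory this process may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

with tf.Session(config=config) as sess:
    pass  # restore the checkpoint and run evaluation as usual
```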

deep-unlearn commented 6 years ago

Hello,

OK, interesting to know that inference can run on CPU as well (not for me, though). In CPU mode I'm not getting any error from the system, but the outcome is wrong: all classes are predicted as class zero. When I try the same script on a single image on the GPU (which does not produce an OOM error), the outcome is perfect.

My system runs Ubuntu 17.10. I have multiple versions of TensorFlow through virtualenvs.

Tested on TensorFlow 1.8 and 1.6 (Python 3); in both cases the same error occurs. I have CUDA 9.0 and cuDNN 6 installed.

Which cuDNN version do you have installed? By the way, I am testing the system with output_stride=8, which is more computationally intensive.
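For completeness, this is roughly how I check which TensorFlow build and devices each virtualenv actually picks up (the cuDNN version itself I read from the CUDNN_MAJOR/CUDNN_MINOR defines in cudnn.h, since TF 1.x does not expose it directly):

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TensorFlow:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU device:", tf.test.gpu_device_name() or "none visible")

# List every device TF can see, with its memory limit, to compare
# environments across virtualenvs.
for d in device_lib.list_local_devices():
    print(d.name, d.device_type, d.memory_limit)
```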

rishizek commented 6 years ago

Hi @deep-unlearn ,

I tested TensorFlow 1.5, 1.6, 1.7, and 1.8 with Ubuntu 16.04. Regarding CUDA and cuDNN, I confirm that the model works with CUDA 9.0 and 9.1, and cuDNN 7 and 7.1. Maybe the older cuDNN 6 is causing the problem. I usually test the model with output_stride=16, but output_stride=8 should work, though it is computationally intensive.

deep-unlearn commented 6 years ago

Hello @rishizek

Thank you for your detailed help. I eventually found the problem: OOM occurs when I provide a large image to the model (~5000x5000 pixels). I will try to catch the error and tile the image so I can re-ingest it in smaller patches.
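Roughly what I have in mind for the tiling, as a NumPy sketch only (run_model is a placeholder for however the network is invoked on one patch, and border overlap/blending is ignored):

```python
import numpy as np

def predict_tiled(image, run_model, tile=1024):
    """Run inference tile by tile on a large H x W x 3 image.

    run_model is a placeholder: it takes one image patch and returns a
    label map of the same height and width.
    """
    h, w = image.shape[:2]
    labels = np.zeros((h, w), dtype=np.int32)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = image[y:y + tile, x:x + tile]
            labels[y:y + patch.shape[0], x:x + patch.shape[1]] = run_model(patch)
    return labels
```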

I'll keep you informed on this; it may be helpful for you or other users.

rishizek commented 6 years ago

Hi @deep-unlearn ,

I see. That makes sense. Thank you for letting me know that!