ijkguo / mx-rcnn

Parallel Faster R-CNN implementation with MXNet.

MEM/HANG: cuda out of memory #35

Closed nankezi closed 7 years ago

nankezi commented 7 years ago

Hello, I used MXNet Faster R-CNN to train on my own data (about 800 images) with alternate training, but it raised "out of memory" after RPN training. I tried a GTX 1080 and a TITAN X; both failed.

ijkguo commented 7 years ago

Can you describe how the out of memory happened? Did usage climb slowly to the maximum, or explode suddenly?

nankezi commented 7 years ago

Training the RPN was fine; after that, while generating detections, memory usage began increasing quickly and then exploded. It was very strange: sometimes it worked and sometimes it raised "out of memory":

mxnet.base.MXNetError: [21:45:16] src/storage/./pooled_storage_manager.h:79: cudaMalloc failed: out of memory

I have another problem. I used your new version with multi-GPU; training the RPN was fine, but generating detections raised an "AssertionError".

ijkguo commented 7 years ago

For the first problem, tune the environment variable MXNET_GPU_MEM_POOL_RESERVE; cf. http://mxnet.io/how_to/env_var.html. For the second problem, can you provide a more detailed error message?
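For reference, the variable can be exported in the shell or set in Python before mxnet is imported. A minimal sketch; the value 20 here is just an illustrative percentage of GPU memory to reserve, not a recommendation from the repo:

import os

# MXNET_GPU_MEM_POOL_RESERVE is a percentage of GPU memory kept out of the pool.
# Set it before importing mxnet so the storage manager picks it up.
os.environ['MXNET_GPU_MEM_POOL_RESERVE'] = '20'  # illustrative value

import mxnet as mx  # import after setting the environment variable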

nankezi commented 7 years ago

@precedenceguo

Traceback (most recent call last):
  File "train_alternate.py", line 101, in <module>
    main()
  File "train_alternate.py", line 98, in main
    alternate_train(args, ctx, args.pretrained, args.epoch, args.rpn_epoch, args.rcnn_epoch)
  File "train_alternate.py", line 31, in alternate_train
    test_rpn(args, ctx[0], 'model/rpn1', rpn_epoch)
  File "/media/D/mx-rcnn/rcnn/tools/test_rpn.py", line 46, in test_rpn
    test_data=test_data, arg_params=arg_params, aux_params=aux_params)
  File "/media/D/mx-rcnn/rcnn/core/tester.py", line 20, in __init__
    self._mod.bind(test_data.provide_data, test_data.provide_label, for_training=False)
  File "/media/D/mx-rcnn/rcnn/core/module.py", line 137, in bind
    force_rebind=False, shared_module=None)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/module/module.py", line 263, in bind
    layout_mapper=self.layout_mapper)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/module/executor_group.py", line 154, in __init__
    self.label_layouts = self.decide_slices(label_shapes)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.7.0-py2.7.egg/mxnet/module/executor_group.py", line 168, in decide_slices
    assert len(data_shapes) > 0
AssertionError

Thanks very much.

atticcas commented 7 years ago

@precedenceguo Hi, I have been running into issues with the newest version of this repo. To be more precise, I tried tuning MXNET_GPU_MEM_POOL_RESERVE, but it doesn't seem to solve the problem. Running the newest version of mx-rcnn, I don't get an out-of-memory error; instead it just hangs when "generating x/y" proposals, where x varies between runs of the program.

I tried to track down where it hangs; it seems to be in module.py's forward function, called from Predictor.predict.

Any ideas are welcome. Thanks!

ijkguo commented 7 years ago

Please try the following: does the demo run correctly? Does dmlc/mxnet/example/rcnn run correctly?

Unfortunately I could not reproduce your problem. The intention of using Module is to remove the dependency on MXNET_GPU_MEM_POOL_RESERVE, so changing it should no longer be necessary.

atticcas commented 7 years ago

The newest demo.py doesn't work properly.

For dmlc/mxnet/example/rcnn: the problem shows up when running on a larger data set (and at higher resolutions; I tested two resolutions, 600x600 and 800x800). When the data set is small (say 200 training pictures), everything seems to work just fine, but the hang happened when I increased the data set to nearly 1000 pictures. It seems less likely to happen on a GPU with more RAM (the TITAN); I have tested this on a TITAN and a 1080.

Any suggestions about what I can try?

ijkguo commented 7 years ago

Allow me to rephrase:

Module-enabled version (this repo):

Non-module version:

atticcas commented 7 years ago

Sorry I didn't make it clear enough.

Module-enabled version (this repo):

Non-module version:

atticcas commented 7 years ago

Hi, I tried your version of MXNet, and it does solve the hang issue (yeah!). But when I tried the latest mx-rcnn version, it doesn't perform as well (way worse than the old version under mxnet/example). I tried to keep all the config the same apart from the newly added config options.

I'm wondering if you have tested the robustness of this repo.

Also, the demo is broken because it needs the nms module, which is now named something else (nms_xxxxx).

ijkguo commented 7 years ago

Truthfully I have not, because this is not a final release. mxnet/example/rcnn is good enough to reproduce the paper, but I am working to improve it.

The demo problem will be fixed soon:

from rcnn.processing.nms import py_nms_wrapper, cpu_nms_wrapper, gpu_nms_wrapper
nms = py_nms_wrapper(NMS_THRESH)
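
For context, a minimal usage sketch, assuming the returned wrapper is applied to an N x 5 per-class detections array of [x1, y1, x2, y2, score] (as in the repo's test code), with NMS_THRESH being the demo's NMS threshold:

keep = nms(dets)      # indices of boxes kept after non-maximum suppression
dets = dets[keep, :]  # filter the detections down to the kept boxes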

Stay tuned.

atticcas commented 7 years ago

Thank you a lot for the hard work you have put into this project. Is there any way I can chip in? I have spent long hours reading this project and would love to help out a bit.

The problem I found is that the RPN network seems to be fine (the proposed regions look good), but the result from the combined RCNN is not. When I use test_rcnn --vis to visualize the results, the boxed areas don't make much sense.

Any idea where I should dig?

ijkguo commented 7 years ago

It is the bbox regression targets' means and stds. Check out https://github.com/precedenceguo/mx-rcnn/blob/master/rcnn/tools/train_rcnn.py#L109

atticcas commented 7 years ago

Actually, I am a bit confused about the means and stds. In the old script, these two values were used to adjust the bbox_weights and bbox_bias parameters after the rcnn1 model had been trained for 8 epochs. I'm not sure about the purpose behind this.

ijkguo commented 7 years ago

It is just a normalization process.
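
To make the idea concrete: the bbox regression targets are standardized with per-coordinate means and stds during training, and after training those constants are folded back into the bbox_pred weight and bias so that test-time outputs come out already un-normalized. A rough NumPy sketch of that fold-back step; the variable names here are illustrative, not the exact ones in train_rcnn.py:

import numpy as np

def fold_normalization_into_bbox_pred(weight, bias, means, stds):
    # weight: (4 * num_classes, feat_dim), bias: (4 * num_classes,)
    # means/stds: the normalization constants used on the regression targets
    stds = np.asarray(stds).ravel()
    means = np.asarray(means).ravel()
    new_weight = weight * stds[:, np.newaxis]  # scale each output row by its std
    new_bias = bias * stds + means             # undo (target - mean) / std on the bias
    return new_weight, new_bias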

ijkguo commented 7 years ago

It is possible that MXNET_CPU_WORKER_NTHREADS must be greater than 1 for the custom op to work on the CPU, which may have caused the hang. Can you try setting this environment variable to a larger value?
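
If it helps, it can be set the same way as the memory-pool variable above, before importing mxnet; the value 2 is only an example:

import os

# Allow more than one CPU worker thread so the custom op can make progress.
os.environ['MXNET_CPU_WORKER_NTHREADS'] = '2'  # example value; try larger if needed

import mxnet as mx  # import after setting the environment variable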

ijkguo commented 7 years ago

MXNet v0.9.2 fixed the custom op problem. To run the demo on the CPU, change ctx = mx.gpu(args.gpu) to ctx = mx.cpu(). Please be patient, since CPU performance is much slower.
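
The change in demo.py would look roughly like this (a sketch; args.gpu comes from the demo's argument parser):

import mxnet as mx

# ctx = mx.gpu(args.gpu)  # original: run on the GPU selected by --gpu
ctx = mx.cpu()            # run the demo on the CPU instead (much slower)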