Closed nankezi closed 7 years ago
Can you describe how the out-of-memory happened? Did it climb slowly to the max, or explode?
When training the RPN it was OK, but after that, while generating detections, the memory usage began increasing quickly and exploded. It was very strange: sometimes it was OK and sometimes it raised "out of memory": `mxnet.base.MXNetError: [21:45:16] src/storage/./pooled_storage_manager.h:79: cudaMalloc failed: out of memory`. I have another problem. I used your new version with multi-GPU; training the RPN was OK, but when generating detections it raised an `AssertionError`.
For the first problem, tune the environment variable MXNET_GPU_MEM_POOL_RESERVE; cf. http://mxnet.io/how_to/env_var.html. For the second problem, can you provide a more detailed error message?
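For example, in Python the variable has to be set before mxnet is first imported, since the pooled allocator reads it at startup. A minimal sketch (the value 50 below is only an illustration, not a recommendation):

```python
import os

# MXNET_GPU_MEM_POOL_RESERVE is the percentage of GPU memory that MXNet's
# pooled allocator keeps out of the pool; raising it makes the pool release
# cached blocks to cudaMalloc earlier. It must be set before mxnet is
# imported. The value 50 here is only an illustration.
os.environ["MXNET_GPU_MEM_POOL_RESERVE"] = "50"

# import mxnet as mx  # import only after the variable is set
```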
@precedenceguo

```
Traceback (most recent call last):
  File "train_alternate.py", line 101, in <module>
```
@precedenceguo Hi, I have been running into issues with the newest version of this repo. To be more precise, I tried the MXNET_GPU_MEM_POOL_RESERVE tuning, but it doesn't seem to solve the problem. Running the newest version of mx-rcnn, I don't get an out-of-memory error; instead it just hangs when trying to generate "x/y proposal", where x varies between different runs of the program.
I tried to track down where it hangs. It seems to be in module.py's forward function, called from Predictor's predict.
Any ideas are welcome. Thanks!
Please try the following: Is the demo all right? Is dmlc/mxnet/example/rcnn all right?
Unfortunately I could not reproduce your problem. The intention of using Module is to remove the dependency on MXNET_GPU_MEM_POOL_RESERVE, so changing it is no longer necessary.
The newest demo.py doesn't work properly.
For dmlc/mxnet/example/rcnn, the hang shows up when running on a larger data set (such as higher resolution). I tested two resolutions, 600×600 and 800×800. When the data set is small (say 200 training pictures) everything seems to work just fine, but the hang happened when I increased the data set to nearly 1000 pictures. It seems less likely to happen on a GPU with more RAM: I have tested this on a Titan and a 1080.
Any suggestions about what I can try?
Allow me to rephrase:
Module-enabled version (this repo):
Non-module version:
Sorry I didn't make it clear enough.
Module-enabled version (this repo):
- no mem error
- hangs when generating "x/y proposal" (during training); it only happens in the second test_rpn phase: `test_rpn(args, ctx[0], 'model/rpn2', rpn_epoch)`
- the hang doesn't happen all the time; from my experience, about 50% of runs. I was able to finish the training some of the time.

I will get back to the demo.py problem tomorrow; I need to check my config before getting back to you.

Non-module version:
Hi, I tried your version of mxnet, and it does solve the hang issue (yay!). But when I tried this latest mx-rcnn version, it doesn't perform as well: way worse than the old version under mxnet/example. I tried to keep all the config the same apart from the newly added config terms.
I'm wondering if you have tested the robustness of this repo.
Also, the demo is broken because it needs the nms module, which is now named something else (nms_xxxxx).
Truthfully I have not, because this is not a final release. mxnet/example/rcnn is good enough to reproduce the paper, but I am working to improve it.
The demo problem will be fixed soon:

```python
from rcnn.processing.nms import py_nms_wrapper, cpu_nms_wrapper, gpu_nms_wrapper

nms = py_nms_wrapper(NMS_THRESH)
```

Stay tuned.
Thank you for all the hard work you have put into this project. Is there any way I can chip in? I have spent long hours reading this project and would love to help out a bit.
The problem I found is that the RPN network seems to be fine (the proposal areas look good), but the result from the combined RCNN is not. When I use test_rcnn --vis to visualize the results, the boxed areas don't make much sense.
Any idea where I should dig?
It is the bbox_regression_targets means and stds. Check out https://github.com/precedenceguo/mx-rcnn/blob/master/rcnn/tools/train_rcnn.py#L109
Actually, I am a bit confused about the means and stds. In the old script, these two values were used to adjust the bbox_weights and bbox_bias parameters after the rcnn1 model had been trained for 8 epochs. I'm not sure about the purpose behind this.
It is just a normalization process.
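To sketch the idea (hypothetical names and values, not this repo's exact code): during training, the bounding-box regression targets are whitened with per-coordinate means and stds so the regression loss is well-scaled; after training, those same means and stds can be folded back into the bbox FC layer's weight and bias, so at test time the network emits unnormalized deltas directly and no separate denormalization step is needed.

```python
import numpy as np

# Hypothetical per-coordinate statistics (dx, dy, dw, dh), for illustration.
means = np.array([0.0, 0.0, 0.0, 0.0])
stds = np.array([0.1, 0.1, 0.2, 0.2])

def normalize_targets(targets):
    """Training: whiten the bbox regression targets."""
    return (targets - means) / stds

def fold_into_fc(weight, bias):
    """Testing: fold means/stds back into the bbox FC layer (weight shape
    (4, in_features), bias shape (4,)) so the network predicts
    unnormalized deltas directly."""
    w = weight * stds[:, np.newaxis]  # scale each output row by its std
    b = bias * stds + means           # undo the whitening in the bias
    return w, b
```

Since the FC output is `W @ x + b`, folding gives `stds * (W @ x + b) + means`, which is exactly the denormalization of the raw prediction.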
It is possible that MXNET_CPU_WORKER_NTHREADS must be greater than 1 for a custom op to work on the CPU, which could have caused the hang. Can you try setting this environment variable to a larger value?
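For example (the value 2 is just an illustration; the variable must be set before mxnet is imported):

```python
import os

# MXNET_CPU_WORKER_NTHREADS sets the size of MXNet's CPU worker pool.
# CustomOp callbacks run on these workers, so with only one worker an op
# that blocks while waiting on another operation can deadlock the engine.
os.environ["MXNET_CPU_WORKER_NTHREADS"] = "2"

# import mxnet as mx  # import only after the variable is set
```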
MXNet v0.9.2 fixed the CustomOp problem. To use the CPU demo, change

```python
ctx = mx.gpu(args.gpu)
```

to

```python
ctx = mx.cpu()
```

Please be patient, since CPU performance is much slower.
Hello, I used the mxnet Faster R-CNN to train on my own data (about 800 images) with alternating training, but it raised "out of memory" after RPN training. I used a GTX 1080 and a TITAN X; both failed.