tusen-ai / simpledet

A Simple and Versatile Framework for Object Detection and Instance Recognition
Apache License 2.0

[BUG] A lot of threads create and destroy when training cascade rcnn #272

Closed: wkcn closed this issue 4 years ago

wkcn commented 4 years ago

Describe the bug: Hi there. I found that a lot of threads are created and destroyed when training Cascade R-CNN. However, this does not happen when training Faster R-CNN.

Reproduce Procedure:

MXNET_ENGINE_TYPE=NaiveEngine gdb python3
r detection_train.py --config config/cascade_rcnn/cascade_r101v1_fpn_1x.py

Which config are you using: config/cascade_rcnn/cascade_r101v1_fpn_1x.py

Which dataset are you using: MSCOCO

Software info: Linux, CUDA 9; Python: 3.6.6; MXNet: installed by pip (https://github.com/TuSimple/simpledet/blob/master/doc/INSTALL.md)

How did you set up your MXNet for SimpleDet

GDB prints many thread creation and destruction messages.

Additional context: I set MXNet to NaiveEngine mode (MXNET_ENGINE_TYPE=NaiveEngine) in order to disable the creation of extra threads.
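For anyone reproducing this from a script rather than from a shell, the engine setting above can be passed to a child process through its environment. This is a minimal sketch, not part of simpledet: `naive_engine_env` and `run_with_naive_engine` are hypothetical helper names, and only the `MXNET_ENGINE_TYPE=NaiveEngine` setting comes from this issue.

```python
import os
import subprocess
import sys


def naive_engine_env(base_env=None):
    """Return a copy of the environment with MXNet's NaiveEngine selected."""
    env = dict(os.environ if base_env is None else base_env)
    # Same setting as the shell prefix used in the reproduce procedure.
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    return env


def run_with_naive_engine(argv):
    """Run argv as a child process with the NaiveEngine environment.

    Hypothetical helper: argv is a full command line such as
    [sys.executable, "detection_train.py", "--config", "<config path>"].
    """
    return subprocess.run(argv, env=naive_engine_env(), check=False)
```

Setting the variable in the child's environment, instead of mutating `os.environ` in an already running interpreter, avoids depending on when MXNet reads it.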

RogerChern commented 4 years ago

Issue confirmed

wkcn commented 4 years ago

Apart from the dataloader, does any module in simpledet use multiple threads?

RogerChern commented 4 years ago

I think Cascade R-CNN uses the same set of operators as vanilla FPN, but I do not observe any thread creation or destruction when running FPN under gdb.

I will dig deeper.

RogerChern commented 4 years ago

After setting a breakpoint on pthread_create, I got:

(gdb) bt
#0  __pthread_create_2_1 (newthread=0x7fffffff7578, attr=0x7ffff1434580, start_routine=0x7ffff12222e0, arg=0x7fffffff6bd0) at pthread_create.c:505
#1  0x00007ffff12229a0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00007fff3f09ac1f in ?? () from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
#3  0x00007fff3f0a42a9 in ?? () from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
#4  0x00007fff3ec4d377 in mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const () from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
#5  0x00007fff3eb8b454 in ?? () from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
#6  0x00007fff3eb90d5b in ?? () from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
#7  0x00007fff3eb8df9f in ?? () from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
#8  0x00007fff3ec4c7dd in mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&) ()
   from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
#9  0x00007fff3ec50ffc in mxnet::Imperative::InvokeOp(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode, mxnet::OpStatePtr) () from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
#10 0x00007fff3ec51deb in mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&) () from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
#11 0x00007fff3eb3efb9 in ?? () from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
#12 0x00007fff3eb3f5af in MXImperativeInvokeEx () from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
#13 0x00007ffff6911e20 in ffi_call_unix64 () from /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so
#14 0x00007ffff691188b in ffi_call () from /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so
#15 0x00007ffff690c01a in _ctypes_callproc () from /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so
#16 0x00007ffff68fffcb in ?? () from /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so
#17 0x00000000005c20e7 in PyObject_Call ()
#18 0x000000000053b656 in PyEval_EvalFrameEx ()
#19 0x00000000005401ef in ?? ()
#20 0x000000000053bc93 in PyEval_EvalFrameEx ()
#21 0x0000000000540b0b in PyEval_EvalCodeEx ()
#22 0x00000000004ec3f7 in ?? ()
#23 0x00000000005c20e7 in PyObject_Call ()
#24 0x0000000000538cab in PyEval_EvalFrameEx ()
#25 0x000000000053fc97 in ?? ()
#26 0x000000000053b83f in PyEval_EvalFrameEx ()
#27 0x000000000053b294 in PyEval_EvalFrameEx ()
#28 0x0000000000540b0b in PyEval_EvalCodeEx ()
#29 0x00000000004ec2e3 in ?? ()
#30 0x00000000005c20e7 in PyObject_Call ()
#31 0x00000000004fbfce in ?? ()
#32 0x00000000005c20e7 in PyObject_Call ()
#33 0x0000000000574db6 in ?? ()
#34 0x00000000005c20e7 in PyObject_Call ()
#35 0x000000000053b656 in PyEval_EvalFrameEx ()
#36 0x000000000054124a in PyEval_EvalCodeEx ()
#37 0x00000000004ec2e3 in ?? ()
#38 0x00000000005c20e7 in PyObject_Call ()
#39 0x0000000000534870 in PyEval_CallObjectWithKeywords ()
#40 0x00007ffff69063fd in ?? () from /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so

Nothing very intuitive comes up. I will try with a debug build later today.
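One thing the backtrace above does show is which shared library each frame belongs to: frame #1, the direct caller of pthread_create, resolves to libgomp.so.1 (the GNU OpenMP runtime), and the unsymbolized frames above it are in libmxnet.so. The sketch below is only a reading aid for long gdb output, not a diagnosis; `libraries_by_frame` is a hypothetical helper, and the sample text is abbreviated from the trace above.

```python
import re

FRAME_RE = re.compile(r"^#(\d+)\s")         # start of a gdb frame line
FROM_RE = re.compile(r"\bfrom (\S+)\s*$")   # trailing "from /path/to/lib.so"


def libraries_by_frame(backtrace_text):
    """Map gdb frame number -> library basename (None if gdb printed no 'from')."""
    frames = {}
    current = None
    for line in backtrace_text.splitlines():
        start = FRAME_RE.match(line)
        if start:
            current = int(start.group(1))
            frames[current] = None
        if current is None:
            continue
        # gdb may wrap "from <lib>" onto a continuation line, so every line
        # after a frame header is checked until the next header appears.
        lib = FROM_RE.search(line)
        if lib:
            frames[current] = lib.group(1).rsplit("/", 1)[-1]
    return frames


# Frames abbreviated from the backtrace in this thread ("..." marks elided arguments).
SAMPLE = """\
#0  __pthread_create_2_1 (...) at pthread_create.c:505
#1  0x00007ffff12229a0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#8  0x00007fff3ec4c7dd in mxnet::imperative::PushFCompute(...) ()
   from /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so
"""

print(libraries_by_frame(SAMPLE))
```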

wkcn commented 4 years ago

Thank you so much! I noticed that simpledet only uses multiple threads in the dataloader, but that is not the cause. I will check whether MXNet repeatedly creates threads for CustomOp, since the input shape is variable.
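Thread churn of this kind can also be observed without gdb by sampling the kernel thread IDs of the training process. A minimal sketch, assuming Linux with `/proc` mounted; the function names are hypothetical and nothing here is MXNet-specific. Sampling can miss threads that live for less than one interval, and the kernel may reuse thread IDs, so the counts are a lower-bound estimate.

```python
import os
import time


def list_thread_ids(pid="self"):
    """Return the set of kernel thread IDs of a process (Linux only, via /proc)."""
    return {int(name) for name in os.listdir(f"/proc/{pid}/task")}


def sample_threads(duration_s=1.0, interval_s=0.05, pid="self"):
    """Collect snapshots of the thread IDs of a process at a fixed interval."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(list_thread_ids(pid))
        time.sleep(interval_s)
    return samples


def count_churn(samples):
    """Count thread IDs that appear and disappear between consecutive snapshots."""
    created = destroyed = 0
    for before, after in zip(samples, samples[1:]):
        created += len(after - before)
        destroyed += len(before - after)
    return created, destroyed
```

Passing the PID of a running training job as `pid` to `sample_threads` and comparing `count_churn` between the Cascade R-CNN and Faster R-CNN configs would give a rough picture of the difference described in this issue.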

wkcn commented 4 years ago

I built MXNet from the latest source, and the issue no longer occurs.

RogerChern commented 4 years ago

Glad to see the issue is solved. I will bump the prebuilt wheel version shortly.