Closed aggpankaj2 closed 6 years ago
For `train_multi.py`, you need to run it with OpenMPI:
`mpiexec -n <gpu_num> python train_multi.py`
See https://github.com/knorth55/chainer-light-head-rcnn/tree/master/examples#command-1.
@knorth55 Thanks for your reply. I was able to run `python3 eval_coco.py` and got approximately the same mAP and mAR as you reported, but for training on COCO I am still getting a segmentation fault even after adding `mpiexec`:
root@f05974bf4c4a:/dh/home/administrator/users_local/mamta/LightHead/chainer-light-head-rcnn-master/examples# **mpiexec --allow-run-as-root -n 1 python3 train_multi.py**
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/QTVSN6X4AXDIJMNLGSD6IEHOYI:/var/lib/docker/overlay2/l/QYLKVWNWNHYIEQP4W4BWSSTPW3:/var/lib/docker/overlay2/l/WI2GBLWIHNB543NQSSK7AKMLCG:/var/lib/docker/overlay2/l/GWC7AQOI4KOS6H37EUWY2RZWSI:/var/lib/docker/overlay2/l/HL5VDFW3HCJJBSB5OLQ7QADW23:/var/lib/docker/overlay2/l/PNTMHO5BDJLRSGTBRUZEQ73HD6:/var/lib/docker/overlay2/l/ZPWY2JADS66C4QL5GPXNSBOIYI:/var/lib/docker/overlay2/l/5JIJU7RG4LKAO5ZIGSS45QMWGS:/var/lib/docker/overlay2/l/HTWICUQOLX44F'
Unexpected end of /proc/mounts line `TCHYFFPWO3D7P:/var/lib/docker/overlay2/l/2QUEBMOYONLHTVMRGNENJZTKGO:/var/lib/docker/overlay2/l/CXWHQ4ZM6P3RPQ4DEF5KNRUV4O:/var/lib/docker/overlay2/l/56OISY77PE7HJ7ZBOJM3LPQUXT:/var/lib/docker/overlay2/l/JCLMNQ2RXZ5CJ7ETKT6I6DH67P:/var/lib/docker/overlay2/l/XIVJUUJ5ZXBIBDTSYKH5VKYY7F:/var/lib/docker/overlay2/l/CQK6H47ETB3WV4XGYOBIS4VC37:/var/lib/docker/overlay2/l/U2254MIWCELYJF4M2JWCZOIICB:/var/lib/docker/overlay2/l/ASXCFMVJFXJXM2MUOVXGFUUINY:/var/lib/docker/overlay2/l/SEPGTNLEC5UA3Q57WBXMC5XEUG:/var/lib/do'
Unexpected end of /proc/mounts line `cker/overlay2/l/IRQNNWIOSDTDUJOZ6HSEI3FB6K:/var/lib/docker/overlay2/l/IVDRXRFDBY3DXD4H3KPDM2ZH3Q:/var/lib/docker/overlay2/l/Z2OFTFK7QAV5MLRDCVB2YVRVLV:/var/lib/docker/overlay2/l/JG2NNHIWCI7ZPBQNKA7PGOFTQQ:/var/lib/docker/overlay2/l/5FXDT2BT5M6FXPTLHJCJUB4F44:/var/lib/docker/overlay2/l/OUW3B6TM3SFVEX2EK2QBH7I375:/var/lib/docker/overlay2/l/NNF4RKWHA5MR5I6T5IHDCD52B5:/var/lib/docker/overlay2/l/DBZIWLZZJDU52G4I7DM45TAEBO:/var/lib/docker/overlay2/l/URZQDD4SL5ZKKDIINZRGLUT7HF:/var/lib/docker/overlay2/l/TYPR2LEGF'
[f05974bf4c4a:05892] *** Process received signal ***
[f05974bf4c4a:05892] Signal: Segmentation fault (11)
[f05974bf4c4a:05892] Signal code: Invalid permissions (2)
[f05974bf4c4a:05892] Failing at address: 0x1036a600000
[f05974bf4c4a:05892] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f972a2df390]
[f05974bf4c4a:05892] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x14e116)[0x7f972a052116]
[f05974bf4c4a:05892] [ 2] /usr/lib/libopen-pal.so.13(+0x2fcb7)[0x7f967e939cb7]
[f05974bf4c4a:05892] [ 3] /usr/lib/libmpi.so.12(ompi_datatype_sndrcv+0x54c)[0x7f967f07e6bc]
[f05974bf4c4a:05892] [ 4] /usr/lib/libmpi.so.12(MPI_Alltoall+0x16c)[0x7f967f08067c]
[f05974bf4c4a:05892] [ 5] /usr/local/lib/python3.5/dist-packages/mpi4py/MPI.cpython-35m-x86_64-linux-gnu.so(+0xec549)[0x7f967f3f2549]
[f05974bf4c4a:05892] [ 6] python3(PyCFunction_Call+0x77)[0x4e1117]
[f05974bf4c4a:05892] [ 7] python3(PyEval_EvalFrameEx+0x614)[0x5240b4]
[f05974bf4c4a:05892] [ 8] python3(PyEval_EvalFrameEx+0x49c4)[0x528464]
[f05974bf4c4a:05892] [ 9] python3(PyEval_EvalFrameEx+0x49c4)[0x528464]
[f05974bf4c4a:05892] [10] python3(PyEval_EvalCodeEx+0x13b)[0x52dd1b]
[f05974bf4c4a:05892] [11] python3[0x4e31c8]
[f05974bf4c4a:05892] [12] python3(PyObject_Call+0x47)[0x5b5da7]
[f05974bf4c4a:05892] [13] python3(PyEval_EvalFrameEx+0x26bd)[0x52615d]
[f05974bf4c4a:05892] [14] python3(PyEval_EvalFrameEx+0x49c4)[0x528464]
[f05974bf4c4a:05892] [15] python3(PyEval_EvalFrameEx+0x49c4)[0x528464]
[f05974bf4c4a:05892] [16] python3[0x52d45f]
[f05974bf4c4a:05892] [17] python3(PyEval_EvalFrameEx+0x509f)[0x528b3f]
[f05974bf4c4a:05892] [18] python3[0x52d45f]
[f05974bf4c4a:05892] [19] python3(PyEval_EvalFrameEx+0x54f3)[0x528f93]
[f05974bf4c4a:05892] [20] python3[0x52cf19]
[f05974bf4c4a:05892] [21] python3(PyEval_EvalCode+0x1f)[0x52dbcf]
[f05974bf4c4a:05892] [22] python3[0x601682]
[f05974bf4c4a:05892] [23] python3(PyRun_FileExFlags+0x9a)[0x603b2a]
[f05974bf4c4a:05892] [24] python3(PyRun_SimpleFileExFlags+0x1bc)[0x603d1c]
[f05974bf4c4a:05892] [25] python3(Py_Main+0x456)[0x63e756]
[f05974bf4c4a:05892] [26] python3(main+0xe1)[0x4cfbd1]
[f05974bf4c4a:05892] [27] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f9729f24830]
[f05974bf4c4a:05892] [28] python3(_start+0x29)[0x5d46c9]
[f05974bf4c4a:05892] *** End of error message ***
mpiexec noticed that process rank 0 with PID 5892 on node f05974bf4c4a exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
@aggpankaj2 There is some information I need and some debugging I want you to do.

- Your OpenMPI may be broken or incorrectly built. Did you build OpenMPI with the `--with-cuda` flag?
- Please try the ChainerMN troubleshooting guide (https://chainermn.readthedocs.io/en/stable/installation/troubleshooting.html).
- The default `batch_size` is 2. Can you give me your GPU information? Also, please try running `train_multi.py` with `--batch-size 1`.
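As a quick check (a sketch, not from the thread): `ompi_info`, which ships with OpenMPI, reports whether the build on your `PATH` is CUDA-aware:

```shell
# ompi_info ships with OpenMPI; a CUDA-aware build reports
# "mpi_built_with_cuda_support:value:true" among its MCA parameters.
ompi_info --parsable --all 2>/dev/null | grep mpi_built_with_cuda_support:value \
  || echo "ompi_info not found, or CUDA support not reported"
```

If the value is `false`, the OpenMPI you are launching with was not configured with `--with-cuda`.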
@knorth55

root@f05974bf4c4a:/dh/home/administrator/users_local/mamta/LightHead/chainer-light-head-rcnn-master# which mpicc
/usr/local/bin/mpicc
root@f05974bf4c4a:/dh/home/administrator/users_local/mamta/LightHead/chainer-light-head-rcnn-master# mpicc -show
gcc -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -pthread -Wl,-rpath -Wl,/usr/lib/openmpi/lib -Wl,--enable-new-dtags -L/usr/lib/openmpi/lib -lmpi
root@f05974bf4c4a:/dh/home/administrator/users_local/mamta/LightHead/chainer-light-head-rcnn-master# which mpiexec
/usr/local/bin/mpiexec
root@f05974bf4c4a:/dh/home/administrator/users_local/mamta/LightHead/chainer-light-head-rcnn-master# mpiexec --version
mpiexec (OpenRTE) 1.10.2
I also tried with batch size 1, but I am still getting the same error.
@aggpankaj2 How did you install `mpiexec`? With the `--with-cuda` flag?
@knorth55 Yes, I built it with CUDA. I followed the three Open MPI steps given at https://chainermn.readthedocs.io/en/stable/installation/guide.html#mpi-install for openmpi-3.1.2 (for details, see the official instructions): `./configure --with-cuda`, `make -j4`, `sudo make install`.
Hmm, I have no idea... Have you passed all of the ChainerMN installation troubleshooting checks? It looks like an OpenMPI / mpi4py problem. Have you tried installing mpi4py, ChainerMN, and everything else with conda?
Hi @knorth55, please check all the steps I did for running the demo, and suggest where I am going wrong.

1. `pip3 install opencv-python cupy cupy-cuda80 chainer chainercv pillow cython`
2. Downloaded mpi4py, then `cd mpi4py`
2.1 `pip3 install -e .`
3. `apt-get install libopenmpi-dev`
4. Downloaded chainermn, then `cd chainermn`
4.1 `pip3 install -e .`
4.2 `cd chainermn-master`
4.3 `mpiexec --allow-run-as-root -n 4 python3 examples/mnist/train_mnist.py` (I was able to run the `mpiexec --allow-run-as-root -n 4` command on the MNIST data)
5. `python3 demo.py image_path`
6. `python3 train_multi.py`

I got a segmentation fault again. Could you please tell me what other steps are needed for `train_multi.py`?
@knorth55 I am not using conda. Is there any need to install MVAPICH (https://chainermn.readthedocs.io/en/stable/installation/guide.html#mpi-install)?
Is the `apt-get install libopenmpi-dev` step necessary? I installed and compiled OpenMPI from source, and I installed mpi4py from a pip binary.
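For what it's worth, mixing an apt-provided OpenMPI (pulled in by `libopenmpi-dev`, version 1.10.x on Ubuntu 16.04) with a source build installed under `/usr/local` is a common cause of segfaults like this: mpi4py can end up linked against a different MPI than the `mpiexec` used to launch it. A sketch (not from the thread) to see which installation each tool comes from:

```shell
# Where do the MPI tools on PATH come from, and which MPI was mpi4py built with?
# Prefixes that disagree (e.g. /usr/lib/openmpi vs /usr/local) suggest a mismatch.
which mpicc mpiexec 2>/dev/null || true
mpiexec --version 2>/dev/null || true
python3 -c "import mpi4py; print(mpi4py.get_config())" 2>/dev/null \
  || echo "mpi4py not importable"
```

If they disagree, putting the intended `mpicc` first on `PATH` and then running `pip3 install --no-cache-dir --force-reinstall mpi4py` rebuilds mpi4py against the right MPI.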
@knorth55 Thanks! Training on the COCO dataset is now running. What changes do I need to make in order to train on another dataset?
@aggpankaj2 Good! What did you do to fix it? Can you give me more detailed information about how you solved it?
In order to train with another dataset, you need to write a dataset class and modify `train_multi.py` to load that dataset class.
@knorth55 I did the same steps, but on another Linux machine. The steps are given below:

1. `pip3 install opencv-python pillow cupy-cuda80 cupy`
2. `cd openmpi-3.1.2; ./configure --with-cuda && make -j4 && make install`
3. `pip3 install chainer chainercv`
4. `apt-get install libopenmpi-dev` (necessary for mpi4py; otherwise building the wheel for mpi4py failed)
5. `pip3 install chainermn`
6. `cd chainermn-master; pip install -e .`
7. `cd mpi4py-3.0.0; python3 setup.py install` (I had not done this previously)
8. `pip3 install -U numpy`
9. `pip3 install -e .`

These are all the steps I followed.
For another dataset: `train_multi.py` imports from `chainercv.datasets`, which is where the Python files for the COCO dataset live. Where should the new dataset class be written (inside `chainercv.datasets`)?
It is better to write your own dataset class in your own package.
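For illustration, such a dataset class usually just implements `__len__` and `get_example`, returning the `(img, bbox, label)` triple that chainercv detection models consume. Below is a minimal, hypothetical sketch with dummy in-memory samples (the class name and data are illustrative, not from the repository; a real implementation would subclass `chainer.dataset.DatasetMixin` and read images and annotations from disk):

```python
import numpy as np

# Minimal, hypothetical detection dataset sketch. chainercv detection
# training code expects each example as:
#   img   - float32 array in CHW order,
#   bbox  - (R, 4) float32 array of (y_min, x_min, y_max, x_max),
#   label - (R,) int32 array of class indices.
class MyDetectionDataset:
    def __init__(self):
        # two dummy in-memory samples standing in for real image files
        self._samples = [
            (np.zeros((3, 32, 32), dtype=np.float32),
             np.array([[0., 0., 16., 16.]], dtype=np.float32),
             np.array([0], dtype=np.int32)),
            (np.zeros((3, 48, 48), dtype=np.float32),
             np.array([[8., 8., 40., 40.]], dtype=np.float32),
             np.array([1], dtype=np.int32)),
        ]

    def __len__(self):
        return len(self._samples)

    def get_example(self, i):
        # return one (img, bbox, label) triple by index
        return self._samples[i]


dataset = MyDetectionDataset()
img, bbox, label = dataset.get_example(0)
print(len(dataset), img.shape, bbox.shape, label.shape)
# → 2 (3, 32, 32) (1, 4) (1,)
```

With the class defined in your own package, `train_multi.py` can then construct it in place of the COCO dataset it currently loads.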
While training, I am getting this error: segmentation fault.