muditchaudhary closed this issue 5 years ago
I face the same issue when trying to train the model.
https://github.com/open-mmlab/mmdetection/issues/24
The above link deals with the same issue. The system currently has gcc 4.8.5 (from gcc --version). I will update it to a version above 5 and check whether the problem persists.
I took the following steps to fix the errors above, but I now get a different error.
conda install -c psi4 gcc-5  # Update to gcc 5
conda install -c anaconda libstdcxx-ng
After that, I installed PyTorch and mmdetection according to https://github.com/open-mmlab/mmdetection/blob/master/docs/INSTALL.md
Now, when I run dist_train.sh or test.py, I get the following error:
ImportError: /research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c1011CPUTensorIdEv
CUDA version: 10.1
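A note for later readers: the undefined symbol is a mangled C++ name, and reading it can hint at which library the extension expected to find. The sketch below is a minimal decoder for the simple `_ZN<length><name>...E` nested-name form only; it is not a full demangler (the `c++filt` tool from binutils handles the general case), and the function name `split_nested_name` is just an illustration.

```python
from typing import List


def split_nested_name(symbol: str) -> List[str]:
    """Split a simple Itanium-ABI nested name (_ZN<len><id>...E) into parts."""
    prefix = "_ZN"
    if not symbol.startswith(prefix):
        raise ValueError("not a simple nested mangled name")
    parts = []
    pos = len(prefix)
    # Each component is written as a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    return parts


# Prints: c10::CPUTensorId
print("::".join(split_nested_name("_ZN3c1011CPUTensorIdEv")))
```

The symbol lives in the `c10` namespace, which belongs to PyTorch's core library. An extension that cannot resolve a `c10` symbol was possibly compiled against a different PyTorch build than the one installed in the active environment.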
It looks like a CUDA version mismatch. Try reinstalling PyTorch with a different CUDA version (9.0) using conda.
Regards, Ran
I set up the environment with the following:
PyTorch: 1.1.0
CUDA version: 9.0
gcc: 4.8.5
I tried to train the model using: ./mmdetection/tools/dist_train.sh ./configs/reppoints_moment_r101_fpn_2x_mt.py 2 --validate
and received the following error:
from . import deform_conv_cuda
ImportError: /research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
The above error also appears in https://github.com/open-mmlab/mmdetection/issues/1554
Then I set up a separate environment with:
PyTorch: 1.1.0
CUDA version: 9.0
gcc: 5.4.0 # Supported by mmdetection according to INSTALL.md
(pytorch10) [mudit7@gpu38 RepPoints]$ conda list | grep cuda
cudatoolkit 9.0 h13b8566_0
pytorch 1.1.0 py3.7_cuda9.0.176_cudnn7.5.1_0 pytorch
(pytorch10) [mudit7@gpu38 RepPoints]$ gcc --version
gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc
I receive the same error again.
The detailed error log is below:
(pytorch10) [mudit7@gpu38 RepPoints]$ ./mmdetection/tools/dist_train.sh ./configs/reppoints_moment_r101_fpn_2x_mt.py 2 --validate
Traceback (most recent call last):
File "./mmdetection/tools/train.py", line 9, in <module>
from mmdet.apis import (get_root_logger, init_dist, set_random_seed,
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/__init__.py", line 2, in <module>
from .inference import (inference_detector, init_detector, show_result,
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/inference.py", line 10, in <module>
from mmdet.core import get_classes
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/core/__init__.py", line 6, in <module>
from .post_processing import * # noqa: F401, F403
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/core/post_processing/__init__.py", line 1, in <module>
from .bbox_nms import multiclass_nms
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/core/post_processing/bbox_nms.py", line 3, in <module>
from mmdet.ops.nms import nms_wrapper
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/__init__.py", line 2, in <module>
from .dcn import (DeformConv, DeformConvPack, DeformRoIPooling,
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/__init__.py", line 1, in <module>
from .deform_conv import (DeformConv, DeformConvPack, ModulatedDeformConv,
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv.py", line 9, in <module>
from . import deform_conv_cuda
ImportError: /research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
File "/research/byu2/mudit7/anaconda3/envs/pytorch10/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/research/byu2/mudit7/anaconda3/envs/pytorch10/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/research/byu2/mudit7/anaconda3/envs/pytorch10/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/research/byu2/mudit7/anaconda3/envs/pytorch10/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/research/byu2/mudit7/anaconda3/envs/pytorch10/bin/python', '-u', './mmdetection/tools/train.py', '--local_rank=0', './configs/reppoints_moment_r101_fpn_2x_mt.py', '--launcher', 'pytorch', '--validate']' returned non-zero exit status 1.
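The traceback fails while loading a compiled extension file (`deform_conv_cuda...so`). One possibility worth ruling out, and it is only a guess here, is that the `.so` files were built earlier against a different PyTorch and are being reused unchanged in the new environment. The sketch below lists compiled extension files with their modification times so that stale builds are easy to spot. The directory `mmdetection/mmdet/ops` is assumed from the paths in the traceback, and `list_compiled_extensions` is a hypothetical helper name.

```python
import time
from pathlib import Path


def list_compiled_extensions(root):
    """Return (path, last-modified timestamp) for every .so file under root."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return [(str(p), p.stat().st_mtime) for p in sorted(root_path.rglob("*.so"))]


if __name__ == "__main__":
    # Assumed location, taken from the traceback paths above.
    for path, mtime in list_compiled_extensions("mmdetection/mmdet/ops"):
        stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime))
        print(stamp, path)
```

If the timestamps predate the current environment, removing those files and rebuilding the extensions inside the active environment may be worth trying before changing compiler or CUDA versions again.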
Can you try these settings?
cudatoolkit 10.0.130 0
pytorch 1.1.0 py3.7_cuda10.0.130_cudnn7.5.1_0 pytorch
I just tested this version and it works well.
Make sure the CUDA path you are linking against points to the correct version.
Regards, Ran
I used the following settings:
(pytorch12) [mudit7@gpu38 RepPoints]$ conda list | grep cuda
cudatoolkit 10.0.130 0
pytorch 1.1.0 py3.7_cuda10.0.130_cudnn7.5.1_0 pytorch
gcc: 4.8.5
and got the following error:
(pytorch12) [mudit7@gpu38 RepPoints]$ ./mmdetection/tools/dist_train.sh ./configs/reppoints_moment_r101_fpn_2x_mt.py 2 --validate
2019-10-26 19:19:22,384 - INFO - Distributed training: True
2019-10-26 19:19:23,699 - INFO - load model from: modelzoo://resnet101
2019-10-26 19:19:28,773 - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
loading annotations into memory...
loading annotations into memory...
Done (t=26.10s)
creating index...
Done (t=26.50s)
creating index...
index created!
index created!
loading annotations into memory...
loading annotations into memory...
Done (t=0.78s)
creating index...
index created!
Done (t=0.72s)
creating index...
index created!
2019-10-26 19:19:59,408 - INFO - Start running, host: mudit7@gpu38.cse.cuhk.edu.hk, work_dir: /research/byu2/mudit7/FYP/RepPoints/work_dirs/reppoints_moment_r101_fpn_2x_mt
2019-10-26 19:19:59,408 - INFO - workflow: [('train', 1)], max: 24 epochs
Traceback (most recent call last):
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/research/byu2/mudit7/anaconda3/envs/pytorch12/bin/python', '-u', './mmdetection/tools/train.py', '--local_rank=0', './configs/reppoints_moment_r101_fpn_2x_mt.py', '--launcher', 'pytorch', '--validate']' died with <Signals.SIGSEGV: 11>.
I used the following settings (gcc update):
(pytorch12) [mudit7@gpu38 RepPoints]$ conda list | grep cuda
cudatoolkit 10.0.130 0
pytorch 1.1.0 py3.7_cuda10.0.130_cudnn7.5.1_0 pytorch
gcc: 5.4.0
and I get the same error as earlier.
Moreover, I noticed that RepPoints does not use the mmdetection master branch for its implementation, so that might also be causing problems.
@Lanselott, how can I check the CUDA link path and modify it if it's not correct?
I installed CUDA through conda using the cudatoolkit package.
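One way to inspect which CUDA installation a build is likely to pick up is to look at the usual environment variables, the `nvcc` found on `PATH`, and the target of the conventional `/usr/local/cuda` symlink. The sketch below is a rough diagnostic under those assumptions; `describe_cuda_paths` is a hypothetical helper, and the PyTorch part only runs if `torch` is importable. As far as I know, the conda `cudatoolkit` package ships runtime libraries only, so compiling extensions still relies on a system CUDA toolkit, which is why these paths matter.

```python
import os
import shutil


def describe_cuda_paths(default_link="/usr/local/cuda"):
    """Collect the places a build is likely to look for the CUDA toolkit."""
    info = {
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "CUDA_PATH": os.environ.get("CUDA_PATH"),
        "nvcc": shutil.which("nvcc"),
        "default_link": default_link,
        # Resolve the symlink so the real versioned directory is visible.
        "default_link_target": (
            os.path.realpath(default_link) if os.path.exists(default_link) else None
        ),
    }
    try:
        import torch
        from torch.utils.cpp_extension import CUDA_HOME

        info["torch.version.cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
        info["torch CUDA_HOME"] = CUDA_HOME  # toolkit path PyTorch would use for extensions
    except ImportError:
        info["torch.version.cuda"] = None
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_paths().items():
        print(f"{key}: {value}")
```

If the CUDA version PyTorch reports differs from the toolkit directory that `nvcc` or the symlink resolves to, setting `CUDA_HOME` to a matching toolkit before rebuilding the extensions is a common thing to try.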
I am trying to test the RepPoints model (using the checkpoints provided by the authors).
I am using the following command:
But I am getting a segmentation fault with the following output:
I also tried dist_train.sh but still get the segmentation fault. What could be causing this?
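A segmentation fault kills the interpreter before Python can print a traceback, so the log alone does not show where it happened. Python's standard `faulthandler` module can dump the Python-level stack when a fatal signal such as SIGSEGV arrives. The sketch below shows the idea; where to call it (for example near the top of `tools/train.py`) is an assumption, and `enable_crash_traceback` is a hypothetical wrapper name.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Dump Python tracebacks for all threads on SIGSEGV and similar fatal signals."""
    if stream is None:
        stream = sys.stderr
    # The stream must stay open for as long as the handler is enabled.
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Usage (assumed placement): call once at startup, before importing the
# compiled extensions, e.g.
#     enable_crash_traceback()
```

The same handler can also be switched on without editing any code by running the script with `python -X faulthandler`. The resulting stack should at least show which import or operator call was active when the crash occurred.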