Model evaluation and Training error

muditchaudhary commented 5 years ago

I am trying to test the RepPoints model (using the checkpoints provided by the authors).

I am using the following command:

python ./mmdetection/tools/test.py configs/reppoints_moment_r50_fpn_2x_mt.py reppoints_moment_r50_fpn_2x_mt.pth --out results.pkl --eval bbox

But I am getting a segmentation fault with the following output:

(pytorch) [mudit7@gpu39 RepPoints]$ python ./mmdetection/tools/test.py configs/reppoints_moment_r50_fpn_2x_mt.py reppoints_moment_r50_fpn_2x_mt.pth --out results.pkl --show 
loading annotations into memory...
Done (t=0.72s)
creating index...
index created!
[                                                  ] 0/5000, elapsed: 0s, ETA:Segmentation fault (core dumped)

I also tried dist_train.sh, but I still get the segmentation fault.

What could be causing this?

muditchaudhary commented 5 years ago

I face the same issue when trying to train the model.

https://github.com/open-mmlab/mmdetection/issues/24

The above link deals with the same issue. The system currently has gcc 4.8.5 (from gcc --version). I will try updating it to 5 or later and check whether the problem persists.
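
A small sketch for confirming which gcc is actually first on PATH after the update, since a conda-installed compiler and the system compiler can coexist. The helper names `gcc_major_version` and `active_gcc` are hypothetical, and the threshold of 5 is only the target mentioned in this thread, not a verified requirement.

```python
import re
import shutil
import subprocess


def gcc_major_version(version_output):
    """Return the major version found in the text printed by `gcc --version`."""
    match = re.search(r"(\d+)\.\d+(?:\.\d+)?", version_output)
    if match is None:
        raise ValueError("no version number found in: %r" % version_output)
    return int(match.group(1))


def active_gcc():
    """Return (path, major_version) for the gcc found first on PATH, or (None, None)."""
    path = shutil.which("gcc")
    if path is None:
        return None, None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return path, gcc_major_version(result.stdout)


if __name__ == "__main__":
    minimum_major = 5  # assumption: the target version discussed in this thread
    path, major = active_gcc()
    if path is None:
        print("gcc not found on PATH")
    else:
        status = "ok" if major >= minimum_major else "older than target"
        print("%s -> major version %d (%s)" % (path, major, status))
```

If the printed path is the system compiler rather than the one inside the conda environment, the extension build is probably still using the old gcc.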

muditchaudhary commented 5 years ago

I took the following steps to fix the above errors, but I now get a different error.

conda install -c psi4 gcc-5  # Update to gcc 5
conda install -c anaconda libstdcxx-ng

After that I installed pytorch and mmdetection according to https://github.com/open-mmlab/mmdetection/blob/master/docs/INSTALL.md

Now, when I run dist_train.sh or test.py, I get the following error:

ImportError: /research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c1011CPUTensorIdEv

CUDA version: 10.1
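
The undefined symbols reported in this thread are mangled C++ names. Decoding them shows which library they belong to. The sketch below handles only the plain nested-name form seen here (`_ZN`, length-prefixed components, then `E`); `demangle_nested_name` is a hypothetical helper, not a general demangler.

```python
def demangle_nested_name(symbol):
    """Decode a simple Itanium-ABI nested name such as _ZN3c1011CPUTensorIdEv.

    Templates, substitutions, and argument types are not decoded; anything
    after the closing 'E' is ignored.
    """
    prefix = "_ZN"
    if not symbol.startswith(prefix):
        raise ValueError("not a nested mangled name: %r" % symbol)
    pos = len(prefix)
    parts = []
    while pos < len(symbol) and symbol[pos].isdigit():
        # Read the decimal length, then take exactly that many characters.
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    if pos >= len(symbol) or symbol[pos] != "E":
        raise ValueError("unsupported mangling in %r" % symbol)
    return "::".join(parts)


if __name__ == "__main__":
    for name in (
        "_ZN3c1011CPUTensorIdEv",
        "_ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E",
    ):
        print(name, "->", demangle_nested_name(name))
```

Both names resolve to the `c10` and `caffe2` namespaces, which are part of PyTorch's own libraries. That usually suggests the extension was compiled against a different PyTorch build than the one being imported, though this is an inference and not something confirmed in the thread.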

Lanselott commented 5 years ago

It looks like a CUDA version error. Try reinstalling PyTorch with a different CUDA version (9.0) using conda.

Regards, Ran

muditchaudhary commented 5 years ago

I set up the environment with the following:

PyTorch: 1.1.0
CUDA version: 9.0
gcc: 4.8.5

I tried to train the model using: ./mmdetection/tools/dist_train.sh ./configs/reppoints_moment_r101_fpn_2x_mt.py 2 --validate and received the following error:

    from . import deform_conv_cuda
ImportError: /research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

The above error also appears in https://github.com/open-mmlab/mmdetection/issues/1554

Then I set up a separate environment with:

PyTorch: 1.1.0
CUDA version: 9.0
gcc: 5.4.0  # Supported by mmdetection according to INSTALL.md

(pytorch10) [mudit7@gpu38 RepPoints]$ conda list | grep cuda
cudatoolkit               9.0                  h13b8566_0  
pytorch                   1.1.0           py3.7_cuda9.0.176_cudnn7.5.1_0    pytorch

(pytorch10) [mudit7@gpu38 RepPoints]$ gcc --version
gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc

I receive the same error again.

The detailed error log is below:

(pytorch10) [mudit7@gpu38 RepPoints]$ ./mmdetection/tools/dist_train.sh ./configs/reppoints_moment_r101_fpn_2x_mt.py 2 --validate
Traceback (most recent call last):
  File "./mmdetection/tools/train.py", line 9, in <module>
    from mmdet.apis import (get_root_logger, init_dist, set_random_seed,
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/__init__.py", line 2, in <module>
    from .inference import (inference_detector, init_detector, show_result,
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/inference.py", line 10, in <module>
    from mmdet.core import get_classes
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/core/__init__.py", line 6, in <module>
    from .post_processing import *  # noqa: F401, F403
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/core/post_processing/__init__.py", line 1, in <module>
    from .bbox_nms import multiclass_nms
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/core/post_processing/bbox_nms.py", line 3, in <module>
    from mmdet.ops.nms import nms_wrapper
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/__init__.py", line 2, in <module>
    from .dcn import (DeformConv, DeformConvPack, DeformRoIPooling,
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/__init__.py", line 1, in <module>
    from .deform_conv import (DeformConv, DeformConvPack, ModulatedDeformConv,
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv.py", line 9, in <module>
    from . import deform_conv_cuda
ImportError: /research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
  File "/research/byu2/mudit7/anaconda3/envs/pytorch10/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/research/byu2/mudit7/anaconda3/envs/pytorch10/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/research/byu2/mudit7/anaconda3/envs/pytorch10/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/research/byu2/mudit7/anaconda3/envs/pytorch10/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/research/byu2/mudit7/anaconda3/envs/pytorch10/bin/python', '-u', './mmdetection/tools/train.py', '--local_rank=0', './configs/reppoints_moment_r101_fpn_2x_mt.py', '--launcher', 'pytorch', '--validate']' returned non-zero exit status 1.
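
One possible explanation, offered as an assumption rather than a diagnosis: the compiled `.so` files live inside the mmdetection source tree, so a freshly created conda environment may still import extensions that were built against the previous PyTorch. A minimal sketch for spotting such leftovers follows; `list_compiled_extensions` is a hypothetical helper, and the `mmdetection/mmdet/ops` path is taken from the traceback above.

```python
import time
from pathlib import Path


def list_compiled_extensions(root):
    """Return (path, mtime) pairs for every .so file under root, oldest first."""
    root = Path(root)
    if not root.is_dir():
        return []
    found = [(path, path.stat().st_mtime) for path in root.rglob("*.so")]
    return sorted(found, key=lambda item: item[1])


if __name__ == "__main__":
    # Path relative to the RepPoints checkout, as seen in the traceback.
    for path, mtime in list_compiled_extensions("mmdetection/mmdet/ops"):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(stamp, path)
```

If the timestamps predate the environment change, rebuilding the extensions inside the active environment, using the build step from INSTALL.md, would be the next thing to try.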

Lanselott commented 5 years ago

Can you try these settings:

cudatoolkit               10.0.130                      0
pytorch                   1.1.0           py3.7_cuda10.0.130_cudnn7.5.1_0    pytorch

Lanselott commented 5 years ago

I just tested this version and it works well.

Make sure the link path for your CUDA version is correct.

Regards, Ran

muditchaudhary commented 5 years ago

I used the following settings:

(pytorch12) [mudit7@gpu38 RepPoints]$ conda list | grep cuda
cudatoolkit               10.0.130                      0  
pytorch                   1.1.0           py3.7_cuda10.0.130_cudnn7.5.1_0    pytorch

gcc: 4.8.5

and got the following error:

(pytorch12) [mudit7@gpu38 RepPoints]$ ./mmdetection/tools/dist_train.sh ./configs/reppoints_moment_r101_fpn_2x_mt.py 2 --validate
2019-10-26 19:19:22,384 - INFO - Distributed training: True
2019-10-26 19:19:23,699 - INFO - load model from: modelzoo://resnet101
2019-10-26 19:19:28,773 - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

loading annotations into memory...
loading annotations into memory...
Done (t=26.10s)
creating index...
Done (t=26.50s)
creating index...
index created!
index created!
loading annotations into memory...
loading annotations into memory...
Done (t=0.78s)
creating index...
index created!
Done (t=0.72s)
creating index...
index created!
2019-10-26 19:19:59,408 - INFO - Start running, host: mudit7@gpu38.cse.cuhk.edu.hk, work_dir: /research/byu2/mudit7/FYP/RepPoints/work_dirs/reppoints_moment_r101_fpn_2x_mt
2019-10-26 19:19:59,408 - INFO - workflow: [('train', 1)], max: 24 epochs
Traceback (most recent call last):
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/research/byu2/mudit7/anaconda3/envs/pytorch12/bin/python', '-u', './mmdetection/tools/train.py', '--local_rank=0', './configs/reppoints_moment_r101_fpn_2x_mt.py', '--launcher', 'pytorch', '--validate']' died with <Signals.SIGSEGV: 11>.
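
The launcher only reports that the child died with SIGSEGV, not where. Python's built-in `faulthandler` can print the Python-level stack at the moment of the crash. The sketch below demonstrates the mechanism on a deliberately crashing one-liner; `run_with_faulthandler` is a hypothetical helper, and the crash command stands in for the real training script.

```python
import signal
import subprocess
import sys


def run_with_faulthandler(args):
    """Run Python with -X faulthandler; return (exit description, captured stderr)."""
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", *args],
        capture_output=True,
        text=True,
    )
    if proc.returncode < 0:
        # A negative return code means the child was terminated by a signal.
        description = "killed by " + signal.Signals(-proc.returncode).name
    else:
        description = "exit status %d" % proc.returncode
    return description, proc.stderr


if __name__ == "__main__":
    # Stand-in for a crashing script: the child sends SIGSEGV to itself.
    crash = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    description, stderr = run_with_faulthandler(["-c", crash])
    print(description)
    print(stderr)
```

For a job started through a launcher script, setting the environment variable `PYTHONFAULTHANDLER=1` before the command has the same effect as `-X faulthandler` and is inherited by the worker processes.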

I used the following settings (gcc update):

(pytorch12) [mudit7@gpu38 RepPoints]$ conda list | grep cuda
cudatoolkit               10.0.130                      0  
pytorch                   1.1.0           py3.7_cuda10.0.130_cudnn7.5.1_0    pytorch

gcc: 5.4.0

and I get the same error as earlier.

I also noticed that RepPoints does not use the mmdetection master branch for its implementation, which might also be causing problems.

muditchaudhary commented 5 years ago

@Lanselott, how can I check the link path of CUDA and modify it if it's not correct?

I installed it through conda using the cudatoolkit package.
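
A rough sketch of one way to inspect which CUDA locations a source build is likely to pick up. It relies on assumptions: `describe_cuda_paths` is a hypothetical helper, `/usr/local/cuda` is only the common default install location, and the torch section runs only if PyTorch is importable.

```python
import os
import shutil


def describe_cuda_paths():
    """Collect CUDA-related paths and versions visible from this environment."""
    info = {
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "CUDA_PATH": os.environ.get("CUDA_PATH"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
        "nvcc": shutil.which("nvcc"),
    }
    default = "/usr/local/cuda"  # assumption: the usual default symlink
    info[default] = os.path.realpath(default) if os.path.exists(default) else None
    try:
        import torch
        from torch.utils.cpp_extension import CUDA_HOME

        info["torch.__version__"] = torch.__version__
        # CUDA version the installed PyTorch binary was built with.
        info["torch.version.cuda"] = torch.version.cuda
        # Toolkit location PyTorch's extension builder would use.
        info["torch cpp_extension CUDA_HOME"] = CUDA_HOME
    except Exception as exc:  # torch missing or broken in this environment
        info["torch"] = "not usable here: %s" % exc
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_paths().items():
        print("%-32s %s" % (key, value))
```

If `torch.version.cuda` and the toolkit that `nvcc` or `CUDA_HOME` points to disagree, the extensions are probably being compiled with a different CUDA than the one PyTorch was built for. As far as I know, the conda `cudatoolkit` package provides runtime libraries but not `nvcc`, so the compiler usually comes from a system-wide install; exporting `CUDA_HOME` to the matching toolkit before rebuilding is the usual way to change it.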
