Open Spandan-Madan opened 6 years ago
@Spandan-Madan I recommend setting the batch size to 8.
If you set the batch size to 1, you should also change the parameter "--gpu 0,3,4" to "--gpu 0".
Hi @PkuRainBow, that's not the cause of the Segmentation fault. I tried running it with both the original parameters and the modified ones.
The error stays the same. Could you re-open the issue till it is solved please?
Thanks a lot, Spandan
I would really appreciate it if you could share more information, as it seems no one has reported such errors before.
It would be great if you could share the solution if you have solved this problem.
I had a similar issue and figured out that the reason was that my GCC version was not up to date (I'm using GCC 4.8). Specifically, PyTorch cpp_extensions (inplace_abn uses this feature) requires GCC 4.9 or higher. More info: pytorch/pytorch#6987
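A quick way to compare a compiler version against that minimum is sketched below. The helper names (`parse_version`, `gcc_is_new_enough`, `installed_gcc_version`) are made up for illustration, the 4.9 threshold is simply the one quoted above, and the last helper assumes `gcc` is on `PATH`.

```python
import subprocess

MIN_GCC = (4, 9)  # minimum quoted in this thread for PyTorch cpp_extensions


def parse_version(text):
    """Turn a version string such as '5.4.0' into a tuple of ints: (5, 4, 0)."""
    return tuple(int(part) for part in text.strip().split(".") if part.isdigit())


def gcc_is_new_enough(version, minimum=MIN_GCC):
    """Compare version tuples element-wise, e.g. (4, 8, 5) is below (4, 9)."""
    return version >= minimum


def installed_gcc_version(compiler="gcc"):
    """Ask the compiler for its version. Assumes `compiler` is on PATH."""
    out = subprocess.check_output([compiler, "-dumpversion"])
    return parse_version(out.decode())
```

Calling `gcc_is_new_enough(installed_gcc_version())` in the same environment that runs train.py shows which compiler the extension build would actually pick up.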
@fanyix Thanks for your help! @Spandan-Madan, I hope you have fixed this problem.
Hi @PkuRainBow and @fanyix, I realized that the error might be because of GCC version and already tried with the updated version of GCC. Error still persists.
Here are the relevant versions and error stack:-
GCC: 5.2.0
CUDA: 8.0
Pytorch: 0.4.1
Error stack:-
smadan@thousandeyes:/data/graphics/toyota-pytorch/OCNet$ python train.py --network "resnet101" --method "asp_oc_dsn" --random-mirror --random-scale --gpu 0,3,4 --batch-size 8 --snapshot-dir checkpoint/snapshots_resnet101_asp_oc_dsn_1e-2_5e-4_8_40000/ --num-steps 40000 --ohem False --data-list ./dataset/list/cityscapes/train.lst --weight-decay 5e-4 --input-size '769,769' --ohem-thres 0.7 --ohem-keep 0 --use-val False --use-weight True --restore-from ./pretrained_model/resnet101-imagenet.pth --start-iters 0 --learning-rate 1e-2 --use-extra False --dataset cityscapes_train --data-dir ./dataset/cityscapes
Traceback (most recent call last):
  File "train.py", line 27, in <module>
    from network import get_segmentation_model
  File "/data/graphics/toyota-pytorch/OCNet/network/__init__.py", line 1, in <module>
    from .resnet101_baseline import get_resnet101_baseline
  File "/data/graphics/toyota-pytorch/OCNet/network/resnet101_baseline.py", line 27, in <module>
    from resnet_block import conv3x3, Bottleneck
  File "/data/graphics/toyota-pytorch/OCNet/network/../utils/resnet_block.py", line 19, in <module>
    from bn import InPlaceABNSync
  File "/data/graphics/toyota-pytorch/OCNet/utils/../inplace_abn/bn.py", line 14, in <module>
    from functions import *
  File "/data/graphics/toyota-pytorch/OCNet/utils/../inplace_abn/functions.py", line 18, in <module>
    extra_cuda_cflags=["--expt-extended-lambda"])
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/test/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 494, in load
    with_cuda=with_cuda)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/test/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 670, in _jit_compile
    return _import_module_from_library(name, build_directory)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/test/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 753, in _import_module_from_library
    return imp.load_module(module_name, file, path, description)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/test/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/test/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: /tmp/torch_extensions/inplace_abn/inplace_abn.so: undefined symbol: _ZN2at5ErrorC1ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
The error seems to be coming from the inplace_abn modules. Could there be an error in how they are built?
Help greatly appreciated :)
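One thing that can be read directly off the missing symbol: it demangles to `at::Error::Error(at::SourceLocation, std::__cxx11::basic_string<char, ...>)`, and `__cxx11` is the namespace libstdc++ uses for its newer `std::string` ABI (GCC 5 and later). That hints, but does not prove, that the compiled extension and the installed PyTorch binary disagree on that ABI setting; nobody in this thread confirmed it. A small sketch for flagging such symbols (the function name is hypothetical):

```python
def uses_cxx11_abi(mangled_symbol):
    """Return True if an Itanium-mangled name refers to std::__cxx11 types.

    In the mangling, 'St' stands for 'std::' and '7__cxx11' is the
    length-prefixed namespace name, so the marker is 'St7__cxx11'.
    """
    return "St7__cxx11" in mangled_symbol


# The symbol reported in the ImportError above.
missing = ("_ZN2at5ErrorC1ENS_14SourceLocationENSt7__cxx11"
           "12basic_stringIcSt11char_traitsIcESaIcEEE")
print(uses_cxx11_abi(missing))
```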
@Spandan-Madan There is no need to build the inplace_abn modules manually.
PyTorch 0.4.1 supports such extensions without a separate build step.
@PkuRainBow Any leads what could be causing the error?
@Spandan-Madan Have you tried deleting /tmp/torch_extensions/inplace_abn and re-running? It might be caching a binary you compiled before.
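The same cleanup can be scripted. This is only a sketch: the function name is made up, and the default root is simply the directory that appears in the ImportError above, which may differ on other machines.

```python
import os
import shutil


def clear_extension_cache(name, root="/tmp/torch_extensions"):
    """Delete the cached build directory for one JIT-compiled extension.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    build_dir = os.path.join(root, name)
    if os.path.isdir(build_dir):
        shutil.rmtree(build_dir)
        return True
    return False
```

Calling `clear_extension_cache("inplace_abn")` before launching train.py forces the extension to be recompiled on the next import.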
@fanyix Just tried it. Deleted /tmp/torch_extensions/inplace_abn and ran again, got the exact same error.
For more reference, here's the command I'm trying to run right now:-
python train.py --network "resnet101" --method "asp_oc_dsn" \
--random-mirror --random-scale --gpu 0 --batch-size 1 \
--snapshot-dir checkpoint/snapshots_resnet101_asp_oc_dsn_1e-2_5e-4_8_40000/ \
--num-steps 40000 --ohem False --data-list ./dataset/list/cityscapes/train.lst \
--weight-decay 5e-4 --input-size '769,769' --ohem-thres 0.7 --ohem-keep 0 --use-val False \
--use-weight True --restore-from ./pretrained_model/resnet101-imagenet.pth \
--start-iters 0 --learning-rate 1e-2 --use-extra False \
--dataset cityscapes_train --data-dir ./dataset/cityscapes
@Spandan-Madan
--batch-size 1 is not supported by the parallel operation in my implementation.
It would be better to set --batch-size to 8 if you have access to 4 GPUs.
Besides, --batch-size should be larger than 1 if you choose sync-BN.
I guess the problem may be related to this setting.
@PkuRainBow Tried with 4 GPUs and --batch-size=8, same error.
The error happens way before any of this comes into play. It happens right at the top, on the line
from network import get_segmentation_model
So the batch-size issue isn't the reason (now also confirmed that it isn't the reason).
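To confirm that the failure is in the import itself and independent of any training flags, the import can be run in isolation. A minimal sketch, with a hypothetical helper name; note that it can only report Python-level exceptions such as the ImportError above, not a hard crash of the interpreter:

```python
import importlib
import traceback


def try_import(module_name):
    """Import a module by name; return (True, '') or (False, formatted traceback)."""
    try:
        importlib.import_module(module_name)
        return True, ""
    except Exception:  # e.g. the ImportError raised for an undefined symbol
        return False, traceback.format_exc()
```

Run from the OCNet root, `ok, err = try_import("network")` exercises the same import chain as line 27 of train.py without parsing any command-line arguments.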
@Spandan-Madan I recommend opening an issue in the inplace-abn project: https://github.com/mapillary/inplace_abn
I guess the authors of inplace-abn may help us solve the problem.
@PkuRainBow Yup, I opened an issue there :) Hopefully they can help out :)
@Spandan-Madan Hi, have you solved the problem? I came across the same error as you:
"/inplace_abn/inplace_abn.so: undefined symbol: _ZN2at5ErrorC1ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"
Nope, the issue persists. I spent almost a week trying to debug it but then ended up using Google's DeepLab codebase for my experiments.
If you find a solution please do let me know! It would be extremely helpful.
Best, Spandan
@Spandan-Madan @Liuyixuan95
Here are my environment settings:
GCC version 5.4.0 CUDA version 8.0 Pytorch: 0.4.1
@Spandan-Madan Hi, I solved the problem by replacing the 'inplace_abn' directory with the inplace-abn module from https://github.com/liutinglt/CE2P.
Best,
@PkuRainBow my environment settings: GCC version 4.9.2 CUDA version 8.0 Pytorch: 0.4.1
I think it's not because of environment settings. Thanks anyway!
I successfully built inplace-abn from https://github.com/liutinglt/CE2P
@Liuyixuan95 Great!
In fact, the authors of inplace-abn have updated their implementation multiple times.
It is great that you have solved the problem anyway.
Hope @Spandan-Madan can also solve this problem.
@Liuyixuan95 I still have the Segmentation fault.
@Spandan-Madan
I got the same fault before.
Thanks @Liuyixuan95.
I followed @Liuyixuan95's approach and used the inplace-abn module from https://github.com/liutinglt/CE2P in place of 'inplace_abn':
I used CE2P/modules
to replace OCNet/inplace_abn
and modified some code.
tree:
├── OCNet_modify
│   ├── checkpoint
│   ├── config
│   ├── dataset
│   ├── eval.py
│   ├── generate_submit.py
│   ├── LICENSE
│   ├── modules   <<---- replaces OCNet/inplace_abn and inplace_abn_03
│   ├── network
│   ├── oc_module
│   ├── oc_module.pdf
│   ├── OCNet_intro.jpg
│   ├── OCNet.png
│   ├── pretrained_model
│   ├── README.md
│   ├── requirements.txt
│   ├── run_resnet101_asp_oc.sh
│   ├── run_resnet101_baseline.sh
│   ├── run_resnet101_base_oc.sh
│   ├── train.py
│   └── utils
Note: the following OCNet files need to be modified:
network/resnet101_asp_oc.py
network/resnet101_base_oc.py
network/resnet101_baseline.py
network/resnet101_pyramid_oc.py
utils/resnet_block.py
Here is an example.
The old code may look like this:
if torch_ver == '0.4':
    sys.path.append(os.path.join(BASE_DIR, '../inplace_abn'))
    from bn import InPlaceABNSync
    BatchNorm2d = functools.partial(InPlaceABNSync, activation='none')
elif torch_ver == '0.3':
    sys.path.append(os.path.join(BASE_DIR, '../inplace_abn_03'))
    from modules import InPlaceABNSync
    BatchNorm2d = functools.partial(InPlaceABNSync, activation='none')
The modified code looks like this:
sys.path.append(os.path.abspath(os.path.join(BASE_DIR, '../modules')))
from bn import InPlaceABNSync
BatchNorm2d = functools.partial(InPlaceABNSync, activation='none')
Hope this helps.
@ackness, I tried to run it following your method, but there are still some problems. Could you send me a working version? QQ: 2232661644. Many thanks!
The model fails to do a forward pass in the train step. The error reported is just "Segmentation fault" :-
I added a bunch of print statements and saw that the error is happening in the step
I checked the GPU usage: there was over 11GB of GPU memory free when the error occurred, so it's not a memory issue. Also, when I first ran the .sh file, it reported errors because the directories for log/log_train and log_test had not been created. I created them manually, and that error was resolved. But now the forward pass fails in the very first iteration. Any leads?