openseg-group / OCNet.pytorch

Please use the openseg.pytorch project for the updated code that achieves SOTA on 6 benchmarks!
MIT License
812 stars, 128 forks

Segmentation fault #12

Open Spandan-Madan opened 6 years ago

Spandan-Madan commented 6 years ago

The model fails during the forward pass in the training step. The only error reported is "Segmentation fault":

dataset          cityscapes_train
batch_size       1
data_dir         ./dataset/cityscapes
data_list        ./dataset/list/cityscapes/train.lst
ignore_label     255
input_size       769,769
is_training      False
learning_rate    0.01
momentum         0.9
not_restore_last False
num_classes      19
start_iters      0
num_steps        40000
power            0.9
random_mirror    True
random_scale     True
random_seed      304
restore_from     ./pretrained_model/resnet101-imagenet.pth
save_num_images  2
save_pred_every  5000
snapshot_dir     checkpoint/snapshots_resnet101_asp_oc_dsn_1e-2_5e-4_8_40000/
weight_decay     0.0005
gpu              0,3,4
ohem_thres       0.7
ohem_thres1      0.8
ohem_thres2      0.5
use_weight       True
use_val          False
use_extra        False
ohem             False
ohem_keep        0
network          resnet101
method           asp_oc_dsn
reduce           True
ohem_single      False
use_parallel     False
dsn_weight       0.4
pair_weight      1
seed             304
output_path      ./seg_output_eval_set
store_output     False
use_flip         False
use_ms           False
predict_choice   whole
whole_scale      1
start_epochs     0
end_epochs       120
save_epoch       20
criterion        ce
eval             False
fix_lr           False
log_file         
use_normalize_transform False
/data/graphics/toyota-pytorch/OCNet/network/../oc_module/base_oc_block.py:69: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  nn.init.constant(self.W.weight, 0)
/data/graphics/toyota-pytorch/OCNet/network/../oc_module/base_oc_block.py:70: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  nn.init.constant(self.W.bias, 0)
/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py:24: UserWarning: 
    There is an imbalance between your GPUs. You may want to exclude GPU 3 which
    has less than 75% of the memory or cores of GPU 1. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
w/ class balance
41650 images are loaded!
learning_rate: 0.01
torch.Size([1, 3, 769, 769])
Segmentation fault

I added a bunch of print statements and saw that the error is happening in the step

preds = model(images)

I checked the GPU usage: there was over 11GB of GPU memory free when the error occurred, so it's not a memory issue. Also, when I first ran the .sh file, it reported errors because the directories log/log_train and log_test had not been created. I created them manually, and that error was resolved. But now the forward pass fails in the very first iteration. Any leads?
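
A bare "Segmentation fault" carries no Python traceback, which makes it hard to see which call crashed. One way to narrow it down, sketched here with only the standard library, is to run the suspect statement in a child interpreter with faulthandler switched on, so the crash is reported instead of killing the main process. The helper names `run_isolated` and `describe_exit` are made up for this illustration and are not part of OCNet.

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run `code` in a child interpreter with faulthandler enabled.

    Returns (returncode, stderr). The parent survives even if the child crashes.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stderr


def describe_exit(returncode):
    """On POSIX, a negative return code means the child was killed by that signal."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status {}".format(returncode)
```

For example, `run_isolated("from network import get_segmentation_model")`, run from the repository root, would show whether the crash already happens at import time; the stderr text then contains the Python-level stack that faulthandler dumped.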

PkuRainBow commented 6 years ago

@Spandan-Madan I recommend setting the batch size to 8.

If you set the batch size to 1, you should also change the parameter "--gpu 0,3,4" to "--gpu 0".

Spandan-Madan commented 6 years ago

Hi @PkuRainBow, that's not the cause of the segmentation fault. I tried running it with both the original parameters and the modified ones.

The error stays the same. Could you please re-open the issue until it is solved?

Thanks a lot, Spandan

PkuRainBow commented 6 years ago

I would really appreciate it if you could share more information, as it seems no one has reported such an error before.

It would be great if you could share the solution once you have solved this problem.

fanyix commented 6 years ago

I had a similar issue and figured out that the reason was that my GCC version was not up to date enough (I'm using GCC 4.8). Specifically, PyTorch cpp_extensions (inplace_abn uses this feature) requires GCC 4.9 or higher. More info: pytorch/pytorch#6987
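
A quick way to check the compiler against the GCC 4.9 minimum mentioned above is sketched below. It only looks at the `gcc` found on PATH; the extension builder may pick a different compiler (for example through the `CXX` environment variable), so treat the result as a first check rather than proof. The function names are illustrative.

```python
import re
import subprocess


def parse_gcc_version(text):
    """Extract (major, minor) from strings such as '4.8.5' or '7'."""
    match = re.search(r"(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number in {!r}".format(text))
    return int(match.group(1)), int(match.group(2) or 0)


def gcc_is_new_enough(version, minimum=(4, 9)):
    """Compare a (major, minor) tuple against the minimum quoted in this thread."""
    return tuple(version) >= tuple(minimum)


def installed_gcc_version(executable="gcc"):
    """Ask the compiler on PATH for its version via `gcc -dumpversion`."""
    out = subprocess.check_output(
        [executable, "-dumpversion"], universal_newlines=True
    )
    return parse_gcc_version(out)
```

Calling `gcc_is_new_enough(installed_gcc_version())` then gives a yes/no answer for the machine at hand.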

PkuRainBow commented 6 years ago

@fanyix Thanks for your help! @Spandan-Madan, I hope you have fixed this problem.

Spandan-Madan commented 6 years ago

Hi @PkuRainBow and @fanyix, I realized that the error might be caused by the GCC version and have already tried an updated version of GCC. The error still persists.

Here are the relevant versions and the error stack:

GCC: 5.2.0
CUDA: 8.0
Pytorch: 0.4.1

Error stack:-

smadan@thousandeyes:/data/graphics/toyota-pytorch/OCNet$ python train.py --network "resnet101" --method "asp_oc_dsn" --random-mirror --random-scale --gpu 0,3,4 --batch-size 8 --snapshot-dir checkpoint/snapshots_resnet101_asp_oc_dsn_1e-2_5e-4_8_40000/ --num-steps 40000 --ohem False --data-list ./dataset/list/cityscapes/train.lst --weight-decay 5e-4 --input-size '769,769' --ohem-thres 0.7 --ohem-keep 0 --use-val False --use-weight True --restore-from ./pretrained_model/resnet101-imagenet.pth --start-iters 0 --learning-rate 1e-2 --use-extra False --dataset cityscapes_train --data-dir ./dataset/cityscapes

Traceback (most recent call last):
  File "train.py", line 27, in <module>
    from network import get_segmentation_model
  File "/data/graphics/toyota-pytorch/OCNet/network/__init__.py", line 1, in <module>
    from .resnet101_baseline import get_resnet101_baseline
  File "/data/graphics/toyota-pytorch/OCNet/network/resnet101_baseline.py", line 27, in <module>
    from resnet_block import conv3x3, Bottleneck
  File "/data/graphics/toyota-pytorch/OCNet/network/../utils/resnet_block.py", line 19, in <module>
    from bn import InPlaceABNSync
  File "/data/graphics/toyota-pytorch/OCNet/utils/../inplace_abn/bn.py", line 14, in <module>
    from functions import *
  File "/data/graphics/toyota-pytorch/OCNet/utils/../inplace_abn/functions.py", line 18, in <module>
    extra_cuda_cflags=["--expt-extended-lambda"])
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/test/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 494, in load
    with_cuda=with_cuda)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/test/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 670, in _jit_compile
    return _import_module_from_library(name, build_directory)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/test/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 753, in _import_module_from_library
    return imp.load_module(module_name, file, path, description)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/test/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/test/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: /tmp/torch_extensions/inplace_abn/inplace_abn.so: undefined symbol: _ZN2at5ErrorC1ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

The error seems to be coming from the inplace_abn modules. Could there be an error in how they are built?

Help greatly appreciated :)
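
The undefined symbol in the ImportError above contains `__cxx11`, which is how libstdc++ tags `std::string` from its newer C++11 ABI. One possible reading, which nothing in this thread confirms, is that inplace_abn.so was compiled against a different string ABI than the installed PyTorch binary exports. The sketch below only detects that marker in a mangled name; `uses_cxx11_abi` is a made-up helper, not a PyTorch function.

```python
def uses_cxx11_abi(mangled_symbol):
    """True if an Itanium-mangled name mentions the std::__cxx11 namespace.

    'St' encodes 'std::' and '7__cxx11' is the length-prefixed name '__cxx11'.
    """
    return "St7__cxx11" in mangled_symbol


# The symbol reported in the ImportError above.
SYMBOL = (
    "_ZN2at5ErrorC1ENS_14SourceLocationE"
    "NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"
)
```

Here `uses_cxx11_abi(SYMBOL)` is true, so the extension is asking for an `at::Error` constructor that takes the new-ABI string type.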

PkuRainBow commented 6 years ago

@Spandan-Madan There is no need to build the inplace_abn modules.

PyTorch 0.4.1 supports the extensions without a separate build step.

Spandan-Madan commented 6 years ago

@PkuRainBow Any leads on what could be causing the error?

fanyix commented 6 years ago

@Spandan-Madan Have you tried deleting /tmp/torch_extensions/inplace_abn and re-running? It might be caching a binary you compiled before.
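
The cache cleanup suggested here can be scripted so it runs before each attempt. This is a minimal sketch, assuming the cache sits in `<root>/<name>` with the root defaulting to `$TORCH_EXTENSIONS_DIR` or `<tmp>/torch_extensions`, which matches the /tmp/torch_extensions/inplace_abn path in the traceback. Both function names are illustrative.

```python
import os
import shutil
import tempfile


def extension_cache_dir(name, root=None):
    """Guess where a JIT-built extension called `name` is cached (assumption:
    <root>/<name>, root from $TORCH_EXTENSIONS_DIR or <tmp>/torch_extensions)."""
    if root is None:
        root = os.environ.get("TORCH_EXTENSIONS_DIR") or os.path.join(
            tempfile.gettempdir(), "torch_extensions"
        )
    return os.path.join(root, name)


def clear_extension_cache(name, root=None):
    """Delete the cached build directory; return True if something was removed."""
    path = extension_cache_dir(name, root)
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False
```

With the defaults, `clear_extension_cache("inplace_abn")` removes the stale build so the next import recompiles from source.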

Spandan-Madan commented 6 years ago

@fanyix Just tried it. Deleted /tmp/torch_extensions/inplace_abn and ran again, got the exact same error.

Spandan-Madan commented 6 years ago

For more reference, here's the command I'm trying to run right now:-

python train.py --network "resnet101" --method "asp_oc_dsn" \
--random-mirror --random-scale --gpu 0 --batch-size 1 \
--snapshot-dir checkpoint/snapshots_resnet101_asp_oc_dsn_1e-2_5e-4_8_40000/ \
--num-steps 40000 --ohem False --data-list ./dataset/list/cityscapes/train.lst \
--weight-decay 5e-4 --input-size '769,769' --ohem-thres 0.7 --ohem-keep 0 --use-val False \
--use-weight True --restore-from ./pretrained_model/resnet101-imagenet.pth \
--start-iters 0 --learning-rate 1e-2 --use-extra False \
--dataset cityscapes_train --data-dir ./dataset/cityscapes

PkuRainBow commented 6 years ago

@Spandan-Madan

--batch-size 1 is not supported by the parallel operation in my implementation.

You should set --batch-size to 8 if you have access to 4 GPUs.

Besides, --batch-size should be larger than 1 if you use sync-BN.

I guess the problem may be related to this setting.
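
To see why the batch size and the GPU count interact, it helps to work out how many samples each GPU would receive. The sketch below assumes the batch is split along its first dimension in the style of `torch.chunk` (equal chunks of ceil(batch / GPUs), with a smaller last chunk); that is an assumption about the data-parallel wrapper, and the function names are illustrative.

```python
import math


def per_gpu_batch_sizes(batch_size, num_gpus):
    """Split a batch into chunks of ceil(batch_size / num_gpus); the last
    chunk may be smaller, and some GPUs may receive nothing at all."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def every_gpu_gets_more_than_one(batch_size, num_gpus):
    """True only if all GPUs are used and each sees at least two samples."""
    sizes = per_gpu_batch_sizes(batch_size, num_gpus)
    return len(sizes) == num_gpus and min(sizes) > 1
```

Under this assumption, a batch of 1 on three GPUs leaves two of them idle, while a batch of 8 on four GPUs gives each GPU two samples.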

Spandan-Madan commented 6 years ago

@PkuRainBow Tried with 4 GPUs and --batch-size=8, same error.

The error happens long before any of this comes into play. It happens right at the top, on the line

from network import get_segmentation_model

So the batch size isn't the cause (I have now confirmed this as well).

PkuRainBow commented 6 years ago

@Spandan-Madan I recommend that you open an issue in the inplace-abn project: https://github.com/mapillary/inplace_abn

I guess the authors of inplace-abn may help us solve the problem.

Spandan-Madan commented 6 years ago

@PkuRainBow Yup, I opened an issue there :) Hopefully they can help out :)

lyxlynn commented 6 years ago

Hi, have you solved the problem? I came across the same error as you:
/inplace_abn/inplace_abn.so: undefined symbol: _ZN2at5ErrorC1ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

lyxlynn commented 6 years ago

@Spandan-Madan Hi, have you solved the problem? I came across the same error as you.

"/inplace_abn/inplace_abn.so: undefined symbol: _ZN2at5ErrorC1ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"

Spandan-Madan commented 6 years ago

Nope, the issue persists. I spent almost a week trying to debug it but then ended up using Google's DeepLab codebase for my experiments.

If you find a solution please do let me know! It would be extremely helpful.

Best, Spandan

PkuRainBow commented 6 years ago

@Spandan-Madan @Liuyixuan95

Here are my environment settings:

GCC: 5.4.0
CUDA: 8.0
PyTorch: 0.4.1

lyxlynn commented 6 years ago

@Spandan-Madan Hi, I solved the problem by replacing the 'inplace_abn' directory with the inplace-abn module from https://github.com/liutinglt/CE2P.

Best,

lyxlynn commented 6 years ago

@PkuRainBow My environment settings:

GCC: 4.9.2
CUDA: 8.0
PyTorch: 0.4.1

I don't think the environment settings are the cause. Thanks anyway!
I successfully built inplace-abn from https://github.com/liutinglt/CE2P

PkuRainBow commented 6 years ago

@Liuyixuan95 Great!

In fact, the authors of inplace-abn have updated their implementation multiple times.

It is great that you have solved the problem anyway.

I hope @Spandan-Madan can also solve this problem.

jiangzhengkai commented 6 years ago

@Liuyixuan95 I still have the Segmentation fault.

ackness commented 5 years ago

@Spandan-Madan

@Liuyixuan95 I still have the Segmentation fault.

I got the same fault before. Thanks, @Liuyixuan95.
Following @Liuyixuan95's suggestion, I used the inplace-abn module from https://github.com/liutinglt/CE2P: I replaced OCNet/inplace_abn with CE2P/modules and modified some code.

tree:
├── OCNet_modify
│   ├── checkpoint
│   ├── config
│   ├── dataset
│   ├── eval.py
│   ├── generate_submit.py
│   ├── LICENSE
│   ├── modules   <<---- replaces OCNet/inplace_abn and inplace_abn_03
│   ├── network
│   ├── oc_module
│   ├── oc_module.pdf
│   ├── OCNet_intro.jpg
│   ├── OCNet.png
│   ├── pretrained_model
│   ├── README.md
│   ├── requirements.txt
│   ├── run_resnet101_asp_oc.sh
│   ├── run_resnet101_baseline.sh
│   ├── run_resnet101_base_oc.sh
│   ├── train.py
│   └── utils

Note: in OCNet, the following files should be modified:
network/resnet101_asp_oc.py
network/resnet101_base_oc.py
network/resnet101_baseline.py
network/resnet101_pyramid_oc.py
utils/resnet_block.py

Here is an example.

The old code may look like this:

if torch_ver == '0.4':
    sys.path.append(os.path.join(BASE_DIR, '../inplace_abn'))
    from bn import InPlaceABNSync
    BatchNorm2d = functools.partial(InPlaceABNSync, activation='none')

elif torch_ver == '0.3':
    sys.path.append(os.path.join(BASE_DIR, '../inplace_abn_03'))
    from modules import InPlaceABNSync
    BatchNorm2d = functools.partial(InPlaceABNSync, activation='none')    

The modified code looks like this:

sys.path.append(os.path.abspath(os.path.join(BASE_DIR, '../modules')))
from bn import InPlaceABNSync
BatchNorm2d = functools.partial(InPlaceABNSync, activation='none')

Hope this helps.
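
After a change like the one above, it can be worth confirming which `bn.py` Python will actually pick up. Because `sys.path.append` adds the new directory at the end of the search path, an earlier entry that also contains a `bn.py` (for instance an old inplace_abn directory appended by another file) would take precedence. The sketch below reports the file an import would resolve to without executing it; `resolve_module_file` is a made-up helper.

```python
import importlib
import importlib.util
import os
import sys


def resolve_module_file(module_name, extra_dir):
    """Append `extra_dir` to sys.path, as the modified snippet does, and
    return the file that `import module_name` would load (or None)."""
    extra_dir = os.path.abspath(extra_dir)
    if extra_dir not in sys.path:
        sys.path.append(extra_dir)
    # Make sure directories created after interpreter start-up are seen.
    importlib.invalidate_caches()
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin
```

For the layout in this comment, `resolve_module_file("bn", "modules")`, called from the repository root, should return a path inside the new modules directory; any other path suggests an older copy is shadowing it.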

iDzh commented 4 years ago

@ackness I tried to run it following your method, but there are still some problems. Could you send me a working version? QQ: 2232661644. Many thanks!