openseg-group / OCNet.pytorch

Please see the openseg.pytorch project for the updated code, which achieves SOTA on 6 benchmarks!
MIT License

Performance Discussion~ #22

Closed KeyKy closed 6 years ago

KeyKy commented 6 years ago

```
/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/utils/cpp_extension.py:118: UserWarning:

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 4.9 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.

See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
for instructions on how to install GCC 4.9 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                               !! WARNING !!

  warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
Traceback (most recent call last):
  File "/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 759, in _build_extension_module
    ['ninja', '-v'], stderr=subprocess.STDOUT, cwd=build_directory)
  File "/data00/kangyang/python3.6/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/data00/kangyang/python3.6/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 27, in <module>
    from network import get_segmentation_model
  File "/data00/kangyang/segmentation/OCNet/network/__init__.py", line 1, in <module>
    from .resnet101_baseline import get_resnet101_baseline
  File "/data00/kangyang/segmentation/OCNet/network/resnet101_baseline.py", line 27, in <module>
    from resnet_block import conv3x3, Bottleneck
  File "/data00/kangyang/segmentation/OCNet/network/../utils/resnet_block.py", line 19, in <module>
    from bn import InPlaceABNSync
  File "/data00/kangyang/segmentation/OCNet/utils/../inplace_abn/bn.py", line 14, in <module>
    from functions import *
  File "/data00/kangyang/segmentation/OCNet/utils/../inplace_abn/functions.py", line 16, in <module>
    extra_cuda_cflags=["--expt-extended-lambda"])
  File "/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 514, in load
    with_cuda=with_cuda)
  File "/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 682, in _jit_compile
    _build_extension_module(name, build_directory)
  File "/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 765, in _build_extension_module
    name, error.output.decode()))
RuntimeError: Error building extension 'inplace_abn':
[1/4] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=inplace_abn -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/TH -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/data00/kangyang/virtualenv3.6/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options '-fPIC' --expt-extended-lambda -std=c++11 -c /data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cuda.cu -o inplace_abn_cuda.cuda.o
FAILED: inplace_abn_cuda.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=inplace_abn -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/TH -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/data00/kangyang/virtualenv3.6/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options '-fPIC' --expt-extended-lambda -std=c++11 -c /data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cuda.cu -o inplace_abn_cuda.cuda.o
/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/ATen/Half-inl.h(17): error: identifier "__half_as_short" is undefined
1 error detected in the compilation of "/tmp/tmpxft_00011718_00000000-7_inplace_abn_cuda.cpp1.ii".
[2/4] c++ -MMD -MF inplace_abn_cpu.o.d -DTORCH_EXTENSION_NAME=inplace_abn -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/TH -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/data00/kangyang/virtualenv3.6/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++11 -O3 -c /data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cpu.cpp -o inplace_abn_cpu.o
/data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cpu.cpp: In function ‘std::vector backward_cpu(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, bool, float)’:
/data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cpu.cpp:82:41: warning: ‘at::Tensor at::empty(const at::Type&, at::IntList)’ is deprecated (declared at /data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/ATen/Functions.h:3521) [-Wdeprecated-declarations]
   auto dweight = at::empty(z.type(), {0});
/data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cpu.cpp:83:39: warning: ‘at::Tensor at::empty(const at::Type&, at::IntList)’ is deprecated (declared at /data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/ATen/Functions.h:3521) [-Wdeprecated-declarations]
   auto dbias = at::empty(z.type(), {0});
[3/4] c++ -MMD -MF inplace_abn.o.d -DTORCH_EXTENSION_NAME=inplace_abn -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/TH -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/data00/kangyang/virtualenv3.6/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++11 -O3 -c /data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn.cpp -o inplace_abn.o
ninja: build stopped: subcommand failed.
```

Could you help me with this error? What are your CUDA and gcc versions? I use CUDA 8.0 and gcc 4.9.2.

PkuRainBow commented 6 years ago

Please use PyTorch 0.4.1.

There is no need to build inplace-abn yourself. Besides, please open issues related to inplace-abn in the source repo (https://github.com/mapillary/inplace_abn) instead of here.
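A quick way to sanity-check the environment against the recommended 0.4.1 is to compare version strings numerically rather than lexically. A minimal pure-Python sketch (the `torch` import appears only in a comment, since it assumes PyTorch is installed):

```python
def parse_version(version):
    """Parse a version string like "0.4.1" or "0.4.1.post2" into a tuple of ints."""
    parts = []
    for piece in version.split("."):
        if piece.isdigit():
            parts.append(int(piece))
        else:
            break  # stop at non-numeric suffixes such as "post2"
    return tuple(parts)

# In a real environment you would compare against the installed build:
#   import torch
#   installed = parse_version(torch.__version__)
installed = parse_version("0.4.1")  # value assumed here for illustration
recommended = parse_version("0.4.1")

print(installed == recommended)  # → True
```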

KeyKy commented 6 years ago

@PkuRainBow My PyTorch is 0.4.1. I ran the script ./run_resnet101_asp_oc.sh and got this error. I cannot find the script run_asp_oc.sh.


Update: I solved it by upgrading to CUDA 9.0.

KeyKy commented 6 years ago

@PkuRainBow When I use 2 GPUs, pred = model(images) hangs. Is it because the cross-GPU BN only supports 4 GPUs?

PkuRainBow commented 6 years ago

@KeyKy I have tested with 2 images on a single GPU and it succeeded.

KeyKy commented 6 years ago

@PkuRainBow Thanks. I have 4 1080 Ti GPUs, so I can only run batch_size=1 per GPU, batch_size=4 in total. The script I ran is resnet101_asp_oc; my final metrics on the validation set are:

```
{'meanIU': 0.7837545863719357,
 'IU_array': array([0.98047586, 0.84426419, 0.92612712, 0.63912455, 0.62647073,
                    0.6292012 , 0.69047148, 0.78069628, 0.9247761 , 0.65401277,
                    0.94481553, 0.81283084, 0.63479937, 0.94911405, 0.81046328,
                    0.86166291, 0.75384313, 0.66243928, 0.76574848])}
```
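For reference, meanIU here is just the arithmetic mean of the 19 per-class IoU values in IU_array (Cityscapes has 19 evaluation classes), which can be verified with a quick pure-Python check:

```python
# Per-class IoU values copied from the result dict above (Cityscapes, 19 classes).
iu_array = [
    0.98047586, 0.84426419, 0.92612712, 0.63912455, 0.62647073,
    0.6292012,  0.69047148, 0.78069628, 0.9247761,  0.65401277,
    0.94481553, 0.81283084, 0.63479937, 0.94911405, 0.81046328,
    0.86166291, 0.75384313, 0.66243928, 0.76574848,
]

mean_iu = sum(iu_array) / len(iu_array)
print(round(mean_iu, 4))  # → 0.7838
```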

Is this value normal, or do I need 4 P100 GPUs with 16G memory to fully reproduce your result (81.2)? Is there any other way, since I don't have machines that good? Could we add each other as friends and get to know each other? My QQ is 370846270.

PkuRainBow commented 6 years ago

@KeyKy Batch size matters a lot; batch size = 8 can guarantee reproducing the results.

KeyKy commented 6 years ago

@PkuRainBow I noticed that https://github.com/zhuangyqin/DRN seems to reach 82.8 IoU classes on the leaderboard. I also noticed that you did not use the coarse data; using the coarse data should give some improvement, right?

PkuRainBow commented 6 years ago

@KeyKy Please use English for the discussion.

Besides, the huge memory cost is one of the main drawbacks of self-attention methods.

We are trying to propose a better method that largely reduces the memory cost while maintaining the performance.

KeyKy commented 6 years ago

@PkuRainBow Self-attention also increases the overall amount of computation. Maybe I could remove the self-attention computation and spend the same budget on additional convolution operations in the backbone network, and still get the same performance.

--

Reducing the memory cost would be great work, and so would reducing the amount of computation.

PkuRainBow commented 6 years ago

@KeyKy I do not agree with you.

KeyKy commented 6 years ago

@PkuRainBow I got 4 V100s and ran an experiment with batch-size=8 using ./run_resnet101_asp_oc.sh. The final results on the test set are:

| Metric | Value |
| --- | --- |
| IoU Classes | 77.232 |
| iIoU Classes | 54.5597 |
| IoU Categories | 90.129 |
| iIoU Categories | 77.602 |

Should I use OHEM to gain further improvement? I trained only on gtFine train, without adding gtFine val to the training set. Looking at the numbers in your paper, OCNet - ResNet-101 80.1 also uses only gtFine train, so I don't know where my training went wrong.

| Method | train mIoU(%) | val mIoU(%) | Note |
| --- | --- | --- | --- |
| ResNet101+ASP-OC | 85.72 | 79.58 | paper |
| ResNet101+ASP-OC | 85.7 | 78.88 | unmodified, running ./run_resnet101_asp_oc |

PkuRainBow commented 6 years ago

Hi, your results seem not as good as the numbers reported in the paper.

What is your PyTorch version? Besides, you should be able to achieve ~79.5 on the validation set without OHEM.

There are many possible reasons why the performance falls short of the numbers reported in the paper. Please check the details.

I will try my best to help you reproduce all of our numbers.

Besides, I guess there is probably one class such as train, bus, or truck that is very bad. In fact, the Cityscapes dataset is ill-posed in this respect. It is recommended to run the script several times if you meet such a case. Maybe you could share the exact mIoU for every class on the validation set.

KeyKy commented 6 years ago

@PkuRainBow The PyTorch version is 0.4.1. Results of the second run:

The metric of val:

```
{'meanIU': 0.7872904663594398,
 'IU_array': array([0.98359365, 0.86381204, 0.9273207 , 0.55040593, 0.63191286,
                    0.65154667, 0.7169406 , 0.80020197, 0.92748692, 0.67367246,
                    0.95001835, 0.82902815, 0.66007207, 0.95481035, 0.80353403,
                    0.85126102, 0.73285753, 0.66668435, 0.78335922])}
```

The metric of test:

| Metric | Value |
| --- | --- |
| IoU Classes | 78.6579 |
| iIoU Classes | 55.9391 |
| IoU Categories | 90.2223 |
| iIoU Categories | 78.1119 |

I think I got a better result than the one in the Readme.md (78.3)?


The Readme says:

> To further improve the performance, you can employ the CriterionOhemDSN_single by setting
>
> ```
> USE_OHEM=True
> OHEMTHRES=0.7
> OHEMKEEP=100000
> ```
>
> then you could expect to achieve ~80.4 mIoU on the validation set / ~79.0 mIoU on the testing set (single scale).

My question is: when should I set USE_OHEM=True, from the beginning (start_iter=0) or only after iter=40000?
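For context, with these settings OHEM computes the loss only over hard pixels: those whose predicted probability for the ground-truth class falls below OHEMTHRES, topped up to at least OHEMKEEP pixels. A minimal pure-Python sketch of that selection rule (illustrative only, not the repo's actual CriterionOhemDSN_single implementation):

```python
def ohem_select(gt_probs, thres=0.7, min_keep=100000):
    """Return indices of pixels kept for the loss under OHEM.

    gt_probs: per-pixel predicted probability of the ground-truth class.
    thres:    OHEMTHRES -- pixels below this probability count as "hard".
    min_keep: OHEMKEEP  -- always keep at least this many pixels.
    """
    # Sort pixel indices from hardest (lowest probability) to easiest.
    order = sorted(range(len(gt_probs)), key=lambda i: gt_probs[i])
    hard = [i for i in order if gt_probs[i] < thres]
    if len(hard) < min_keep:
        # Top up with the next-hardest pixels so at least min_keep survive.
        hard = order[:min(min_keep, len(gt_probs))]
    return hard

# Toy example with 6 pixels: indices 1, 3, 5 fall below the 0.7 threshold.
probs = [0.95, 0.2, 0.8, 0.5, 0.99, 0.65]
print(ohem_select(probs, thres=0.7, min_keep=2))  # → [1, 3, 5]
```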

PkuRainBow commented 6 years ago

@KeyKy Great!

I believe you can achieve much better performance with OHEM! I also recommend running the experiments more than 2 times to make sure the performance is reliable.