KeyKy closed this issue 6 years ago
Please use PyTorch 0.4.1.
There is no need to build inplace-abn separately. Also, please open issues related to inplace-abn in the upstream repo (https://github.com/mapillary/inplace_abn) instead of here.
@PkuRainBow My PyTorch version is 0.4.1. I ran the script ./run_resnet101_asp_oc.sh and got this error. I cannot find the script run_asp_oc.sh.
Update: I solved it by switching to CUDA 9.0.
@PkuRainBow When I use 2 GPUs, pred = model(images) hangs. Is it because the cross-GPU BN only supports 4 cards?
@KeyKy I have tested with 2 images on a single GPU and it succeeded.
@PkuRainBow Thanks. I have 4 GPUs (1080 Ti), and each card can only fit batch_size=1, so the total batch size is 4. I ran the resnet101_asp_oc script, and my result on the val set is:
{'meanIU': 0.7837545863719357, 'IU_array': array([0.98047586, 0.84426419, 0.92612712, 0.63912455, 0.62647073,0.6292012 , 0.69047148, 0.78069628, 0.9247761 , 0.65401277,0.94481553, 0.81283084, 0.63479937, 0.94911405, 0.81046328,0.86166291, 0.75384313, 0.66243928, 0.76574848])}
Is this value normal, or would I need 4 P100 16G GPUs to fully reproduce your result (81.2)? Is there any other way, since I don't have such good machines? Could we add each other as friends? My QQ is 370846270.
@KeyKy Batch size matters a lot; batch_size=8 should be enough to reproduce the results.
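For reference, the meanIU printed above is simply the mean of the per-class IU_array. A minimal sketch of how per-class IoUs are typically computed from a confusion matrix (an illustration only, not the repo's exact evaluation code):

```python
import numpy as np

def per_class_iou(conf_matrix):
    """IoU per class from a confusion matrix
    (rows = ground truth, cols = prediction): TP / (TP + FP + FN)."""
    tp = np.diag(conf_matrix)
    fp = conf_matrix.sum(axis=0) - tp  # predicted as class but wrong
    fn = conf_matrix.sum(axis=1) - tp  # ground truth missed
    return tp / np.maximum(tp + fp + fn, 1)

# Toy 2-class example.
conf = np.array([[50, 2],
                 [3, 45]])
iou = per_class_iou(conf)
print(iou, iou.mean())  # per-class IoU and the mean IoU
```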
@PkuRainBow I found that https://github.com/zhuangyqin/DRN seems to reach 82.8 IoU Classes on the leaderboard. I also noticed you did not use the coarse data; using the coarse data should give some improvement.
@KeyKy Please use English for the discussion.
Besides, the huge memory cost is one of the main drawbacks of self-attention methods.
We are trying to propose a better method that largely reduces the memory cost while maintaining the performance.
@PkuRainBow Self-attention also increases the overall amount of computation. Maybe I could cut the self-attention computation and spend it instead on extra convolution operations in the backbone network, and still get the same performance.
--
Reducing the memory cost would be great work, and so would reducing the amount of computation.
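To make the memory concern concrete: for a feature map of spatial size H x W, self-attention materializes an (H*W) x (H*W) similarity matrix, so memory grows quartically with resolution. A rough back-of-the-envelope sketch (illustrative numbers only; OCNet's actual layer shapes may differ):

```python
def attention_map_bytes(height, width, dtype_bytes=4):
    """Bytes needed to store one (H*W) x (H*W) float attention matrix."""
    n = height * width
    return n * n * dtype_bytes

# E.g. a 1/8-resolution feature map of a 769x769 Cityscapes crop
# is about 97x97, so a single attention map already needs ~338 MiB.
mb = attention_map_bytes(97, 97) / 1024**2
print(f"{mb:.1f} MiB per attention map")
```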
@KeyKy I do not agree with you.
@PkuRainBow I got 4 V100s and ran an experiment with batch_size=8 using ./run_resnet101_asp_oc.sh. The final results on the test set are:
Metric | Value |
---|---|
IoU Classes | 77.232 |
iIoU Classes | 54.5597 |
IoU Categories | 90.129 |
iIoU Categories | 77.602 |
Should I use OHEM to push the score further? I only trained on gtFine train, without adding gtFine val to the training set. According to the metrics in your paper, OCNet - ResNet-101 reaches 80.1 while also using only gtFine train, so I don't know where my training went wrong.
Method | train mIoU(%) | val mIoU(%) |
---|---|---|
ResNet101+ASP-OC | 85.72 | 79.58 (paper) |
ResNet101+ASP-OC | 85.7 | 78.88 (unmodified, ./run_resnet101_asp_oc) |
Hi, your results seem not as good as the numbers reported in the paper.
What is your PyTorch version? Besides, you should be able to achieve ~79.5 on the validation set without OHEM.
There are many possible reasons why the performance falls short of the numbers reported in the paper. Please check the details.
I will try my best to help you reproduce all of our numbers.
Besides, I guess there is probably one class such as train, bus, or truck that performs very badly. In fact, the Cityscapes dataset is ill-posed in this respect. It is recommended to run the script multiple times if you meet such a case. Maybe you could share the exact mIoU for every class on the validation set.
@PkuRainBow The PyTorch version is 0.4.1. Here is the second run.
The metrics on val:
{'meanIU': 0.7872904663594398, 'IU_array': array([0.98359365, 0.86381204, 0.9273207 , 0.55040593, 0.63191286, 0.65154667, 0.7169406 , 0.80020197, 0.92748692, 0.67367246, 0.95001835, 0.82902815, 0.66007207, 0.95481035, 0.80353403,0.85126102, 0.73285753, 0.66668435, 0.78335922])}
The metric of test:
Metric | Value |
---|---|
IoU Classes | 78.6579 |
iIoU Classes | 55.9391 |
IoU Categories | 90.2223 |
iIoU Categories | 78.1119 |
I think I got a better result than the one in README.md (78.3)?
To further improve the performance, you can employ the CriterionOhemDSN_single by setting
USE_OHEM=True
OHEMTHRES=0.7
OHEMKEEP=100000
Then you could expect to achieve ~80.4 mIoU on the validation set and ~79.0 mIoU on the test set (single scale). My question: when should USE_OHEM=True be set, from the beginning (start_iter=0) or after iter=40000?
@KeyKy Great!
I believe you can achieve much better performance with OHEM! I also recommend running the experiments more than twice to make sure a result is reliable.
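The OHEM flags above select hard pixels by prediction confidence: a pixel is kept if the predicted probability of its ground-truth class is below OHEMTHRES, but at least OHEMKEEP pixels always survive. A minimal sketch of this kind of hard-pixel mining loss (an illustration only; the repo's CriterionOhemDSN_single differs in detail):

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, target, thresh=0.7, min_kept=100000,
                       ignore_index=255):
    """Pixel-wise cross entropy with online hard example mining.

    Keeps pixels whose ground-truth-class probability is below `thresh`,
    but always keeps at least `min_kept` pixels so gradients never vanish.
    """
    # Per-pixel loss, no reduction yet.
    pixel_losses = F.cross_entropy(
        logits, target, ignore_index=ignore_index, reduction='none'
    ).view(-1)

    # Probability assigned to the ground-truth class at each pixel.
    probs = F.softmax(logits, dim=1)
    tgt = target.clone()
    tgt[tgt == ignore_index] = 0  # placeholder index for ignored pixels
    gt_probs = probs.gather(1, tgt.unsqueeze(1)).view(-1)

    valid = target.view(-1) != ignore_index
    gt_probs = gt_probs[valid]
    pixel_losses = pixel_losses[valid]
    if gt_probs.numel() == 0:
        return pixel_losses.sum()

    # Raise the cutoff if needed so at least min_kept pixels survive.
    sorted_probs, _ = gt_probs.sort()
    k = min(min_kept, sorted_probs.numel()) - 1
    cutoff = max(thresh, sorted_probs[k].item())
    hard = gt_probs <= cutoff
    return pixel_losses[hard].mean()
```

With thresh=1.0 and min_kept covering all pixels this degenerates to plain mean cross entropy, which is a handy sanity check when wiring it into a training script.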
/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/utils/cpp_extension.py:118: UserWarning:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Your compiler (c++) may be ABI-incompatible with PyTorch! Please use a compiler that is ABI-compatible with GCC 4.9 and above. See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.
See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6 for instructions on how to install GCC 4.9 or higher. !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
Traceback (most recent call last):
  File "/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 759, in _build_extension_module
    ['ninja', '-v'], stderr=subprocess.STDOUT, cwd=build_directory)
  File "/data00/kangyang/python3.6/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/data00/kangyang/python3.6/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 27, in <module>
    from network import get_segmentation_model
  File "/data00/kangyang/segmentation/OCNet/network/__init__.py", line 1, in <module>
    from .resnet101_baseline import get_resnet101_baseline
  File "/data00/kangyang/segmentation/OCNet/network/resnet101_baseline.py", line 27, in <module>
    from resnet_block import conv3x3, Bottleneck
  File "/data00/kangyang/segmentation/OCNet/network/../utils/resnet_block.py", line 19, in <module>
    from bn import InPlaceABNSync
  File "/data00/kangyang/segmentation/OCNet/utils/../inplace_abn/bn.py", line 14, in <module>
    from functions import *
  File "/data00/kangyang/segmentation/OCNet/utils/../inplace_abn/functions.py", line 16, in <module>
    extra_cuda_cflags=["--expt-extended-lambda"])
  File "/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 514, in load
    with_cuda=with_cuda)
  File "/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 682, in _jit_compile
    _build_extension_module(name, build_directory)
  File "/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 765, in _build_extension_module
    name, error.output.decode()))
RuntimeError: Error building extension 'inplace_abn': [1/4] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=inplace_abn -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/TH -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/data00/kangyang/virtualenv3.6/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options '-fPIC' --expt-extended-lambda -std=c++11 -c /data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cuda.cu -o inplace_abn_cuda.cuda.o
FAILED: inplace_abn_cuda.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=inplace_abn -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/TH -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/data00/kangyang/virtualenv3.6/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options '-fPIC' --expt-extended-lambda -std=c++11 -c /data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cuda.cu -o inplace_abn_cuda.cuda.o
/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/ATen/Half-inl.h(17): error: identifier "__half_as_short" is undefined
1 error detected in the compilation of "/tmp/tmpxft_00011718_00000000-7_inplace_abn_cuda.cpp1.ii".
[2/4] c++ -MMD -MF inplace_abn_cpu.o.d -DTORCH_EXTENSION_NAME=inplace_abn -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/TH -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/data00/kangyang/virtualenv3.6/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++11 -O3 -c /data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cpu.cpp -o inplace_abn_cpu.o
/data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cpu.cpp: In function ‘std::vector backward_cpu(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, bool, float)’:
/data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cpu.cpp:82:41: warning: ‘at::Tensor at::empty(const at::Type&, at::IntList)’ is deprecated (declared at /data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/ATen/Functions.h:3521) [-Wdeprecated-declarations]
auto dweight = at::empty(z.type(), {0});
^
/data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn_cpu.cpp:83:39: warning: ‘at::Tensor at::empty(const at::Type&, at::IntList)’ is deprecated (declared at /data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/ATen/Functions.h:3521) [-Wdeprecated-declarations]
auto dbias = at::empty(z.type(), {0});
^
[3/4] c++ -MMD -MF inplace_abn.o.d -DTORCH_EXTENSION_NAME=inplace_abn -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/TH -I/data00/kangyang/virtualenv3.6/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/data00/kangyang/virtualenv3.6/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++11 -O3 -c /data00/kangyang/segmentation/OCNet/inplace_abn/src/inplace_abn.cpp -o inplace_abn.o
ninja: build stopped: subcommand failed.
Could you help me with this error? What are your CUDA and gcc versions? I use CUDA 8.0 and gcc 4.9.2.