rinongal / StyleGAN-nada

http://stylegan-nada.github.io/
MIT License
1.16k stars, 146 forks

_run_ninja_build fails on PyTorch 1.7 but succeeds on PyTorch 1.4 #29

jeffryWillam closed 2 years ago

jeffryWillam commented 2 years ago

Hi, thanks for your excellent work! It is really instructive! However, I ran into a strange compilation problem while training the network. It reports "subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1" while building the extension 'fused'. This only happens on PyTorch 1.7.1; nothing goes wrong with torch 1.4 (but the CLIP models seem to require torch 1.7). I have been stuck on this for many days and still cannot find a solution 😀. Any suggestions? BTW, torch 1.7 is installed using pip rather than conda; does this matter? Many thanks for the help 🤪. The environment I use is as follows: Ubuntu 20.14, PyTorch 1.7.1, torchvision 0.8.2, torchaudio 0.7.2, CUDA 10.1, ninja 1.8.2.

rinongal commented 2 years ago

Hi,

Installing with pip rather than conda should not be an issue. The problem might be mismatched versions, or ninja failing to find your CUDA library path.
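
To narrow down the two suspects above (version mismatch or an unresolved CUDA path), something like the following could be run first. This is only a minimal sketch: `collect_build_env` is a hypothetical helper name, and it assumes the build picks up `CUDA_HOME` from the environment and `nvcc`/`ninja` from `PATH`. The `torch` import is optional so the script still runs if the install is broken.

```python
import os
import shutil
import subprocess
from importlib import util


def collect_build_env():
    """Gather facts relevant to a failing JIT extension build (hypothetical helper)."""
    info = {
        "CUDA_HOME": os.environ.get("CUDA_HOME"),  # may be unset
        "nvcc": shutil.which("nvcc"),              # None if not on PATH
        "ninja": shutil.which("ninja"),            # None if not on PATH
        "torch": None,
        "torch_cuda": None,
        "nvcc_version": None,
    }
    # Only import torch if it is installed; tolerate a broken install.
    if util.find_spec("torch") is not None:
        try:
            import torch
            info["torch"] = torch.__version__
            info["torch_cuda"] = torch.version.cuda  # CUDA version torch was built with
        except Exception as exc:
            info["torch"] = f"import failed: {exc}"
    if info["nvcc"]:
        # Ask the toolkit compiler itself which release it belongs to.
        result = subprocess.run(
            [info["nvcc"], "--version"], capture_output=True, text=True
        )
        info["nvcc_version"] = result.stdout.strip()
    return info


if __name__ == "__main__":
    for key, value in collect_build_env().items():
        print(f"{key}: {value}")
```

If `torch_cuda` and the release reported by `nvcc` differ, or `nvcc`/`ninja` come back as `None`, that points at the environment rather than the repository code.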

Could you please provide me with the full error stack?

jeffryWillam commented 2 years ago

Thanks for your help. The traceback is as follows (process.poll() in subprocess.py returns 1 on torch 1.7 but 0 on torch 1.4) 😀:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1533, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 991, in _find_and_load
  File "", line 975, in _find_and_load_unlocked
  File "", line 671, in _load_unlocked
  File "", line 848, in exec_module
  File "", line 219, in _call_with_frames_removed
  File "/home/wanglin/PycharmProjects/StyleGan/content/stylegan_nada/ZSSGAN/model/ZSSGAN.py", line 14, in <module>
    from ZSSGAN.model.sg2_model import Generator, Discriminator
  File "/home/wanglin/PycharmProjects/StyleGan/content/stylegan_nada/ZSSGAN/model/sg2_model.py", line 11, in <module>
    from op import FusedLeakyReLU, fused_leaky_relu, upfirdn2d, conv2d_gradfix
  File "/home/wanglin/PycharmProjects/StyleGan/content/stylegan_nada/ZSSGAN/op/__init__.py", line 1, in <module>
    from .fused_act import FusedLeakyReLU, fused_leaky_relu
  File "/home/wanglin/PycharmProjects/StyleGan/content/stylegan_nada/ZSSGAN/op/fused_act.py", line 11, in <module>
    fused = load(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 986, in load
    return _jit_compile(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1193, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1297, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused': [1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++14 -c /home/wanglin/PycharmProjects/StyleGan/content/stylegan_nada/ZSSGAN/op/fused_bias_act_kernel.cu -o fused_bias_act_kernel.cuda.o
FAILED: fused_bias_act_kernel.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++14 -c /home/wanglin/PycharmProjects/StyleGan/content/stylegan_nada/ZSSGAN/op/fused_bias_act_kernel.cu -o fused_bias_act_kernel.cuda.o
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134:   required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6688:95:   required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’ without object
       __p->_M_set_sharable();


/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134:   required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6693:95:   required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’ without object
ninja: build stopped: subcommand failed.
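
Most of the log above is template instantiation context; the lines that matter are the ones containing `error:`. A rough sketch for pulling those out of a saved build log is shown here. It assumes the GCC-style `file:line:col: error: message` layout seen above, and `compiler_errors` is a hypothetical helper name.

```python
import re

# GCC/nvcc host-compiler diagnostics look like "path:line:col: error: message".
ERROR_LINE = re.compile(
    r"^(?P<file>[^\s:]+):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.*)$"
)


def compiler_errors(log):
    """Return (file, line, message) tuples for every 'error:' line in a build log."""
    errors = []
    for raw in log.splitlines():
        match = ERROR_LINE.match(raw.strip())
        if match:
            errors.append((match["file"], int(match["line"]), match["msg"]))
    return errors


if __name__ == "__main__":
    import sys

    # Usage: python find_errors.py < build.log
    for path, line, msg in compiler_errors(sys.stdin.read()):
        print(f"{path}:{line}: {msg}")
```

Applied to this log, it should leave only the two `_M_set_sharable()` errors in `basic_string.tcc`, which is the part worth searching for.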

rinongal commented 2 years ago

As far as I can see (e.g. from here and here), this seems to be a problem with specific CUDA 10.1 versions.

Things I'd try, depending on how much you're willing to mess with your CUDA installation or whether you'd rather solve it with code:

1) Upgrade to a newer version of CUDA 10.1 (seems to be fixed in 10.1.168).
2) Upgrade to CUDA 10.2.
3) Work inside the docker we provide in the readme; it has all the relevant packages installed.
4) Replace the implementation of the sg2 model in ZSSGAN/model/ with the version you can find in the StyleCLIP repo. They use a modified, native-pytorch implementation which doesn't use any of the StyleGAN2 CUDA kernels. Things will run a bit (~15%) slower, but you won't be compiling any new operations and won't have this ninja issue.
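
Before picking between the first two options, it may help to check which exact 10.1 build is installed. The sketch below parses the text printed by `nvcc --version` and compares it against the 10.1.168 threshold mentioned above. The function names are hypothetical, and the sample string only illustrates the assumed "release X.Y, VX.Y.Z" output format.

```python
import re

# Threshold taken from the suggestion above: the issue seems fixed in 10.1.168.
FIXED_IN = (10, 1, 168)


def parse_nvcc_version(text):
    """Extract (major, minor, patch) from `nvcc --version` output, or None."""
    match = re.search(r"release \d+\.\d+, V(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_affected(version):
    """True for CUDA 10.1 builds older than the one reported as fixed."""
    return version[:2] == FIXED_IN[:2] and version < FIXED_IN


if __name__ == "__main__":
    # Illustrative sample only; paste the real output of `nvcc --version` instead.
    sample = "Cuda compilation tools, release 10.1, V10.1.105"
    version = parse_nvcc_version(sample)
    print(version, "affected" if is_affected(version) else "not affected")
```

A result of "affected" suggests trying option 1 or 2; otherwise the cause is probably elsewhere and options 3 or 4 are the safer route.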

jeffryWillam commented 2 years ago

Thanks for your suggestions! I will try them one by one 😀. Once it is resolved, I will post an update here.

jeffryWillam commented 2 years ago

Problem solved after updating CUDA 10.1 to 10.1.168. It works 😀, thanks for your help!

rinongal commented 2 years ago

Happy to help!

Closing as resolved. Feel free to open a new issue if you need additional help.