pytorch / extension-cpp

C++ extensions in PyTorch

Compiler error /cuda/setup.py #27

Open puria-izady opened 5 years ago

puria-izady commented 5 years ago

Hello,

The compilation of setup.py in cpp/ succeeds, but for cuda/setup.py I get the following compile error. I would therefore like to ask whether you have an idea what my mistake could be.

Best regards

System:

Error log:

running install
running bdist_egg
running egg_info
writing lltm_cuda.egg-info/PKG-INFO
writing dependency_links to lltm_cuda.egg-info/dependency_links.txt
writing top-level names to lltm_cuda.egg-info/top_level.txt
reading manifest file 'lltm_cuda.egg-info/SOURCES.txt'
writing manifest file 'lltm_cuda.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
building 'lltm_cuda' extension
gcc -pthread -B /pizady/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/pizady/anaconda3/lib/python3.6/site-packages/torch/lib/include -I/pizady/anaconda3/lib/python3.6/site-packages/torch/lib/include/torch/csrc/api/include -I/pizady/anaconda3/lib/python3.6/site-packages/torch/lib/include/TH -I/pizady/anaconda3/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/pizady/anaconda3/include/python3.6m -c lltm_cuda.cpp -o build/temp.linux-x86_64-3.6/lltm_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=lltm_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from lltm_cuda.cpp:1:0:
/pizady/anaconda3/lib/python3.6/site-packages/torch/lib/include/torch/csrc/api/include/torch/torch.h:7:2: warning: #warning "Including torch/torch.h for C++ extensions is deprecated. Please include torch/extension.h" [-Wcpp]
 #warning \
  ^~~~~~~
/usr/local/cuda/bin/nvcc -I/pizady/anaconda3/lib/python3.6/site-packages/torch/lib/include -I/pizady/anaconda3/lib/python3.6/site-packages/torch/lib/include/torch/csrc/api/include -I/pizady/anaconda3/lib/python3.6/site-packages/torch/lib/include/TH -I/pizady/anaconda3/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/pizady/anaconda3/include/python3.6m -c lltm_cuda_kernel.cu -o build/temp.linux-x86_64-3.6/lltm_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=lltm_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
lltm_cuda_kernel.cu(54): error: calling a __host__ function("std::fmax<double, float> ") from a __global__ function("_NV_ANON_NAMESPACE::lltm_cuda_forward_kernel<float> ") is not allowed

lltm_cuda_kernel.cu(54): error: identifier "std::fmax<double, float> " is undefined in device code

2 errors detected in the compilation of "/tmp/tmpxft_00000f0c_00000000-6_lltm_cuda_kernel.cpp1.ii".
goldsborough commented 5 years ago

I don't think you made any mistake.

So, for the warning:

Please include torch/extension.h
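
A minimal sketch of that change (my assumption is that the deprecated include sits at the top of lltm_cuda.cpp, as the log above suggests):

```cpp
// lltm_cuda.cpp -- swap the deprecated header for the extension header
// #include <torch/torch.h>   // old include; this is what triggers the -Wcpp warning
#include <torch/extension.h>  // pulls in ATen, pybind11 and everything a C++/CUDA extension needs
```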

For the error, this has been asked a few times: https://github.com/pytorch/extension-cpp/issues?utf8=%E2%9C%93&q=is%3Aissue+fmax

I think the consensus was this is an environment error, and the best solution is to build PyTorch from source

dedoogong commented 5 years ago

No, it is because of the CUDA API; it has no relevance to PyTorch. Just cast the second argument to (double). That's the best solution.
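
A minimal sketch of that suggestion, assuming the offending line 54 is the fmax(0.0, z) call in the elu helper of lltm_cuda_kernel.cu (the exact line may differ in your checkout):

```cuda
// With scalar_t = float, fmax(0.0, z) mixes double and float and resolves to the
// host-only std::fmax<double, float>, which nvcc rejects inside device code.
// Casting the second argument to double makes both arguments double, so the
// device overload fmax(double, double) is selected instead.
template <typename scalar_t>
__device__ __forceinline__ scalar_t elu(scalar_t z, scalar_t alpha = 1.0) {
  return fmax(0.0, (double) z) + fmin(0.0, alpha * (exp(z) - 1.0));
}
```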

ClementPinard commented 5 years ago

Got the same error here.

Ubuntu 16.04
Cuda 10.0
Pytorch 1.1.0a0+7e73783 (built from source)
python 3.7

The solution from #21 seems to work, though. Discussion in #15 also hints that casting to scalar_t might actually be the thing to do if numbers are implicitly cast to double.

Normally I would add the (scalar_t) cast and move on, but I wanted to submit a PR (see #31) and cannot build on a clean workspace.

Any hints on what to do? I could actually build before (last summer), but since then I have updated my Python version, along with CUDA (and of course PyTorch). I might try a Docker build to get a perfectly clean install, but if the problem is common enough, maybe we can add this cast on fmax (and fmin; casting everything to scalar_t is better than casting everything to double).
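
For reference, a minimal sketch of the scalar_t variant discussed above (same hypothetical elu helper as before, so treat the exact placement as an assumption):

```cuda
// Keeping every operand in scalar_t avoids the host-only std::fmax<double, float>
// without promoting float32 tensors to double: for scalar_t = float the call
// resolves to the device overload fmax(float, float), and for scalar_t = double
// to fmax(double, double).
template <typename scalar_t>
__device__ __forceinline__ scalar_t elu(scalar_t z, scalar_t alpha = 1.0) {
  return fmax(scalar_t(0.0), z) + fmin(scalar_t(0.0), alpha * (exp(z) - scalar_t(1.0)));
}
```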

ClementPinard commented 5 years ago

After some investigation, it seems related to the gcc version. I originally tested with gcc-7 and it didn't work. I changed to gcc-5 with a simple update-alternatives and now it works. PyTorch itself was compiled from source with gcc-7.

Any idea what might have changed from gcc-5 to gcc-7?

soumith commented 5 years ago

I reproduced this on Docker today and fixed the issue with this commit: https://github.com/pytorch/extension-cpp/commit/1031028f3b048fdea84372f3b81498db53d64d98

ClementPinard commented 5 years ago

Hi, thanks for the commit! Unfortunately, I believe fminf and fmaxf implicitly cast everything to float32. As a consequence, check.py and grad_check.py are now broken with CUDA, because the precision is not sufficient for float64 tensors. Example output:

python check.py forward -c

Forward: Baseline (Python) vs. C++ ... Ok
Forward: Baseline (Python) vs. CUDA ... Traceback (most recent call last):
  File "check.py", line 104, in <module>
    check_forward(variables, options.cuda, options.verbose)
  File "check.py", line 45, in check_forward
    check_equal(baseline_values, cuda_values, verbose)
  File "check.py", line 22, in check_equal
    np.testing.assert_allclose(x, y, err_msg="Index: {}".format(i))
  File "/home/cpinard/anaconda3/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 1452, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/home/cpinard/anaconda3/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 789, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0
Index: 0
(mismatch 13.333333333333329%)
 x: array([-1.206306e-04,  9.878260e-01, -2.557970e-01,  3.771263e-01,
       -1.863440e-01,  5.914125e-02,  6.415094e-01,  3.132478e-04,
        1.672588e-03, -4.412979e-03, -1.300380e-01, -7.609038e-01,
        5.438342e-01,  6.241342e-02, -3.342839e-01])
 y: array([-1.206305e-04,  9.878260e-01, -2.557970e-01,  3.771263e-01,
       -1.863440e-01,  5.914125e-02,  6.415094e-01,  3.132469e-04,
        1.672588e-03, -4.412979e-03, -1.300380e-01, -7.609038e-01,
        5.438342e-01,  6.241342e-02, -3.342839e-01])
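
For illustration only (a made-up kernel, not from this repo), this is the narrowing being described: fmaxf takes and returns float, so float64 inputs keep only float32 precision, while the double overload of fmax does not lose anything:

```cuda
__global__ void precision_demo(const double* z, double* out) {
  out[0] = fmaxf(0.0, z[0]);  // argument narrowed to float32: only ~7 significant digits survive
  out[1] = fmax(0.0, z[0]);   // double overload: full float64 precision preserved
}
```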
soumith commented 5 years ago

Whoops, this is my bad. Let me re-set up the environment and see what I can do about this.

stevewongv commented 5 years ago

@soumith Hi Soumith, did you find a solution to this precision problem? I ran into the same problem in my C++ extension, too.

saedrna commented 1 year ago

I also encountered a similar problem. After deleting some paths in the PATH variable that I felt might cause conflicts, I was able to solve it.