cunn.test() fail

ysagon commented 8 years ago

I'm the sysadmin of a cluster and I'm trying to get cunn working with our GPUs.

I'm installing and compiling on a node without a GPU (but with the CUDA SDK 7.5).

It compiles fine.

I'm trying to execute the tests:

luajit -l cunn -e 'cunn.test()'

36/148 SpatialAdaptiveMaxPooling_forward_noncontig ..................... [PASS]
37/148 SpatialAveragePooling_backward .................................. [PASS]
38/148 ELU_transposed .................................................. [PASS]
39/148 SpatialBatchNormalization ....................................... [WAIT]THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6806/cutorch/init.c line=230 error=77 : an illegal memory access was encountered
39/148 SpatialBatchNormalization ....................................... [ERROR]
/home/sagon/torch/install/bin/luajit: cuda runtime error (77) : an illegal memory access was encountered at /tmp/luarocks_cutorch-scm-1-6806/cutorch/lib/THC/generic/THCStorage.c:147

This node has two M2090 GPUs (compute capability 2.0).

Driver version: 352.79

soumith commented 8 years ago

Hi @ysagon. Someone mentioned that this kind of error occurs if you compile for CUDA compute capability 3.0+ and run on 2.0.

There's an environment variable, TORCH_CUDA_ARCH_LIST, that you can set at compile time to manually specify the architectures to build for, in your case 2.0.
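A minimal sketch of passing the variable for a single build command. The helper name `with_cuda_arch` is hypothetical, and whether a given version of the build scripts actually honors TORCH_CUDA_ARCH_LIST is an assumption worth checking against the build output. Rebuilding cutorch as well as cunn is also an assumption, not something confirmed in this thread.

```shell
# Hypothetical helper: run one command with TORCH_CUDA_ARCH_LIST set
# in that command's environment only.
with_cuda_arch() {
  arch_list="$1"
  shift
  env TORCH_CUDA_ARCH_LIST="$arch_list" "$@"
}

# Intended use (commented out so the snippet has no side effects):
# with_cuda_arch 2.0 luarocks install cutorch
# with_cuda_arch 2.0 luarocks install cunn
```

Setting the variable on the command itself avoids depending on an earlier `export` still being in effect in the shell that runs the build.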

soumith commented 8 years ago

The build log of cunn should tell you which architectures it is being built for.
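One way to pull that information out of the build output, as a rough sketch. The helper name `cuda_archs_from_log` and the file name `build.log` are hypothetical. The line prefix it matches, `-- Compiling for CUDA architecture:`, is taken from the CMake output quoted in this thread and may differ in other versions.

```shell
# Hypothetical helper: print the architecture list reported by CMake.
# Reads the build output on stdin and prints nothing if no such line exists.
cuda_archs_from_log() {
  sed -n 's/^-- Compiling for CUDA architecture: *//p'
}

# Intended use, assuming the build output was saved to build.log:
# luarocks install cunn 2>&1 | tee build.log
# cuda_archs_from_log < build.log
```

If the printed list contains more than the architecture you asked for, the variable was probably not picked up by the build.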

ysagon commented 8 years ago

@soumith thanks, I suspected something like that too.

I have no idea how luarocks works, so I don't know if I'm doing things correctly.

I installed cunn like this:

export TORCH_CUDA_ARCH_LIST=2.0
[sagon@login1 torch]$ luarocks install cunn
Installing https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec...
Using https://raw.githubusercontent.com/torch/rocks/master/cunn-scm-1.rockspec... switching to 'build' mode
Initialized empty Git repository in /tmp/luarocks_cunn-scm-1-6364/cunn/.git/
remote: Counting objects: 538, done.
remote: Compressing objects: 100% (302/302), done.
remote: Total 538 (delta 357), reused 360 (delta 220), pack-reused 0
Receiving objects: 100% (538/538), 380.09 KiB, done.
Resolving deltas: 100% (357/357), done.
cmake -E make_directory build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH="/home/sagon/torch/install/bin/.." -DCMAKE_INSTALL_PREFIX="/home/sagon/torch/install/lib/luarocks/rocks/cunn/scm-1" && make -j$(getconf _NPROCESSORS_ONLN) install

-- The C compiler identification is GNU 4.4.7
-- The CXX compiler identification is GNU 4.4.7
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Found Torch7 in /home/sagon/torch/install
-- Found CUDA: /usr/local/cuda (found suitable version "7.5", minimum required is "6.5") 
-- Automatic GPU detection failed. Building for all known architectures.
-- Compiling for CUDA architecture: 2.0 2.1(2.0) 3.0 3.5 5.0 5.2
-- Configuring done
-- Generating done

It seems it's compiling cunn for every architecture, which appears to be the default when auto-detection fails. Do I have to pass this variable another way?

Anyway, I logged in to the node with the GPUs and reinstalled cunn. This time the log says it's compiled for CUDA architecture 2.0, but the tests still fail at the same place.

opterix commented 8 years ago

Hi all.

I get a similar error:

 32/134 SparseLinear_backward ........................................... [PASS]
 33/134 SpatialReflectionPadding_backward ............................... [PASS]
 34/134 SpatialAdaptiveMaxPooling_forward_noncontig ..................... [PASS]
 35/134 SpatialAveragePooling_backward .................................. [PASS]
 36/134 ELU_transposed .................................................. [PASS]
 37/134 SpatialBatchNormalization ....................................... [WAIT]
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6519/cutorch/init.c line=218 error=77 : an illegal memory access was encountered
 37/134 SpatialBatchNormalization ....................................... [ERROR]
cuda runtime error (77) : an illegal memory access was encountered at /tmp/luarocks_cutorch-scm-1-6519/cutorch/lib/THC/generic/THCStorage.c:158

The GPU in that computer is a Tesla M2075. I wonder whether anyone has managed to solve this.

torch / cutorch

cunn.test() fail #442