torch / torch7

http://torch.ch

installation to custom location issues #661

Open phanousk opened 8 years ago

phanousk commented 8 years ago

Hi there, as a cluster administrator I have just tried to install torch7 to the location where we keep our software modules. The procedure was as follows:

module add qt-4.8.5 mkl-11.0 cmake-3.2.3 cuda-7.5

#libzmq
git clone https://github.com/zeromq/libzmq.git
cd libzmq
./autogen.sh
./configure --prefix=/software/torch/20160425-deb8
make
make install

#torch
git clone https://github.com/torch/distro.git ./torch --recursive
export CMAKE_INCLUDE_PATH=/software/mkl-11.0/composer_xe_2013.0.079/mkl/include:/software/torch/20160425-deb8/include:$CMAKE_INCLUDE_PATH
export CMAKE_LIBRARY_PATH=/software/mkl-11.0/composer_xe_2013.0.079/mkl/lib/intel64:/software/torch/20160425-deb8/lib:$CMAKE_LIBRARY_PATH
#Edit install.sh and update PREFIX path
cd ./torch
PREFIX=/software/torch/20160425-deb8 ./install.sh
export PATH=$PATH:/software/torch/20160425-deb8/bin
luarocks install lzmq ZMQ_DIR=/software/torch/20160425-deb8
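
As a quick sanity check (a sketch, not part of the original procedure; it assumes the bundled luajit resolves rocks under its own prefix, which the stack trace below suggests it does):

# smoke-test the prefixed install: load torch and nn through the bundled luajit
/software/torch/20160425-deb8/bin/luajit -e "require 'torch'; require 'nn'; print('torch OK')"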

Everything seemed to install correctly; however, one of our users then encountered this error:

THCudaCheck FAIL file=/scratch.ssd/hanousek/job_11137891.arien.ics.muni.cz/torch/extra/cutorch/lib/THC/THCTensorMathPairwise.cu line=36 error=8 : invalid device function
/afs/.ics.muni.cz/software/torch/20160425-deb8/bin/luajit: ...tware/torch/20160425-deb8/share/lua/5.1/nn/Container.lua:67:
In 2 module of nn.Sequential:
...are/torch/20160425-deb8/share/lua/5.1/nn/AddConstant.lua:22: cuda runtime error (8) : invalid device function at /scratch.ssd/hanousek/job_11137891.arien.ics.muni.cz/torch/extra/cutorch/lib/THC/THCTensorMathPairwise.cu:36
stack traceback:
        [C]: in function 'add'
        ...are/torch/20160425-deb8/share/lua/5.1/nn/AddConstant.lua:22: in function <...are/torch/20160425-deb8/share/lua/5.1/nn/AddConstant.lua:15>
        [C]: in function 'xpcall'
        ...tware/torch/20160425-deb8/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        ...ware/torch/20160425-deb8/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
        ./training.lua:50: in function 'validation'
        ./training.lua:88: in function 'trainModel'
        main_train.lua:51: in main chunk
        [C]: in function 'dofile'
        ...orch/20160425-deb8/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00405d30

This error happened because the directory

/scratch.ssd/hanousek/job_11137891.arien.ics.muni.cz/torch...

no longer existed; it was only my temporary build directory until the installation took place. Could you please update the installation procedure so that a custom PREFIX is respected in all of its parts? It looks like only the core parts are influenced by PREFIX, while the extras (and possibly other parts) are not. Thank you, Petr
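
For reference, a hedged sketch of one way to keep everything under the prefix (assuming install.sh honors the PREFIX environment variable, as the commands above do, and that extras installed through the prefixed luarocks land in the same tree; building from a permanent directory also avoids baking temporary paths into the binaries):

PREFIX=/software/torch/20160425-deb8
# build from a permanent directory so paths recorded at compile time stay valid
git clone https://github.com/torch/distro.git $PREFIX/build --recursive
cd $PREFIX/build
PREFIX=$PREFIX ./install.sh
# install extras through the prefixed luarocks so they follow the same PREFIX
$PREFIX/bin/luarocks install cutorch
$PREFIX/bin/luarocks install lzmq ZMQ_DIR=$PREFIX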

albanD commented 8 years ago

Hi,

I don't think the problem comes from the non-existent folder; isn't that just debugging info in the library, recording the path of the file it was compiled from? CUDA error 8 means that the CUDA binary does not contain the correct architecture to run on your hardware. My guess is that the architecture of the GPU you are trying to use was not properly detected during compilation. There should be a line like -- Compiling for CUDA architecture: xx during the cutorch installation procedure. Does the architecture stated there match that of the GPUs in your cluster?
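
One hedged way to check what compute capability the cluster GPUs actually report, using cutorch's getDeviceCount/getDeviceProperties calls through the same luajit that appears in the trace above:

# print the compute capability of every visible GPU via cutorch
/software/torch/20160425-deb8/bin/luajit -e "require 'cutorch'
for dev = 1, cutorch.getDeviceCount() do
  local p = cutorch.getDeviceProperties(dev)
  print(string.format('GPU %d: %s, compute capability %d.%d', dev, p.name, p.major, p.minor))
end"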

phanousk commented 8 years ago

Well, at first I thought the same, but then I recompiled the whole project from a permanently accessible directory (/software/torch/20160425-deb8/build) and everything works as needed.

If I want to see the -- Compiling for CUDA architecture: xx line, I guess I have to run the compilation again and capture its output? I cannot find anything like it in the post-installation logs anywhere in the torch build directory.

albanD commented 8 years ago

To see the CUDA architecture flags, you can run luarocks install cutorch; the line appears near the beginning of the output, before all the warnings.
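
If the build output scrolls by too fast, a small sketch to capture and extract that line (assuming the message text matches the one quoted above):

# reinstall cutorch, keep the full build log, and pull out the detected architecture
luarocks install cutorch 2>&1 | tee cutorch-build.log
grep -i "CUDA architecture" cutorch-build.log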