nagadomi / distro

Unofficial maintenance repository of Torch7. It supports CUDA10.1, Volta, Turing, Docker https://hub.docker.com/r/nagadomi/torch7
BSD 3-Clause "New" or "Revised" License

Issues with new CUDA version #11

Open Totemi1324 opened 4 years ago

Totemi1324 commented 4 years ago

Hello. First of all, many thanks for making these modifications; they fixed a whole lot of my problems with installing Torch so far! During installation, though, I came across an error that is likely due to the new version of CUDA. CUDA 11 came out recently, and when I tried to build with it, the following error appeared:

/home/tamas/torch/extra/cutorch/init.c: In function ‘cutorch_isManagedPtr’:
/home/tamas/torch/extra/cutorch/init.c:938:34: error: ‘struct cudaPointerAttributes’ has no member named ‘isManaged’
  938 |     lua_pushboolean(L, attributes.isManaged);
      |                                  ^
make[2]: *** [CMakeFiles/cutorch.dir/build.make:80: CMakeFiles/cutorch.dir/init.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:115: CMakeFiles/cutorch.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

Error: Build error: Failed building.

It seems that the isManaged attribute was deprecated and is no longer supported (see https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaPointerAttributes.html#structcudaPointerAttributes). Is there a chance you can fix this, or am I forced to switch to CUDA 10.1?
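
One way to confirm this diagnosis before touching any source is to check whether the installed CUDA headers still declare the field. The sketch below assumes the toolkit lives under /usr/local/cuda (override with CUDA_HOME) and that struct cudaPointerAttributes is declared in include/driver_types.h; header_mentions is a hypothetical helper name, not something from the repository.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if <identifier> appears as a whole word
# in <header-file>.
# Usage: header_mentions <header-file> <identifier>
header_mentions() {
    grep -qw -- "$2" "$1"
}

# Assumed default install prefix; set CUDA_HOME to match your system.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
# Assumed location of the struct cudaPointerAttributes declaration.
header="$CUDA_HOME/include/driver_types.h"

if [ -f "$header" ]; then
    if header_mentions "$header" isManaged; then
        echo "isManaged is still declared in $header"
    else
        echo "isManaged is not declared in $header; init.c will not compile as-is"
    fi
else
    echo "no CUDA header found at $header"
fi
```

If the second message is printed, the build error above is expected with that toolkit.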

nagadomi commented 4 years ago

I haven't tried building with CUDA 11 yet. Maybe the error can be fixed with the following change. Also, the affected functions (cutorch.isManaged, cutorch.toCudaUVATensor and cutorch.toFloatUVATensor) are probably not called from any program.

diff --git a/init.c b/init.c
index 8b32a1a..a2307bb 100644
--- a/init.c
+++ b/init.c
@@ -935,7 +935,7 @@ static int cutorch_isManagedPtr(lua_State *L)
     lua_pushboolean(L, 0);
   } else {
     THCudaCheck(res);
-    lua_pushboolean(L, attributes.isManaged);
+    lua_pushboolean(L, attributes.type == cudaMemoryTypeManaged);
   }
   return 1;
 }
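
For anyone who prefers not to edit the file by hand, the same one-line change could be applied with sed. This is only a sketch: patch_is_managed is a hypothetical helper name, and the path extra/cutorch/init.c is taken from the error message in the first comment, relative to the torch checkout.

```shell
#!/bin/sh
# Hypothetical helper: rewrite attributes.isManaged in the given file,
# keeping the untouched original next to it as <file>.orig.
patch_is_managed() {
    file=$1
    # Keep a copy so the change can be reverted.
    cp "$file" "$file.orig" || return 1
    # Replace the removed field with the comparison from the diff above.
    sed 's/attributes\.isManaged/attributes.type == cudaMemoryTypeManaged/' \
        "$file.orig" > "$file"
}

# Path relative to the torch checkout, as in the error message above.
target=extra/cutorch/init.c
if [ -f "$target" ]; then
    patch_is_managed "$target" && echo "patched $target"
else
    echo "$target not found; run this from the torch directory"
fi
```

To undo the change, move init.c.orig back over init.c.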

Totemi1324 commented 4 years ago

Hello, I tried your solution and it seems to work; however, a new error showed up. I suppose it has nothing to do with the fix you provided and would have appeared anyway.

/home/tamas/torch/extra/cunn/lib/THCUNN/generic/SparseLinear.cu(95): error: identifier "cusparseScsrmm" is undefined

/home/tamas/torch/extra/cunn/lib/THCUNN/generic/SparseLinear.cu(194): error: identifier "cusparseScsrmm" is undefined

/home/tamas/torch/extra/cunn/lib/THCUNN/generic/SparseLinear.cu(97): error: identifier "cusparseDcsrmm" is undefined

/home/tamas/torch/extra/cunn/lib/THCUNN/generic/SparseLinear.cu(196): error: identifier "cusparseDcsrmm" is undefined

4 errors detected in the compilation of "/home/tamas/torch/extra/cunn/lib/THCUNN/SparseLinear.cu".
CMake Error at THCUNN_generated_SparseLinear.cu.o.cmake:267 (message):
  Error generating file
  /home/tamas/torch/extra/cunn/build/lib/THCUNN/CMakeFiles/THCUNN.dir//./THCUNN_generated_SparseLinear.cu.o

make[2]: *** [lib/THCUNN/CMakeFiles/THCUNN.dir/build.make:268: lib/THCUNN/CMakeFiles/THCUNN.dir/THCUNN_generated_SparseLinear.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:111: lib/THCUNN/CMakeFiles/THCUNN.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

Error: Build error: Failed building.

Any ideas of what I can do?

JackGarbiec commented 3 years ago

NVIDIA removed those deprecated functions in the CUDA 11 release. The docs recommend a different function, but it takes different arguments and would require reworking the matrices in that file to make them fit. Does anyone know what this THCUNN lib even is?

nagadomi commented 3 years ago

It is a linear module for the sparse matrix format. I haven't used sparse matrices in Torch7, so I could fix it, but I'm not confident I could test it. If you're not using it, I think the easiest solution is to remove it from the library.

mw66 commented 2 years ago

If you're not using it, I think the easiest solution is to remove it from the library.

How do I remove it? Do you mean removing the whole extra/cunn directory?

I tried just renaming these 2 files:

$ mv extra/cunn/lib/THCUNN/generic/SparseLinear.cu  extra/cunn/lib/THCUNN/generic/SparseLinear.cu.orig
$ mv extra/cunn/lib/THCUNN/SparseLinear.cu extra/cunn/lib/THCUNN/SparseLinear.cu.orig

Torch complains about some undefined symbols (e.g. THNN_CudaSparseLinear_updateOutput), but otherwise it seems to work, as long as your code does not call any functions in these files.
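
The rename above can be wrapped so that it is easy to undo. This is a minimal sketch under the same assumption as the mv commands (run from the torch checkout); disable_source and restore_source are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers: set a source file aside as <file>.orig, or bring it back.
disable_source() {
    if [ -f "$1" ]; then mv "$1" "$1.orig"; fi
}
restore_source() {
    if [ -f "$1.orig" ]; then mv "$1.orig" "$1"; fi
}

# The two files named in the comment above; missing files are skipped.
for f in extra/cunn/lib/THCUNN/generic/SparseLinear.cu \
         extra/cunn/lib/THCUNN/SparseLinear.cu; do
    disable_source "$f"
done
```

Calling restore_source on the same two paths puts the files back, for example before pulling updates.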

nagadomi commented 2 years ago

How do I remove it? Do you mean removing the whole extra/cunn directory?

No, I mean removing only the functions related to sparse matrices where CUDA is used. However, I have not tried it. On Ubuntu 21.04, qt4 has also been removed and there is no PPA package. I think it is better to use the Docker version (Ubuntu 18.04 and CUDA 10).

bluedevils23 commented 2 years ago

RTX 30 series cards only support CUDA 11, so we cannot run Torch on the latest cards now.

kadok commented 2 years ago

How do I remove it? Do you mean removing the whole extra/cunn directory?

No, I mean removing only the functions related to sparse matrices where CUDA is used. However, I have not tried it. On Ubuntu 21.04, qt4 has also been removed and there is no PPA package. I think it is better to use the Docker version (Ubuntu 18.04 and CUDA 10).

Hello,

I tried to follow this approach and everything works fine.

Just comment out these lines in SparseLinear.cu:

Line 94

/*
#ifdef THC_REAL_IS_FLOAT
  cusparseScsrmm(cusparse_handle,
#elif defined(THC_REAL_IS_DOUBLE)
  cusparseDcsrmm(cusparse_handle,
#endif
      CUSPARSE_OPERATION_NON_TRANSPOSE,
      batchnum, outDim, inDim, nnz,
      &one,
      descr,
      THCTensor_(data)(state, values),
      THCudaIntTensor_data(state, csrPtrs),
      THCudaIntTensor_data(state, colInds),
      THCTensor_(data)(state, weight), inDim,
      &one, THCTensor_(data)(state, buffer), batchnum
  );
*/

Line 193

/*
#ifdef THC_REAL_IS_FLOAT
  cusparseScsrmm(cusparse_handle,
#elif defined(THC_REAL_IS_DOUBLE)
  cusparseDcsrmm(cusparse_handle,
#endif
      CUSPARSE_OPERATION_NON_TRANSPOSE,
      inDim, outDim, batchnum, nnz,
      &one,
      descr,
      THCTensor_(data)(state, values),
      THCudaIntTensor_data(state, colPtrs),
      THCudaIntTensor_data(state, rowInds),
      THCTensor_(data)(state, buf), batchnum,
      &one, THCTensor_(data)(state, gradWeight), inDim
  );
*/
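
Line numbers can drift between checkouts, so it may be safer to locate the calls by name instead of relying on lines 94 and 193. The sketch below assumes the file in question is extra/cunn/lib/THCUNN/generic/SparseLinear.cu, the path reported in the compiler errors earlier in this thread; find_csrmm_calls is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print "line:text" for every cusparseScsrmm or
# cusparseDcsrmm reference in the given file.
find_csrmm_calls() {
    grep -n 'cusparse[SD]csrmm' "$1"
}

# Path relative to the torch checkout, taken from the compiler errors above.
src=extra/cunn/lib/THCUNN/generic/SparseLinear.cu
if [ -f "$src" ]; then
    find_csrmm_calls "$src" || echo "no cusparseScsrmm/cusparseDcsrmm references in $src"
else
    echo "$src not found; run this from the torch directory"
fi
```

Note that with these calls commented out the sparse product is presumably never computed, so this only suits code that does not use the SparseLinear module on CUDA.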