pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Problem building w/ +rocm on Cray system #63083

Closed lukebroskop closed 2 years ago

lukebroskop commented 3 years ago

I'm receiving errors during the caffe2 build portion of pytorch from the cuda namespace:

/gpfs/alpine/ven114/scratch/lukebr/spack-stage/spack-stage-py-torch-1.9.0-kjecihwkmbasmqyicb6x6klsortfzxue/spack-src/torch/csrc/jit/ir/ir.cpp:1106:16: error: 'set_stream' is not a member of 'torch::jit::cuda'; did you mean 'c10::cuda::set_stream'?
 1106 |     case cuda::set_stream:
      |                ^~~~~~~~~~

Here's what I use to install py-torch@1.9.0 %gcc +rocm~cuda~mkldnn~cudnn~magma~nccl~valgrind~tensorpipe~kineto~caffe2

PyTorch Version (e.g., 1.0): 1.9.0
OS (e.g., Linux): SUSE Linux Enterprise Server 15 SP2
How you installed PyTorch (conda, pip, source): source
Build command you used (if compiling from source): python setup.py install
Python version: 3.8.11
CUDA/cuDNN version: Rocm 4.2.0
GPU models and configuration: MI100, gfx908

Any ideas what to try, @adamjstewart?

cc @malfet @seemethere @walterddr @jeffdaily @sunway513 @jithunnair-amd @ROCmSupport

adamjstewart commented 3 years ago

Can you share the full spack-build-out.txt and spack-build-env.txt? That will give more context.

For PyTorch folks, note that this build was done with the Spack package manager which builds everything from source.

lukebroskop commented 3 years ago

Sorry, I forgot to add those! Attached: spack-build-env.txt, spack-build-out.txt

adamjstewart commented 3 years ago

I think there are two bugs here:

  1. Despite the USE_ROCM environment variable being set to ON, CMake is detecting that ROCM support should be disabled
  2. Despite USE_CUDA=OFF as an environment variable and in CMake, CMake is still trying to link to a CUDA installation

Will continue to dig into 1, but 2 is out of my area of expertise.
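
To see which values CMake actually settled on, one option is to read them back from the build's CMakeCache.txt. The following is a minimal sketch, not part of PyTorch or Spack: the helper name `read_cmake_cache` is hypothetical, and it assumes that USE_ROCM and USE_CUDA are recorded as cache entries in the usual `NAME:TYPE=VALUE` form.

```python
from pathlib import Path


def read_cmake_cache(cache_path, keys=("USE_ROCM", "USE_CUDA")):
    """Return {name: value} for the requested entries of a CMakeCache.txt file.

    Hypothetical helper for illustration; assumes NAME:TYPE=VALUE cache lines.
    """
    found = {}
    for line in Path(cache_path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines and the two comment styles used in CMakeCache.txt.
        if not line or line.startswith(("#", "//")):
            continue
        name_and_type, sep, value = line.partition("=")
        if not sep:
            continue
        # Drop the ":TYPE" suffix to get the bare variable name.
        name = name_and_type.split(":", 1)[0]
        if name in keys:
            found[name] = value
    return found
```

For example, `read_cmake_cache("build/CMakeCache.txt")` could be run from the source tree after a failed build; the `build/` location is an assumption and may differ inside a Spack stage directory.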

adamjstewart commented 3 years ago

Hmm, according to https://github.com/pytorch/pytorch/blob/v1.9.0/tools/setup_helpers/cmake.py#L264 USE_ROCM should be passed.

@kolamsrinivas added PyTorch + ROCm support to Spack in https://github.com/spack/spack/pull/17410, maybe they can comment on this.

jeffdaily commented 3 years ago

I'm not very familiar with spack, but is there any way to verify that ROCm is installed in your environment as a prereq?
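
As a rough illustration of such a check, the sketch below looks for a ROCm root directory and the `.info/version-dev` file that comes up later in this thread. The function name `describe_rocm_install` is hypothetical, and falling back to `/opt/rocm` when ROCM_PATH is unset is an assumption about a typical install, not something confirmed for this Cray system.

```python
import os
from pathlib import Path


def describe_rocm_install(rocm_path=None):
    """Report whether a ROCm root exists and which version file it carries.

    Hypothetical helper; the /opt/rocm fallback is an assumed default.
    """
    root = Path(rocm_path or os.environ.get("ROCM_PATH", "/opt/rocm"))
    version_file = root / ".info" / "version-dev"
    return {
        "root": str(root),
        "root_exists": root.is_dir(),
        # None means the version file was not found under this root.
        "version": version_file.read_text().strip() if version_file.is_file() else None,
    }
```

Running `print(describe_rocm_install())` inside the build environment would show whether the path the build sees points at a usable ROCm tree.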

adamjstewart commented 3 years ago

It looks like our Spack recipe is missing a dependency on ROCm. Does anyone know exactly which ROCm components are required? Right now Spack has packages for:

$ spack list rocm
==> 14 packages.
rocm-bandwidth-test  rocm-cmake   rocm-debug-agent  rocm-gdb     rocm-openmp-extras  rocm-smi-lib  rocm-validation-suite
rocm-clang-ocl       rocm-dbgapi  rocm-device-libs  rocm-opencl  rocm-smi            rocm-tensile  rocminfo

Also pinging our Spack + ROCm experts: @haampie @srekolam @arjun-raj-kuppala

srekolam commented 3 years ago

Yeah, the PyTorch recipe in Spack has not been updated for the ROCm recipes. I am trying to add all the relevant recipes one by one. Spack packages are available with the same names as those mentioned in https://github.com/pytorch/pytorch/blob/master/cmake/public/LoadHIP.cmake:

$ spack list hip
==> 13 packages.
hip  hip-rocclr  hipace  hipblas  hipcub  hipfft  hipfort  hipify-clang  hipsparse  hipsycl  kahip  miopen-hip  r-chipseq

spack list roc will also show the following: rocblas, rocfft, rocthrust, rocrand, rocprim, roctracer-dev

srekolam commented 3 years ago

I think the ROCm packages below need to be added to the recipe, but I see a build failure related to {ROCM_PATH}/.info/version-dev, which needs to be addressed for the Spack case:

depends_on('miopen-hip', when='+rocm')
depends_on('rccl', when='+rocm')
depends_on('rocprim', when='+rocm')
depends_on('hipcub', when='+rocm')
depends_on('rocthrust', when='+rocm')
depends_on('roctracer-dev', when='+rocm')
depends_on('rocrand', when='+rocm')
depends_on('hipsparse', when='+rocm')
depends_on('hipfft', when='+rocm')
depends_on('rocfft', when='+rocm')
depends_on('hsa-rocr-dev', when='+rocm')
depends_on('hip', when='+rocm')

srekolam commented 2 years ago

The py-torch recipe is now updated with ROCm dependencies and builds with Spack externals support for ROCm. @lukebroskop, can we close this issue?

lukebroskop commented 2 years ago

Thanks @srekolam !