pytorch / tvm

TVM integration into PyTorch

Error Building torch_tvm [NGC Container] #112

Closed SrivastavaKshitij closed 4 years ago

SrivastavaKshitij commented 4 years ago

I am trying to build torch_tvm inside the PyTorch NGC container [19.08-py3]. However, I am encountering the same error as in #77.

CMakeFiles/_torch_tvm.dir/build.make:218: recipe for target 'CMakeFiles/_torch_tvm.dir/torch_tvm/fusion_pass.cpp.o' failed
make[2]: *** [CMakeFiles/_torch_tvm.dir/torch_tvm/fusion_pass.cpp.o] Error 1
In file included from /tvm/torch_tvm/compiler.h:13:0,
                 from /tvm/torch_tvm/register.cpp:8:
/tvm/torch_tvm/memory_utils.h: In member function ‘void torch_tvm::utils::DLManagedTensorDeleter::operator()(DLManagedTensor*)’:
/tvm/torch_tvm/memory_utils.h:22:24: warning: deleting ‘void*’ is undefined [-Wdelete-incomplete]
       delete dl_tensor.data;
                        ^~~~
CMakeFiles/Makefile2:73: recipe for target 'CMakeFiles/_torch_tvm.dir/all' failed
make[1]: *** [CMakeFiles/_torch_tvm.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2
Traceback (most recent call last):
  File "setup.py", line 273, in <module>
    url='https://github.com/pytorch/tvm',
  File "/opt/conda/lib/python3.6/site-packages/setuptools/__init__.py", line 145, in setup
    return distutils.core.setup(**attrs)
  File "/opt/conda/lib/python3.6/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "setup.py", line 203, in run
    setuptools.command.install.install.run(self)
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/install.py", line 65, in run
    orig.install.run(self)
  File "/opt/conda/lib/python3.6/distutils/command/install.py", line 545, in run
    self.run_command('build')
  File "/opt/conda/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.6/distutils/command/build.py", line 135, in run
    self.run_command(cmd_name)
  File "/opt/conda/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "setup.py", line 187, in run
    self.run_command('cmake_build')
  File "/opt/conda/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "setup.py", line 176, in run
    self._run_build()
  File "setup.py", line 165, in _run_build
    subprocess.check_call(build_args)
  File "/opt/conda/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/cmake', '--build', '.', '--', '-j', '12']' returned non-zero exit status 2.

I tried the different methods described here, here, and here, but I haven't had any success.

How can this issue be fixed?

SrivastavaKshitij commented 4 years ago

I downloaded LLVM using wget http://releases.llvm.org/8.0.0/clang+llvm-8.0.0-x86_64-linux-gnu-ubuntu-16.04.tar.xz and made sure it's on the PATH when building torch_tvm.

SrivastavaKshitij commented 4 years ago

I tried the nightly build container from the PyTorch Docker Hub, pytorch/pytorch:nightly-devel-cuda10.0-cudnn7, and I encounter the same errors.

kimishpatel commented 4 years ago

Can you paste repro instructions? It is not clear where the error is coming from. The memory_utils.h output seems to be only a warning, and it does not look like treating warnings as errors is enabled either.

SrivastavaKshitij commented 4 years ago

Repro instructions:

docker pull nvcr.io/nvidia/pytorch:19.06-py3
docker_image=nvcr.io/nvidia/pytorch:19.06-py3
docker run -e NVIDIA_VISIBLE_DEVICES=0 --gpus 0 -it --shm-size=1g --ulimit memlock=-1  --rm  -v $PWD:/workspace/work $docker_image

Inside the container, I go to the base directory: cd /
wget http://releases.llvm.org/8.0.0/clang+llvm-8.0.0-x86_64-linux-gnu-ubuntu-16.04.tar.xz
tar -xf clang+llvm-8.0.0-x86_64-linux-gnu-ubuntu-16.04.tar.xz
export PATH=$PATH:/clang+llvm-8.0.0-x86_64-linux-gnu-ubuntu-16.04/bin/
ln -s /clang+llvm-8.0.0-x86_64-linux-gnu-ubuntu-16.04/bin/llvm-config /usr/bin/llvm-config
git clone --recursive https://github.com/pytorch/tvm.git
cd tvm/
python setup.py install --cmake

I have attached the full output:

build.txt

kimishpatel commented 4 years ago

@SrivastavaKshitij, the error seems to be coming from a change in the PyTorch API.

/tvm/torch_tvm/compiler.cpp: In static member function ‘static tvm::relay::Var TVMCompiler::convertToRelay(torch::jit::Value*, TVMContext)’:
/tvm/torch_tvm/compiler.cpp:130:39: error: ‘using element_type = struct c10::TensorType {aka struct c10::TensorType}’ has no member named ‘device’
     auto optional_device_type = pt_t->device();
                                       ^~~~~~

Maybe try with the latest release?

kimishpatel commented 4 years ago

@bwasti ^^

SrivastavaKshitij commented 4 years ago

@kimishpatel: I tried the latest NGC container [19.08-py3] and hit the same error.

SrivastavaKshitij commented 4 years ago

I was wondering if there is any update?

bwasti commented 4 years ago

I'm not entirely sure which version of PT the NGC containers are shipping, but we've kept this repo up to date with PyTorch's master branch. Would you be able to try building PyTorch from source first? The API mismatch in the build indicates you are using too old a version of PT.

SrivastavaKshitij commented 4 years ago

I have to try torch_tvm on different GPUs in different workstations, so the only feasible way for me is to build one Docker image and pass it around. There is a new Docker image from PyTorch on Docker Hub that was released 4 days ago. I used the 1.2-cuda10.0-cudnn7-devel tag and I still get the same error.

bwasti commented 4 years ago

That image ships with PT 1.2, which is unfortunately not compatible with torch_tvm. Can you build a Docker image with PT built from source from a recent master checkout instead?

SrivastavaKshitij commented 4 years ago

Hey @bwasti: I was able to create a Docker image as you suggested, and it works. Here are the steps if anybody wants to install torch_tvm inside a container.

Also, is it possible to package torch_tvm as part of the PyTorch container in the future? Reason: it's a very cumbersome process to install torch_tvm inside a container, phew!!

doublejtoh commented 4 years ago

Hi @SrivastavaKshitij, thanks for your steps to install torch_tvm. Following your suggestions, I installed it successfully,

but I got the import error below, which you previously ran into as well.

Can you tell me the exact version of PyTorch you built?

SrivastavaKshitij commented 4 years ago

I did it many months ago, but I think it was PyTorch 1.2 built from master.