[CUDA] Fix manylinux build

comaniac commented 2 years ago

I checked the errors in the latest pip build, and the root cause appears to be just the CUDA path. Unless we explicitly point CMake at the right CUDA version, it still finds CUDA 10.2 in these images.
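To illustrate what "explicitly pointing to the right CUDA version" could look like, here is a minimal sketch that assembles a CMake configure command with the toolkit location pinned. `CUDA_TOOLKIT_ROOT_DIR` is the search hint read by CMake's FindCUDA module; the helper name `cmake_configure_args` and the path `/usr/local/cuda-11.3` are assumptions for illustration, and the thread does not say which variable the build scripts in this PR actually set.

```python
import shlex


def cmake_configure_args(source_dir, cuda_root):
    """Build a CMake configure command that pins the CUDA toolkit location.

    Hypothetical helper, not part of the tlcpack build scripts.
    """
    return [
        "cmake",
        source_dir,
        # Hint for CMake's FindCUDA module; without it, the module falls back
        # to its default search and may pick up another installed toolkit.
        f"-DCUDA_TOOLKIT_ROOT_DIR={cuda_root}",
    ]


# Assumed install path; adjust to wherever the image places CUDA 11.3.
args = cmake_configure_args("/workspace/tvm", "/usr/local/cuda-11.3")
print(shlex.join(args))
```

Returning the arguments as a list keeps the command safe to pass to `subprocess.run` without shell quoting issues.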

Meanwhile, I tried to debug the conda build failure. My best guess is that it is caused by the different conda environment in these images, but I have no idea how to resolve it.

cc @areusch @tqchen

areusch commented 2 years ago

@comaniac it does seem a little odd that just by linking against cuda 11.5 we can't find cuDNN. are you able to verify the wheels can be built if we rebuild with these docker containers? i tried to read through the build script but as best i can tell it can find all of the component libraries, just not libcudnn.so even though it is in fact present and in the same dir as the other component CUDA libs.

comaniac commented 2 years ago

Which failure are you referring to? Did you mean CUDA 11.3? As I said, the CUDA 11.3 failure happens because CMake is using CUDA 10.2, as indicated in the log:

Building TVM with CUDA 11.3
-- The C compiler identification is GNU 9.3.1
-- The CXX compiler identification is GNU 9.3.1
...
-- Build with CUDA 10.2 support
-- Build with cuDNN support
-- Build with cuBLAS support
...
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Configuring done
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDA_CUDNN_LIBRARY
    linked by target "tvm_runtime" in directory /workspace/tvm
    linked by target "tvm" in directory /workspace/tvm

As can be seen, CMake's FindCUDA still locates CUDA 10.2. With the image built from this PR I am able to build TVM, so I think this is the reason.
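The mismatch in the log above can also be checked mechanically. The following is a rough sketch, assuming the log always contains the two line formats shown in the excerpt ("Building TVM with CUDA X.Y" and "-- Build with CUDA X.Y support"); the function name `cuda_version_mismatch` is hypothetical and not part of any build script.

```python
import re


def cuda_version_mismatch(log_text):
    """Return (requested, detected) if the CUDA versions in the log differ.

    Returns None when the versions agree or when either line is missing.
    """
    requested = re.search(r"Building TVM with CUDA (\d+\.\d+)", log_text)
    detected = re.search(r"-- Build with CUDA (\d+\.\d+) support", log_text)
    if requested is None or detected is None:
        return None
    if requested.group(1) == detected.group(1):
        return None
    return (requested.group(1), detected.group(1))


# Lines copied from the log excerpt in this thread.
sample_log = """Building TVM with CUDA 11.3
-- The C compiler identification is GNU 9.3.1
-- Build with CUDA 10.2 support
-- Build with cuDNN support
"""

mismatch = cuda_version_mismatch(sample_log)
if mismatch is not None:
    print(f"requested CUDA {mismatch[0]}, but CMake configured CUDA {mismatch[1]}")
```

A check like this could fail the build early, before the less obvious `CUDA_CUDNN_LIBRARY` NOTFOUND error appears.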

areusch commented 2 years ago

ah ok, we can merge if you've tested the build. it just seemed weird to me that CUDNN was what broke this--i think when i was looking at this last week I had found that library in both CUDA 10.2 and 11.3 so that failure seemed odd. anyhow, let's see what happens when i merge this.

tlc-pack / tlcpack

[CUDA] Fix manylinux build #122