pcnudde closed this issue 3 years ago
Hmm, `setup.py` currently checks if `nvcc` is installed before attempting to build the CUDA extension. Perhaps the docker image you are using has `nvcc` installed, but no CUDA driver, which is causing the failure. In that case maybe it should also check `torch.cuda.is_available()`.
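A minimal sketch of what that guard could look like (the function name and structure here are illustrative, not torchsort's actual `setup.py`):

```python
import shutil

def should_build_cuda_extension(cuda_available: bool) -> bool:
    # Build the extension only when nvcc is on PATH *and* a CUDA driver
    # is usable (i.e. torch.cuda.is_available() returned True).
    return shutil.which("nvcc") is not None and cuda_available

# The Docker build case: nvcc is installed, but there is no driver.
print(should_build_cuda_extension(cuda_available=False))  # → False
```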
I have added this to the `docker-fix` branch. Let me know if the following works and I'll get it merged asap.

```
pip install git+https://github.com/teddykoker/torchsort.git@docker-fix
```
Thanks for the quick fix. I can confirm that this fix works. And yes, I have nvcc installed but no driver (the driver is not in the docker container, but part of the host).
Great! I'll put this in 0.1.1 now :)
Sorry but I replied too quickly. With the fix, it indeed installs, and I even tried the simple CPU example successfully. However, running a GPU example from the main page:

```python
x = torch.tensor([[8., 0., 5., 3., 2., 1., 6., 7., 9.]], requires_grad=True).cuda()
y = torchsort.soft_sort(x)
```

fails with:

```
~/miniconda/lib/python3.8/site-packages/torchsort/ops.py in forward(ctx, tensor, regularization, regularization_strength)
    126         # note reverse order of args
    127         if ctx.regularization == "l2":
--> 128             sol = isotonic_l2[s.device.type](w - s)
    129         else:
    130             sol = isotonic_kl[s.device.type](w, s)

TypeError: 'NoneType' object is not callable
```
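For context, a minimal pure-Python sketch of why this error appears (the dict-dispatch structure below is assumed for illustration, not torchsort's actual code): if the CUDA extension was never built, the `"cuda"` entry resolves to `None`, and calling it raises exactly this `TypeError`.

```python
# Assumed structure: device type -> compiled kernel, where "cuda" maps
# to None because the CUDA extension was skipped at install time.
isotonic_l2 = {"cpu": lambda tensor: tensor, "cuda": None}

try:
    isotonic_l2["cuda"]("dummy input")  # tensor lives on the GPU
except TypeError as err:
    print(err)  # → 'NoneType' object is not callable
```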
I think this is related to this issue. In the most recent release I have it skip building the CUDA extension if CUDA is not available, so it will need to be installed while CUDA is available. However, I believe it should be possible to build the CUDA extension without a CUDA runtime, so long as `nvcc` is installed (which it is in your docker image). PyTorch must check for an existing CUDA driver so that it knows which architectures to build for, which must be the problem you were running into initially.
I think you should be able to solve this by exporting the environment variable `TORCH_CUDA_ARCH_LIST=All`, which will build the extension for all of the architectures (this may result in a large binary file, but it should work). This of course won't work with the most recent version (0.1.1), so perhaps try this with 0.1.0.

Let me know if this works and I can add a note to the readme. Additionally, if you could share a Dockerfile and steps to reproduce, that would be very helpful!
Here is a relatively small Dockerfile that reproduces it.
Ah, my bad: `TORCH_CUDA_ARCH_LIST="Pascal;Volta;Turing"` is what you need. You should be able to use just one of those if you know which architecture your GPU is. I have verified that this works locally. I'll add a note to the readme and remove the `torch.cuda.is_available()` check I added in the last release.
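For anyone driving the install from a script, a sketch of setting the variable before the build is triggered (the commented `subprocess` line is illustrative only; the value is the one suggested above):

```python
import os

# Pin the target architectures so nvcc can compile the extension
# without querying a (possibly missing) CUDA driver.
os.environ["TORCH_CUDA_ARCH_LIST"] = "Pascal;Volta;Turing"

# A pip install launched from this process would now inherit the setting:
# subprocess.run(["pip", "install", "git+https://github.com/teddykoker/torchsort.git"])
print(os.environ["TORCH_CUDA_ARCH_LIST"])  # → Pascal;Volta;Turing
```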
Thanks, I can verify that this works, with 'Ampere' added for my system.
Doesn't install during the docker build phase (which does not have GPUs configured). I get the error:

```
/root/miniconda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /opt/conda/conda-bld/pytorch_1607370172916/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
```

If I install on the same image after running it with GPUs enabled, it installs fine.