teddykoker / torchsort

Fast, differentiable sorting and ranking in PyTorch
https://pypi.org/project/torchsort/
Apache License 2.0
774 stars 34 forks

Problem installing when no GPU present (in docker build step for example) #10

Closed pcnudde closed 3 years ago

pcnudde commented 3 years ago

Doesn't install during docker build phase (that does not have GPUs configured).

Get error:

```
/root/miniconda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /opt/conda/conda-bld/pytorch_1607370172916/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
```

If I install on the same image after running it with GPUs enabled it installs fine.

teddykoker commented 3 years ago

Hmm, setup.py currently checks whether nvcc is installed before attempting to build the CUDA extension. Perhaps the docker image you are using has nvcc installed but no CUDA driver, which is causing the failure. In that case it should probably also check torch.cuda.is_available().
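A minimal sketch of the combined check described above (this is an illustration, not torchsort's actual setup.py logic; the function name is hypothetical):

```python
# Hypothetical sketch: build the CUDA extension only when nvcc is on PATH
# *and* PyTorch can actually see a CUDA driver.
import shutil


def should_build_cuda_extension():
    """Return True only if both nvcc and a usable CUDA driver are present."""
    if shutil.which("nvcc") is None:  # no CUDA compiler installed
        return False
    try:
        import torch
        return torch.cuda.is_available()  # False when no driver is present
    except ImportError:
        return False
```

With only the nvcc check, a driver-less docker build (like the one reported here) would still attempt the CUDA build and fail.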

I have added this to the docker-fix branch. Let me know if the following works and I'll get it merged asap.

```
pip install git+https://github.com/teddykoker/torchsort.git@docker-fix
```

pcnudde commented 3 years ago

Thanks for the quick fix. Confirming indeed that this fix works. And yes I have nvcc installed but no driver (driver is not in the docker container, but part of the host).

teddykoker commented 3 years ago

Great! I'll put this in 0.1.1 now :)

pcnudde commented 3 years ago

Sorry, but I replied too quickly. With the fix it indeed installs, and I even tried the simple CPU example successfully. However, running a GPU example from the main page:

```python
x = torch.tensor([[8., 0., 5., 3., 2., 1., 6., 7., 9.]], requires_grad=True).cuda()
y = torchsort.soft_sort(x)
```

fails with:

```
~/miniconda/lib/python3.8/site-packages/torchsort/ops.py in forward(ctx, tensor, regularization, regularization_strength)
    126         # note reverse order of args
    127         if ctx.regularization == "l2":
--> 128             sol = isotonic_l2[s.device.type](w - s)
    129         else:
    130             sol = isotonic_kl[s.device.type](w, s)

TypeError: 'NoneType' object is not callable
```

teddykoker commented 3 years ago

I think this is related to this issue. In the most recent release I skip building the CUDA extension if CUDA is not available, so the package needs to be installed while CUDA is available. However, I believe it should be possible to build the CUDA extension without a CUDA runtime, as long as nvcc is installed (which it is in your docker image). PyTorch must check for an existing CUDA driver so that it knows which architecture to build for, which must be the problem you were running into initially.

I think you should be able to solve this by exporting the environment variable TORCH_CUDA_ARCH_LIST=All, which will build the extension for all architectures (this may result in a large binary, but it should work). This of course won't work with the most recent version (0.1.1), so perhaps try it with 0.1.0.

Let me know if this works and I can add a note to the readme. Additionally, if you could share a Dockerfile and steps to reproduce that would be very helpful!

pcnudde commented 3 years ago

Here is a relatively small Dockerfile that reproduces it.

teddykoker commented 3 years ago

Ah, my bad: TORCH_CUDA_ARCH_LIST="Pascal;Volta;Turing" is what you need. You should be able to use just one of those if you know your GPU's architecture. I have verified that this works locally. I'll add a note to the readme and remove the torch.cuda.is_available() check I added in the last release.
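In a driver-less docker build, the variable has to be set before pip runs. A hedged sketch of applying the workaround programmatically (the helper name is hypothetical, and the install is a dry run here rather than a real invocation):

```python
# Hypothetical sketch: export TORCH_CUDA_ARCH_LIST before the build step so
# nvcc compiles for the named architectures instead of querying a driver.
import os
import subprocess

os.environ["TORCH_CUDA_ARCH_LIST"] = "Pascal;Volta;Turing"


def install_torchsort(dry_run=True):
    """Build the pip command; with dry_run=True, return it instead of running it."""
    cmd = ["pip", "install", "torchsort"]
    if dry_run:
        return cmd  # show what would be executed
    # the child process inherits TORCH_CUDA_ARCH_LIST via os.environ
    return subprocess.run(cmd, env=os.environ, check=True)
```

In a Dockerfile the equivalent is an `ENV TORCH_CUDA_ARCH_LIST="Pascal;Volta;Turing"` line (or an inline export) before the `pip install` step.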

pcnudde commented 3 years ago

Thanks, I can verify that this works, with 'Ampere' added for my system.