teddykoker / torchsort

Fast, differentiable sorting and ranking in PyTorch
https://pypi.org/project/torchsort/
Apache License 2.0

Failing to recognise torchsort cuda even when torch with cuda is successfully installed #56

Closed MushroomHunting closed 1 year ago

MushroomHunting commented 1 year ago

Thanks for the awesome package!

system: linux (arch) python: 3.8

So i've got a weird chicken-egg issue.

I successfully install pytorch using the suggested conda command

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

and verify it recognises the gpu with a = torch.tensor([1,2,3]).cuda(device=0)

then i run the demo code

x = torch.tensor([[8., 0., 5., 3., 2., 1., 6., 7., 9.]], requires_grad=True).cuda()
y = torchsort.soft_sort(x)

and it gives me the error

ImportError: You are trying to use the torchsort CUDA extension, but it looks like it is not available. Make sure you have the CUDA toolchain installed, and reinstall torchsort with `pip install --force-reinstall --no-cache-dir torchsort` to rebuild the extension.

...so I go to install it (again) as suggested. this is where the first weirdness happens: torchsort tries to download pytorch again!

Collecting torch
Downloading torch-1.12.1-cp38-cp38-manylinux1_x86_64.whl (776.3 MB)
...
Attempting uninstall: torch
Found existing installation: torch 1.12.1
Uninstalling torch-1.12.1:
Successfully uninstalled torch-1.12.1
...
Successfully installed torch-1.12.1 torchsort-0.1.9 typing-extensions-4.4.0

so it succeeds, but when I go back into code the gpu is no longer recognised!

I then re-install with the conda snippet, and pytorch itself works again (but torchsort with cuda doesn't)

Is there some procedure I'm missing? Or some python version weirdness that's breaking things?

cheers!

teddykoker commented 1 year ago

Hi Anthony,

Thank you for your patience! Just to be clear, are you following the conda installation steps in the README?:

  1. Install g++ with conda install -c conda-forge gxx_linux-64=9.40
  2. Run export CXX=/path/to/miniconda3/envs/env_name/bin/x86_64-conda_cos6-linux-gnu-g++
  3. Run export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/miniconda3/lib
  4. pip install --force-reinstall --no-cache-dir --no-deps torchsort

sheryc commented 1 year ago

I also encountered this problem. The reason seems to be that setup.py detects cuda installation by trying to call the nvcc command. There can be situations where cudatoolkit is installed but nvcc is not in PATH. A quick fix could be:

  1. Clone the repo
  2. After you double-checked your cudatoolkit installation, change https://github.com/teddykoker/torchsort/blob/3f81bf5209e7c65ee8ffff98ab69b259f7d793f8/setup.py#L14 to always return True
  3. cd to the root path of the repo, then run pip install -e .

teddykoker commented 1 year ago

@sheryc do you think there is a better way of checking for cudatoolkit?

sheryc commented 1 year ago

@teddykoker Is it possible to just check torch.cuda.is_available(), since torch will already be installed when running setup.py?

teddykoker commented 1 year ago

I thought about that, but I believe it is possible to have cuda drivers installed without nvcc, which is needed to compile the extension. It looks like torch.utils.cpp_extension has a more thorough way of looking for the cuda installation even if it is not in the PATH. Perhaps something like:

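import os
from functools import lru_cache
from subprocess import DEVNULL, call
from torch.utils import cpp_extension

@lru_cache(None)
def cuda_toolkit_available():
    # CUDA_HOME is None when PyTorch cannot locate a CUDA installation
    if cpp_extension.CUDA_HOME is None:
        return False
    try:
        # call the nvcc under CUDA_HOME, so it is found even if it is not in PATH
        nvcc = os.path.join(cpp_extension.CUDA_HOME, "bin", "nvcc")
        call([nvcc], stdout=DEVNULL, stderr=DEVNULL)
        return True
    except FileNotFoundError:
        return False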

Which would probably work better?
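
For anyone who wants to try the lookup order without touching setup.py, here is a minimal standalone sketch of the same idea. It is not the code in the repo: find_nvcc is a hypothetical helper, and it takes the CUDA root as a plain argument (in a real build this could be torch.utils.cpp_extension.CUDA_HOME, which may be None) so that it runs without torch installed. It only checks that an executable exists; it does not run nvcc.

```python
import os
import shutil


def find_nvcc(cuda_home=None, search_path=None):
    """Return a path to an nvcc executable, or None if none is found.

    cuda_home: assumed CUDA install root (hypothetical parameter); a real
        setup.py might pass torch.utils.cpp_extension.CUDA_HOME here.
    search_path: PATH-style string to search; None means the current PATH.
    """
    # 1. Prefer the toolkit root the caller points at, even if it is not in PATH.
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # 2. Fall back to a regular PATH lookup.
    return shutil.which("nvcc", path=search_path)


if __name__ == "__main__":
    # CUDA_HOME as an environment variable is only a stand-in for illustration.
    print("nvcc:", find_nvcc(os.environ.get("CUDA_HOME")))
```

If this prints None while torch.cuda.is_available() is True, that would match the situation described above: the runtime is present but the compiler is not.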

MushroomHunting commented 1 year ago

Hey @teddykoker

I totally missed that in the readme!

I've tried with

... gxx_linux-64=9.40

but it gives me

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:
Current channels:

  - https://conda.anaconda.org/conda-forge/linux-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

but if i omit the version it finds something:

The following packages will be downloaded:

    package                          |            build
    ---------------------------------|-----------------
    binutils_impl_linux-64-2.38      |       h2a08ee3_1         5.2 MB
    binutils_linux-64-2.38.0         |       hc2dff05_0          24 KB
    ca-certificates-2022.9.24        |       ha878542_0         150 KB  conda-forge
    certifi-2022.9.24                |     pyhd8ed1ab_0         155 KB  conda-forge
    conda-content-trust-0.1.3        |     pyhd8ed1ab_0          54 KB  conda-forge
    gcc_impl_linux-64-11.2.0         |       h1234567_1        22.2 MB
    gcc_linux-64-11.2.0              |       h5c386dc_0          25 KB
    gxx_impl_linux-64-11.2.0         |       h1234567_1        10.6 MB
    gxx_linux-64-11.2.0              |       hc2dff05_0          24 KB
    libgcc-devel_linux-64-11.2.0     |       h1234567_1         2.5 MB
    libstdcxx-devel_linux-64-11.2.0  |       h1234567_1        14.6 MB

So i went ahead, and exported paths as in the readme, using my own env and linking to /opt/anaconda/lib

but it gives a bunch of compile errors

ERROR: Command errored out with exit status 1: /home/ein/.conda/envs/py38bebop/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-h3jcy8sx/torchsort_81e3893b66cb4c6990aa70f634e4ac0e/setup.py'"'"'; __file__='"'"'/tmp/pip-install-h3jcy8sx/torchsort_81e3893b66cb4c6990aa70f634e4ac0e/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else
 io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-0zyu795c/install-record.txt --single-version-externally-managed --compile --install-headers /home/ein/.conda/envs/py38bebop/include/python3.8/torchsort Check the logs for full command output.

I also noticed I only have

/path/to/miniconda3/envs/env_name/bin/x86_64-conda_cos6-linux-gnu

and not

/path/to/miniconda3/envs/env_name/bin/x86_64-conda_cos6-linux-gnu-g++

(i also tried sheryc's suggestion but that failed)

i'm a bit lost :laughing:

teddykoker commented 1 year ago

@MushroomHunting do you happen to have the full logs for the compile errors? Might help us narrow down the issue.

MushroomHunting commented 1 year ago

i've attached the terminal trace

for reference, the exact commands i ran as per the readme were:

sudo conda install -c conda-forge gxx_linux-64
export CXX=/home/ein/.conda/envs/py38bebop/x86_64-conda_cos6-linux-gnu
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/anaconda/lib
pip install --force-reinstall --no-cache-dir --no-deps torchsort

trace.txt

teddykoker commented 1 year ago

@MushroomHunting it looks like your CXX (the c++ compiler) path is likely incomplete. I'm guessing it would be one of the following:

export CXX=/home/ein/.conda/envs/py38bebop/x86_64-conda_cos6-linux-gnu-g++
export CXX=/home/ein/.conda/envs/py38bebop/bin/x86_64-conda_cos6-linux-gnu-g++

This is evident from the trace:

File "/home/ein/.conda/envs/py38bebop/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 280, in check_compiler_ok_for_platform
        which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
      File "/home/ein/.conda/envs/py38bebop/lib/python3.8/subprocess.py", line 415, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "/home/ein/.conda/envs/py38bebop/lib/python3.8/subprocess.py", line 516, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['which', '/home/ein/.conda/envs/py38bebop/x86_64-conda_cos6-linux-gnu']' returned non-zero exit status 1.

This shows the which command failing because the path likely does not exist (this is a check PyTorch performs before building the extension). You can verify that your CXX path is correct by running $CXX --version, which should confirm that the C++ compiler is installed and runs.
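
If it helps, the same check can be scripted before kicking off the build. This is only a sketch: check_compiler is a hypothetical helper, not part of PyTorch or torchsort, and it simply resolves the value of CXX and runs it with --version. A path that points at a directory (as in the trace above) is reported as a problem, since shutil.which only accepts executable files.

```python
import os
import shutil
import subprocess


def check_compiler(cxx):
    """Return (ok, message) for a candidate C++ compiler path or name."""
    if not cxx:
        return False, "CXX is not set"
    # Accepts a bare command name or a full path; directories are rejected.
    resolved = shutil.which(cxx)
    if resolved is None:
        return False, f"{cxx!r} is not an executable file"
    try:
        result = subprocess.run(
            [resolved, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"could not run {resolved!r}: {exc}"
    if result.returncode != 0:
        return False, f"{resolved!r} --version exited with {result.returncode}"
    lines = result.stdout.strip().splitlines()
    # Report the first line of the version banner when there is one.
    return True, lines[0] if lines else resolved


if __name__ == "__main__":
    ok, message = check_compiler(os.environ.get("CXX"))
    print("OK:" if ok else "PROBLEM:", message)
```

Run it in the same shell where CXX was exported, otherwise it will just report that CXX is not set.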

MushroomHunting commented 1 year ago

Got some progress!

the compiler was in the /bin/ folder and this time it successfully compiled!

Collecting torchsort
  Downloading torchsort-0.1.9.tar.gz (12 kB)
Building wheels for collected packages: torchsort
  Building wheel for torchsort (setup.py) ... done
  Created wheel for torchsort: filename=torchsort-0.1.9-cp38-cp38-linux_x86_64.whl size=85341 sha256=a455aeafab02d6c445beb2bd2033a1ab594792078c9f5a735331c732d8f3fe9d
  Stored in directory: /tmp/pip-ephem-wheel-cache-bmg93oki/wheels/be/e8/6a/e7a9207a4e2652d424aa5705c2b787a93775f32ab61c0c08a4
Successfully built torchsort
Installing collected packages: torchsort
  Attempting uninstall: torchsort
    Found existing installation: torchsort 0.1.9
    Uninstalling torchsort-0.1.9:
      Successfully uninstalled torchsort-0.1.9
Successfully installed torchsort-0.1.9

but when I run the demo code it gives error:

Traceback (most recent call last):
  File "/home/ein/.conda/envs/py38bebop/lib/python3.8/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 2, in <module>
  File "/home/ein/.conda/envs/py38bebop/lib/python3.8/site-packages/torchsort/ops.py", line 56, in soft_sort
    return SoftSort.apply(values, regularization, regularization_strength)
  File "/home/ein/.conda/envs/py38bebop/lib/python3.8/site-packages/torchsort/ops.py", line 140, in forward
    sol = isotonic_l2[s.device.type](w - s)
  File "/home/ein/.conda/envs/py38bebop/lib/python3.8/site-packages/torchsort/ops.py", line 31, in _error
    raise ImportError(
ImportError: You are trying to use the torchsort CUDA extension, but it looks like it is not available. Make sure you have the CUDA toolchain installed, and reinstall torchsort with `pip install --force-reinstall --no-cache-dir torchsort` to rebuild the extension.

When I run $CXX --version it gives x86_64-conda_cos6-linux-gnu-g++ (crosstool-NG 1.24.0.133_b0863d8_dirty) 9.3.0

From the readme it sounds like it should be 9.40 ? (or is that a typo and should be 9.4.0? I tried both but conda can't find either to install)

EDIT: So conda could find and install 9.4.0. I reinstalled but still get the same CUDA error

teddykoker commented 1 year ago

Okay great. I'm guessing the particular version of g++ doesn't really matter. Does the demo work without CUDA? If so, that means at least the C++ extension compiled correctly.

If it is now just the CUDA portion that is not working, I'm thinking this could now be the problem above with setup.py checking for nvcc. If you run which nvcc does anything show up?

teddykoker commented 1 year ago

Looking at this SO post, it appears nvcc is not actually included in cudatoolkit but can be installed with:

conda install -c conda-forge cudatoolkit-dev=<version>

nvcc should then be in your PATH, which you can verify with which nvcc, and if that all works reinstalling the extension should work.
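
After reinstalling, one rough way to confirm whether the CUDA part was actually built is to look for the compiled modules. The sketch below is an assumption-heavy illustration: module_available is a hypothetical helper, and the names torchsort.isotonic_cpu and torchsort.isotonic_cuda are assumed module names that do not appear anywhere in this thread, so adjust them if the package layout differs. Note that find_spec imports parent packages (here torchsort, and therefore torch) but not the extension module itself.

```python
import importlib.util


def module_available(name):
    """Return True if the module `name` can be located on this interpreter."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the module has no spec.
        return False


if __name__ == "__main__":
    # Assumed module names, listed for illustration only.
    names = (
        "torch",
        "torchsort",
        "torchsort.isotonic_cpu",
        "torchsort.isotonic_cuda",
    )
    for name in names:
        status = "found" if module_available(name) else "missing"
        print(f"{name}: {status}")
```

If the CPU module is found but the CUDA one is missing, the build most likely skipped the CUDA extension, which would point back at the nvcc check in setup.py.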

MushroomHunting commented 1 year ago

@teddykoker Amazing. This solved it; thank you so much!

happy days :D

tensor([[0.5556, 1.5556, 2.5556, 3.5556, 4.5556, 5.5556, 6.5556, 7.5556, 8.5556]], device='cuda:0', grad_fn=<SoftSortBackward>)