teddykoker / torchsort

Fast, differentiable sorting and ranking in PyTorch
https://pypi.org/project/torchsort/
Apache License 2.0

ninja build issue in conda environment #80

Closed: kiranchari closed this issue 6 months ago

kiranchari commented 6 months ago

I am trying to install torchsort on an HPC cluster in a conda environment (Python 3.9). I made sure nvcc (CUDA 11.7) is in the path, and I have followed the conda-specific advice in the README as well as https://github.com/teddykoker/torchsort/issues/56:

conda install -c conda-forge gxx_linux-64
export CXX=/home/user/.conda/envs/conda_env/bin/x86_64-conda_cos7-linux-gnu-g++
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/user/.conda/envs/conda_env/lib

I get the following error during the ninja build. I have attached the full trace as well: trace.txt

Traceback (most recent call last):
File "/home/user/.conda/envs/conda_env/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
  subprocess.run(
File "/home/user/.conda/envs/conda_env/lib/python3.9/subprocess.py", line 528, in run
  raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

I also noticed that when I install torchsort, the latest version of torch (2.2.0) gets installed, even though I already have torch 1.13.0 installed. I want to use torch 1.13.0, not 2.2.0. Is there a reason this happens, or a way to prevent it?

I would appreciate any help to debug this, thanks @teddykoker!

teddykoker commented 6 months ago

Looking through the stack trace, the error causing the compilation failure appears to be:

      /home/user/.conda/envs/conda_env/include/python3.9/Python.h:44:10: fatal error: crypt.h: No such file or directory
         44 | #include <crypt.h>

I believe this header is provided by libcrypt, so you might need to figure out a way to install that on your system. If you aren't able to install packages on the HPC system, perhaps this conda package would work. In any case, this doesn't appear to be related to torchsort itself.

Regarding torch getting upgraded during installation: how are you installing torchsort, and how was torch installed in the environment? torch should not be upgraded during the install as long as the existing version is newer than 1.7. Perhaps pip is not finding it in the environment and is installing the latest version instead. You can also try pip install --no-deps torchsort, which will force pip not to install any dependencies.
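
For example, a quick way to check and work around this (commands are a sketch; adjust to your environment):

    # confirm which torch version pip sees in the active environment
    pip show torch
    # install torchsort without letting pip touch its dependencies (torch included)
    pip install --no-deps torchsort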

kiranchari commented 6 months ago

Thanks for your prompt response, @teddykoker. I installed libgcrypt using conda but that did not fix the issue. I will ask the HPC admin to help me install the missing library.

teddykoker commented 6 months ago

Great. Closing for now but let me know if you run into any more issues!

kiranchari commented 6 months ago

I managed to fix the compilation issue above by adding the path to crypt.h to $CPATH so the C/C++ compiler could see it (export CPATH=/usr/include:$CPATH). While this fixed that error, torchsort still would not run in my conda environment. Strangely, when I created a new conda environment, installed PyTorch using conda (rather than pip), and then installed torchsort, it worked. I am not sure what difference installing PyTorch via conda vs. pip makes with regard to torchsort. Anyway, thanks for your help @teddykoker.

P.S. Would these compilation issues be avoided if torchsort used numba rather than pure CUDA?

On another note, I find ranking tensors of size (2048, 128), and computing the corresponding gradients, fairly slow on the GPU at each training iteration. I am using a regularization strength of 1/(128*128), as that is what I empirically found to produce accurate results. Does this affect the execution speed? Would you suggest any tips or tricks to improve torchsort's speed? Thanks!
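
For reference, a minimal sketch of the workload I am describing (the tensor shape and regularization strength are as above; the rest of the setup is assumed):

    import torch
    import torchsort

    # (2048, 128) tensor on the GPU, as described above
    x = torch.randn(2048, 128, device="cuda", requires_grad=True)
    ranks = torchsort.soft_rank(x, regularization_strength=1.0 / (128 * 128))
    # dummy scalar loss so gradients flow through the soft ranking
    ranks.sum().backward()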

teddykoker commented 6 months ago

Glad to hear you got it working! I'm not sure what would cause issues with a conda vs. pip install of torch, but I suppose they each have slightly different mechanisms for installing the other CUDA-related requirements, which could affect things.

would these compilation issues be avoided if torchsort used numba rather than pure cuda?

The compilation issues that you (and others) have had are not specific to torchsort, but a result of PyTorch's CUDA/C++ extension functionality, which can make building from source difficult. This can be avoided by offering pre-built binaries so that users do not have to build the extensions themselves, which is what many other PyTorch CUDA extension libraries do (see Facebook's fairscale and Microsoft's deepspeed). While I do not yet have the infrastructure set up to provide these on Conda/PyPI, I do provide pre-built binaries for several Python/CUDA versions, which should lower the barrier to installation.

I have not used numba for CUDA. It does require a working cudatoolkit installation, but I imagine it might be slightly less finicky than the PyTorch extensions, although I am not sure what the performance tradeoffs would be. Triton is another option for writing custom GPU kernels in Python that I have not tried yet either.
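
For illustration, here is a minimal numba CUDA kernel (just a sketch of what a GPU kernel written in Python looks like, not torchsort code):

    from numba import cuda
    import numpy as np

    @cuda.jit
    def add_one(x):
        # one thread per element; guard against out-of-range thread indices
        i = cuda.grid(1)
        if i < x.size:
            x[i] += 1.0

    arr = cuda.to_device(np.zeros(1024, dtype=np.float32))
    threads = 256
    blocks = (arr.size + threads - 1) // threads
    add_one[blocks, threads](arr)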

On another note, I find ranking tensors of size (2048, 128), and computing the corresponding gradients, fairly slow on the GPU at each training iteration. I am using a regularization strength of 1/(128*128), as that is what I empirically found to produce accurate results. Does this affect the execution speed? Would you suggest any tips or tricks to improve torchsort's speed? Thanks!

Regularization strength should not affect execution speed. Unfortunately I do not have any tips for improving the speed. I would assume there are some optimizations that could be made to the kernel to improve its efficiency, but I currently do not have the bandwidth to look into this. I am, however, happy to answer any more questions if you'd like to look into this further yourself!
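
If you do want to dig in, a rough way to time the forward and backward pass (an assumed benchmark setup; note the synchronization around CUDA work):

    import time
    import torch
    import torchsort

    x = torch.randn(2048, 128, device="cuda", requires_grad=True)

    # warm-up iteration so extension loading/compilation is not timed
    torchsort.soft_rank(x).sum().backward()
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(100):
        x.grad = None
        torchsort.soft_rank(x, regularization_strength=1.0 / (128 * 128)).sum().backward()
    torch.cuda.synchronize()
    print(f"{(time.perf_counter() - start) / 100 * 1e3:.2f} ms per iteration")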

kiranchari commented 6 months ago

Thanks for your answer @teddykoker.