peastman / openmm-build-wheels

Infrastructure to build Python wheels for OpenMM
MIT License

Cuda flavor not working #9

Closed mikemhenry closed 1 month ago

mikemhenry commented 1 month ago

I made an empty env with just python=3.10 and pip:

$ micromamba create -n openmm82-pypi-cuda pip python=3.10

I then activated the environment and installed the cuda package variant:

$ pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ openmm[cuda12]

Running the installation test didn't show the CUDA platform:

$ python -m openmm.testInstallation

OpenMM Version: 8.2
Git Revision: ffb3082fe7a2cec102acefb30c373db92d035968

There are 3 Platforms available:

1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 OpenCL - Successfully computed forces

Median difference in forces between platforms:

Reference vs. CPU: 6.3083e-06
Reference vs. OpenCL: 6.75018e-06
CPU vs. OpenCL: 7.64527e-07

All differences are within tolerance.

So I checked for plugin loading failures:

$ python -c "import openmm; print(openmm.Platform.getPluginLoadFailures())"
# reformatted this slightly so it is easier to read 
('Error loading library /home/mmh/micromamba/envs/openmm82-pypi-cuda/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/libOpenMMHIP.so: libhiprtc.so.6: cannot open shared object file: No such file or directory', 
'Error loading library /home/mmh/micromamba/envs/openmm82-pypi-cuda/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/libOpenMMCUDA.so: libcufft.so.11: cannot open shared object file: No such file or directory', 
'Error loading library /home/mmh/micromamba/envs/openmm82-pypi-cuda/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/libOpenMMRPMDHIP.so: libOpenMMHIP.so: cannot open shared object file: No such file or directory', 
'Error loading library /home/mmh/micromamba/envs/openmm82-pypi-cuda/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/libOpenMMRPMDCUDA.so: libOpenMMCUDA.so: cannot open shared object file: No such file or directory', 
'Error loading library /home/mmh/micromamba/envs/openmm82-pypi-cuda/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/libOpenMMDrudeHIP.so: libOpenMMHIP.so: cannot open shared object file: No such file or directory', 
'Error loading library /home/mmh/micromamba/envs/openmm82-pypi-cuda/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/libOpenMMAmoebaHIP.so: libOpenMMHIP.so: cannot open shared object file: No such file or directory', 
'Error loading library /home/mmh/micromamba/envs/openmm82-pypi-cuda/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/libOpenMMDrudeCUDA.so: libOpenMMCUDA.so: cannot open shared object file: No such file or directory', 
'Error loading library /home/mmh/micromamba/envs/openmm82-pypi-cuda/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/libOpenMMAmoebaCUDA.so: libOpenMMCUDA.so: cannot open shared object file: No such file or directory')
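
A list like this is easier to read once it is grouped by the library that is actually missing. Here is a minimal sketch of that, assuming the messages keep the "Error loading library <plugin>: <missing>: cannot open shared object file" shape shown above. The name `group_failures` is made up for illustration and is not part of OpenMM.

```python
import re
from collections import defaultdict

# Matches the message shape shown above (an assumption about the format):
#   Error loading library <plugin path>: <missing lib>: cannot open shared object file: ...
_FAILURE = re.compile(
    r"Error loading library (?P<plugin>\S+): (?P<missing>[^:]+): "
    r"cannot open shared object file"
)


def group_failures(failures):
    """Map each missing library name to the plugin file names that needed it."""
    grouped = defaultdict(list)
    for message in failures:
        match = _FAILURE.search(message)
        if match:
            plugin_name = match.group("plugin").rsplit("/", 1)[-1]
            grouped[match.group("missing")].append(plugin_name)
    return dict(grouped)


# Usage inside the environment (requires OpenMM to be installed):
#   import openmm
#   for missing, plugins in group_failures(openmm.Platform.getPluginLoadFailures()).items():
#       print(missing, "<-", ", ".join(plugins))
```

Applied to the output above, this would leave libhiprtc.so.6 and libcufft.so.11 as the only missing external libraries. The other entries look like knock-on failures of plugins that depend on libOpenMMHIP.so or libOpenMMCUDA.so.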

The file is there, but there seems to be a linking issue?

$ ldd /home/mmh/micromamba/envs/openmm82-pypi-cuda/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/libOpenMMCUDA.so
    linux-vdso.so.1 (0x00007cadf6ba9000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007cadf6b7d000)
    libOpenMM.so => not found
    libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x00007cadf4600000)
    libcufft.so.11 => not found
    libnvrtc.so.12 => not found
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007cadf6b76000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007cadf6b71000)
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007cadf4200000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007cadf6a8a000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007cadf6a6a000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007cadf3e00000)
    /lib64/ld-linux-x86-64.so.2 (0x00007cadf6bab000)

mikemhenry commented 1 month ago

Also sorry @peastman, I didn't realize you were still debugging this!

peastman commented 1 month ago

Feel free to help debug it if you want! At the moment I'm looking into the Mac builds, which fail with a linker error. otool -l _openmm.cpython-310-darwin.so reveals

Load command 16
          cmd LC_RPATH
      cmdsize 48
         path /Users/runner/openmm-install/lib (offset 12)

That was the path to the libraries on the build machine. delocate should have replaced it...
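
For checking this in a script rather than by eye, here is a rough sketch that pulls the LC_RPATH entries out of `otool -l` output and flags the absolute ones. It assumes the text layout shown above and paths without spaces; the function names are hypothetical.

```python
def rpaths_from_otool(output):
    """Return the LC_RPATH paths found in the text printed by `otool -l`."""
    paths = []
    in_rpath = False
    for line in output.splitlines():
        fields = line.split()
        if fields[:2] == ["cmd", "LC_RPATH"]:
            in_rpath = True          # the next "path" line belongs to this command
        elif fields[:1] == ["cmd"]:
            in_rpath = False         # some other load command started
        elif in_rpath and fields[:1] == ["path"]:
            paths.append(fields[1])  # assumes the path contains no spaces
            in_rpath = False
    return paths


def absolute_rpaths(output):
    """RPATH entries that do not start with @loader_path, @rpath or @executable_path."""
    return [p for p in rpaths_from_otool(output) if not p.startswith("@")]


sample = """Load command 16
          cmd LC_RPATH
      cmdsize 48
         path /Users/runner/openmm-install/lib (offset 12)
"""
print(absolute_rpaths(sample))
```

An absolute entry like the one above only makes sense on the build machine, so any path this reports in a finished wheel is worth a second look.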

peastman commented 1 month ago

I fixed the Mac problem. I'll look at Linux next.

peastman commented 1 month ago

Here are the commands I executed:

mamba create -c conda-forge --name test python=3.10
conda activate test
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ openmm[cuda12]
python -m openmm.testInstallation

It finds all four platforms, and they all work correctly. Possibly that's because it's linking to the libraries installed globally rather than in the environment?

$ ldd ~/miniconda3/envs/test/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/libOpenMMCUDA.so 
    linux-vdso.so.1 (0x00007ffd5ebe8000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fec1fe3c000)
    libOpenMM.so => not found
    libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x00007fec1dc9e000)
    libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x00007fec0cc00000)
    libnvrtc.so.12 => /usr/local/cuda/lib64/libnvrtc.so.12 (0x00007fec08e00000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fec1dc97000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fec1dc92000)
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fec08bd4000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fec1dbab000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fec1db8b000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fec089ab000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fec200ac000)

I'm never sure how to interpret that. The Python interpreter alters library linking, so the ones found within that process may not be the ones found by ldd.
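
One way to sidestep the ldd question on Linux is to ask the running process which files it actually mapped. This is a sketch that parses the text of /proc/self/maps; `mapped_libraries` is a made-up helper name, and it assumes the CUDA plugin has already been loaded by the time the file is read.

```python
def mapped_libraries(maps_text, needle):
    """Return the unique mapped file paths whose file name contains `needle`.

    `maps_text` is the content of /proc/<pid>/maps (Linux only), where each
    line has the form: address perms offset dev inode [pathname].
    """
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if needle in path.rsplit("/", 1)[-1]:
                paths.add(path)
    return sorted(paths)


# Usage inside the environment (Linux, OpenMM installed):
#   import openmm
#   with open("/proc/self/maps") as f:
#       text = f.read()
#   print(mapped_libraries(text, "libcufft"))
#   print(mapped_libraries(text, "libnvrtc"))
```

Whatever this prints is what the Python process really loaded, which may differ from what ldd reports in a shell with a different environment.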

peastman commented 1 month ago

It looks like there's no RPATH specified in libOpenMMCUDA.so. The CUDA libraries installed with pip get put in a bunch of folders: site-packages/nvidia/cufft/lib, site-packages/nvidia/cuda_nvrtc/lib, etc. If we want that to work, we need to make it search all of those locations.
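
To make "search all of those locations" concrete, here is a small sketch that derives `$ORIGIN`-relative entries from the layout described above. The `/env/...` prefix and the function name are placeholders for illustration, and this is not necessarily how the build scripts construct the value.

```python
import posixpath


def origin_relative_rpath(plugin_dir, lib_dirs):
    """Build a colon-separated RPATH whose entries are relative to $ORIGIN.

    $ORIGIN is expanded by the dynamic loader to the directory containing
    the library that carries the RPATH (here: the plugins directory).
    """
    entries = []
    for lib_dir in lib_dirs:
        rel = posixpath.relpath(lib_dir, start=plugin_dir)
        entries.append("$ORIGIN" if rel == "." else "$ORIGIN/" + rel)
    return ":".join(entries)


site = "/env/lib/python3.10/site-packages"  # hypothetical environment prefix
plugins = site + "/OpenMM.libs/lib/plugins"

print(origin_relative_rpath(plugins, [
    site + "/OpenMM.libs/lib",        # libOpenMM.so
    site + "/nvidia/cufft/lib",       # libcufft from the pip package
    site + "/nvidia/cuda_nvrtc/lib",  # libnvrtc from the pip package
]))
# prints $ORIGIN/..:$ORIGIN/../../../nvidia/cufft/lib:$ORIGIN/../../../nvidia/cuda_nvrtc/lib
```

Relative entries keep working wherever the environment is installed, since nothing in them depends on the absolute location of site-packages. A string like this could be applied with a tool such as `patchelf --set-rpath` or through the linker flags at build time.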

peastman commented 1 month ago

Can you test the latest build from #13? It sets the RPATH, and I verified that it's present in the installed library:

$ objdump -x ~/miniconda3/envs/test/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/libOpenMMCUDA.so | grep PATH
  RUNPATH              $ORIGIN/..:$ORIGIN/../../../nvidia/cufft/lib:$ORIGIN/../../../nvidia/cuda_nvrtc/lib

When I run ldd on it, it reports

    libOpenMM.so => /home/peastman/miniconda3/envs/test/lib/python3.10/site-packages/OpenMM.libs/lib/plugins/../libOpenMM.so (0x00007fb1c77c9000)
    libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x00007fb1c562b000)
    libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x00007fb1b4600000)
    libnvrtc.so.12 => /usr/local/cuda/lib64/libnvrtc.so.12 (0x00007fb1b0800000)

It's successfully finding libOpenMM.so, but it's linking to globally installed versions of libcufft.so and libnvrtc.so. Presumably it's because that location is higher in the search path, and it would use the ones from the environment if there weren't global ones?
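
On the search-order question: with a RUNPATH (as opposed to the older RPATH), the glibc loader tries `LD_LIBRARY_PATH` first, then the RUNPATH entries, then the ld.so cache and the default system directories. The sketch below models only the first two steps. The function names and the `/env/...` prefix are assumptions for illustration, and whether `/usr/local/cuda/lib64` is on `LD_LIBRARY_PATH` on that machine is a guess that the thread does not confirm.

```python
import os
import posixpath


def candidate_dirs(origin, runpath, ld_library_path=""):
    """Directories tried, in order, before the ld.so cache and system defaults."""
    dirs = [d for d in ld_library_path.split(":") if d]   # LD_LIBRARY_PATH comes first
    for entry in runpath.split(":"):                       # then RUNPATH entries
        if entry:
            dirs.append(posixpath.normpath(entry.replace("$ORIGIN", origin)))
    return dirs


def first_match(lib_name, dirs, exists=os.path.exists):
    """Return the first existing candidate path for `lib_name`, or None."""
    for directory in dirs:
        candidate = posixpath.join(directory, lib_name)
        if exists(candidate):
            return candidate
    return None


origin = "/env/lib/python3.10/site-packages/OpenMM.libs/lib/plugins"  # hypothetical
runpath = "$ORIGIN/..:$ORIGIN/../../../nvidia/cufft/lib:$ORIGIN/../../../nvidia/cuda_nvrtc/lib"
for directory in candidate_dirs(origin, runpath, os.environ.get("LD_LIBRARY_PATH", "")):
    print(directory)
```

If the global CUDA directory only comes from the ld.so cache rather than `LD_LIBRARY_PATH`, the RUNPATH directories would be consulted before it. In that case the global copies should win only when the files are missing from the environment.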

peastman commented 1 month ago

How do I delete the existing wheel from the test server and replace it with the fixed one? If I try to delete the existing one, it warns me I won't be able to upload a new file with the same name.

mikemhenry commented 1 month ago

PyPI doesn't let you "overwrite" existing releases, so we need to change the version to something like 8.2.0rc1, or whatever convention you like from https://packaging.python.org/en/latest/specifications/version-specifiers/#examples-of-compliant-version-schemes
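
As an illustration of the rc convention, here is a tiny sketch that picks the next unused release-candidate number for a base version. The helper name is hypothetical and it only understands the plain `<base>rcN` spelling, not the full version-specifier grammar.

```python
import re

# Only the simple "<numbers>rcN" spelling is recognised here (an assumption).
_RC = re.compile(r"^(\d+(?:\.\d+)*)rc(\d+)$")


def next_release_candidate(base, uploaded):
    """Return '<base>rcN' with N one higher than any rc already uploaded."""
    taken = [
        int(match.group(2))
        for match in map(_RC.match, uploaded)
        if match and match.group(1) == base
    ]
    return f"{base}rc{max(taken, default=0) + 1}"


print(next_release_candidate("8.2.0", ["8.2"]))                   # 8.2.0rc1
print(next_release_candidate("8.2.0", ["8.2.0rc1", "8.2.0rc2"]))  # 8.2.0rc3
```

This counts upward from the highest existing number rather than reusing gaps, since a deleted file name cannot be uploaded again. Note that pip skips pre-releases by default, so testers would need `--pre` or an exact pin such as `openmm[cuda12]==8.2.0rc1`.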

mikemhenry commented 1 month ago

Also #13 worked!

OpenMM Version: 8.2
Git Revision: f5fc52ffd757a86aa1d05bd35a21108deff9eda1

There are 4 Platforms available:

1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Successfully computed forces
4 OpenCL - Successfully computed forces

Median difference in forces between platforms:

Reference vs. CPU: 6.29719e-06
Reference vs. CUDA: 6.74822e-06
CPU vs. CUDA: 7.47712e-07
Reference vs. OpenCL: 6.75018e-06
CPU vs. OpenCL: 7.6531e-07
CUDA vs. OpenCL: 1.78763e-07

All differences are within tolerance.