rapidsai / ucx-wheels

ucx_perftest binary missing linking information #12

Open jameslamb opened 3 days ago

jameslamb commented 3 days ago

Description

UCX provides a CLI, ucx_perftest, for running performance tests (example from UCX docs).

While investigating https://github.com/rapidsai/ucx-py/issues/1072, @pentschev attempted to use that tool bundled in the wheels produced here, and found that it segfaulted immediately. The root cause looked to be missing linking information.

In #11, removing this invocation of auditwheel repair appeared to leave that linking in place:

https://github.com/rapidsai/ucx-wheels/blob/ff1946193342bdc75c082c3d6aa153d8cdd400a9/ci/build_wheel.sh#L15

And that change alone allowed ucx_perftest to execute successfully 🎉

That should be investigated, and changes might be required for the build here.

Reproducible Example

On an x86_64 system with CUDA 12.2

pip install 'libucx-cu12==1.17.0'
SITE_PACKAGES=$(python -c "import site; print(site.getsitepackages()[0])")

${SITE_PACKAGES}/libucx/bin/ucx_perftest
# Segmentation fault (core dumped)

ldd "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# (empty)

Notes

Some relevant notes in the OpenUCX docs:

jameslamb commented 3 days ago

Alright, I don't have a full explanation and suggested fix yet, but stopping to put up some notes.

So I can see here, interactively, that the linking information looks correct at the end of the wheel build and ucx_perftest appears to be doing what we want. Something auditwheel repair is doing is leaving it in a bad state (just as @pentschev found on #11).

Full code to build and unpack the wheel:

```shell
# get auditwheel source (used later in debugging)
git clone \
    git@github.com:pypa/auditwheel.git \
    ./auditwheel-src

docker run \
    --rm \
    -v $(pwd):/opt/work \
    -w /opt/work \
    -it rapidsai/ci-wheel:cuda12.2.2-rockylinux8-py3.11 \
    bash

rm -rf ./dist
rm -rf ./final_dist
rm -rf ./unzipped_contents
rm -rf ./unzipped-post-auditwheel

pip uninstall --yes auditwheel
pip install -e ./auditwheel-src

# move to a different directory not mounted in, to avoid those annoying docker 'permission denied'
# issues when files are changed by the build process
cp -R $(pwd) /tmp/ucx-wheels
cd /tmp/ucx-wheels/python/libucx

python -m pip wheel \
    -w dist \
    -v \
    --no-deps \
    --disable-pip-version-check \
    .

mkdir -p ./unzipped-contents
unzip \
    ./dist/libucx*.whl \
    -d ./unzipped-contents
```

ldd ./unzipped-contents/libucx/bin/ucx_perftest
ldd output:

```text
linux-vdso.so.1 (0x00007f922817b000)
libucp.so.0 => /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib/libucp.so.0 (0x00007f922807f000)
libuct.so.0 => /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib/libuct.so.0 (0x00007f9228036000)
libucs.so.0 => /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib/libucs.so.0 (0x00007f9227fbd000)
libm.so.6 => /lib64/libm.so.6 (0x00007f9227bcb000)
libucm.so.0 => /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib/libucm.so.0 (0x00007f9227f97000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f92279c7000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f92277a7000)
librt.so.1 => /lib64/librt.so.1 (0x00007f922759f000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f9227367000)
libc.so.6 => /lib64/libc.so.6 (0x00007f9226f91000)
/lib64/ld-linux-x86-64.so.2 (0x00007f9227f4d000)
```

objdump -x  ./unzipped-contents/libucx/bin/ucx_perftest | grep PATH
#   RUNPATH              /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib
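
For anyone who wants to script that `objdump -x ... | grep PATH` check across several binaries, here is a minimal sketch. It assumes the dynamic-section lines look like the one above (tag name, whitespace, colon-separated paths); `search_paths` is a hypothetical helper name, not part of any tool mentioned in this thread.

```python
import re

# Matches dynamic-section lines as printed by `objdump -x`, e.g.
#   "  RUNPATH              /some/dir:/other/dir"
_PATH_ENTRY = re.compile(r"^\s*(RPATH|RUNPATH)\s+(\S.*?)\s*$")


def search_paths(objdump_output: str) -> dict[str, list[str]]:
    """Return {'RPATH': [...], 'RUNPATH': [...]} parsed from `objdump -x` text."""
    found: dict[str, list[str]] = {"RPATH": [], "RUNPATH": []}
    for line in objdump_output.splitlines():
        match = _PATH_ENTRY.match(line)
        if match:
            tag, value = match.groups()
            # Entries are colon-separated, like PATH.
            found[tag].extend(p for p in value.split(":") if p)
    return found


# Sample text: the RUNPATH line is copied from the output above,
# the NEEDED line is only there to show that other entries are ignored.
sample = """
  NEEDED               libucp.so.0
  RUNPATH              /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib
"""
print(search_paths(sample))
```

The text would come from something like `subprocess.check_output(["objdump", "-x", path], text=True)`; the parsing is kept separate so it can be checked without a binary at hand.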

And ucx_perftest appears to load and run successfully.

./unzipped-contents/libucx/bin/ucx_perftest

# [1729282637.710255] [ea8c4a832eab:23962:0]        perftest.c:793  UCX  WARN  CPU affinity is not set (bound to 80 cpus). 
# Performance may be impacted.
# Waiting for connection...

I installed auditwheel in editable mode so I could fiddle around with it (dropping into a debugger, adding print statements, etc.). I patched it to print out every system command it's running.

That patch:

```diff
diff --git a/src/auditwheel/patcher.py b/src/auditwheel/patcher.py
index 67367c9..1baca3c 100644
--- a/src/auditwheel/patcher.py
+++ b/src/auditwheel/patcher.py
@@ -3,7 +3,13 @@ from __future__ import annotations
 import re
 from itertools import chain
 from shutil import which
-from subprocess import CalledProcessError, check_call, check_output
+from subprocess import CalledProcessError, check_call as subpr_check_call, check_output
+
+
+def check_call(args: list):
+    arg_str = " ".join(args)
+    print(f"(command) '{arg_str}'")
+    subpr_check_call(args)
 
 
 class ElfPatcher:
diff --git a/src/auditwheel/repair.py b/src/auditwheel/repair.py
index 85e3ca3..0723c6b 100644
--- a/src/auditwheel/repair.py
+++ b/src/auditwheel/repair.py
@@ -10,7 +10,7 @@ import stat
 from os.path import abspath, basename, dirname, exists, isabs
 from os.path import join as pjoin
 from pathlib import Path
-from subprocess import check_call
+from subprocess import check_call as subpr_check_call
 from typing import Iterable
 
 from auditwheel.patcher import ElfPatcher
@@ -23,6 +23,10 @@ from .wheeltools import InWheelCtx, add_platforms
 
 logger = logging.getLogger(__name__)
 
+def check_call(args: list):
+    arg_str = " ".join(args)
+    print(f"(command) '{arg_str}'")
+    subpr_check_call(args)
 
 # Copied from wheel 0.31.1
 WHEEL_INFO_RE = re.compile(
```

Then ran it just as it's run in CI, but redirecting the output to a file.

Code to do that:

```shell
python -m auditwheel -vvv repair \
    -w final_dist \
    --exclude "libcuda.so.1" \
    --exclude "libnvidia-ml.so.1" \
    --exclude "libucm.so.0" \
    --exclude "libuct.so.0" \
    --exclude "libucs.so.0" \
    --exclude "libucp.so.0" \
    dist/* \
    > /opt/work/auditwheel.txt 2>&1
```
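
The log is large, so a small filter helps to pull the external commands back out of it. This is a rough sketch that assumes each command was printed in the `(command) '...'` form produced by the debugging patch; `logged_commands` is a hypothetical helper, and the sample log text is illustrative rather than copied from the real log.

```python
import os
import re

# The debugging patch prints each external command as:
#   (command) '<args joined by spaces>'
_COMMAND_LINE = re.compile(r"^\(command\) '(.*)'$")


def logged_commands(log_text: str, tool: str = "patchelf") -> list[str]:
    """Return the logged command strings whose executable is `tool`."""
    commands = []
    for line in log_text.splitlines():
        match = _COMMAND_LINE.match(line.strip())
        if not match:
            continue
        command = match.group(1)
        # Compare on the basename so an absolute path to the tool also matches.
        executable = command.split(" ", 1)[0]
        if os.path.basename(executable) == tool:
            commands.append(command)
    return commands


# Illustrative log text (not the real auditwheel output).
sample_log = "\n".join([
    "some unrelated log line",
    "(command) 'patchelf --remove-rpath /tmp/tmp8v5ujsmi/libucx/bin/ucx_perftest'",
    "(command) 'patchelf --force-rpath --set-rpath $ORIGIN/../../libucx_cu12.libs /tmp/tmp8v5ujsmi/libucx/bin/ucx_perftest'",
])
for command in logged_commands(sample_log):
    print(command)
```

Against the real file, the text would come from reading `/opt/work/auditwheel.txt`.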

From that, I see that auditwheel repair is running the following:

patchelf --set-soname libgomp-24e2ab19.so.1.0.0 libucx_cu12.libs/libgomp-24e2ab19.so.1.0.0
patchelf --replace-needed libgomp.so.1 libgomp-24e2ab19.so.1.0.0 libucx/bin/ucx_perftest
patchelf --remove-rpath /tmp/tmp8v5ujsmi/libucx/bin/ucx_perftest
patchelf --force-rpath --set-rpath $ORIGIN/../../libucx_cu12.libs /tmp/tmp8v5ujsmi/libucx/bin/ucx_perftest

Which then leaves ucx_perftest looking like this:

pip install ./final_dist/*.whl
SITE_PACKAGES=$(python -c "import site; print(site.getsitepackages()[0])")
objdump -x "${SITE_PACKAGES}/libucx/bin/ucx_perftest" | grep PATH
#   RPATH                $ORIGIN/../../libucx_cu12.libs

Notice that libucx_cu12.libs? That's a problem: that directory doesn't exist.

That's a default from auditwheel. The default settings for auditwheel repair assume that all the shared libraries included in the wheel will be at {distribution_name}.libs.

It's possible to change the .libs part to something else via the -L / --lib-sdir argument, but not the {distribution_name} part, which is read directly from the wheel's filename.

Like this:

match = WHEEL_INFO_RE(wheel_fname)
dest_dir = match.group("name") + lib_sdir

(auditwheel code link)
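
To make that concrete, here is a small sketch of how a directory name like `libucx_cu12.libs` falls out of a wheel filename. The regular expression is a simplified stand-in written for this note, not auditwheel's actual `WHEEL_INFO_RE`, and the wheel filename (in particular its platform tag) is made up for illustration.

```python
import re

# Simplified wheel-filename pattern: name-version[-build]-pytag-abitag-plattag.whl
# Illustrative stand-in only; auditwheel's real WHEEL_INFO_RE differs.
WHEEL_NAME_RE = re.compile(
    r"^(?P<name>.+?)-(?P<ver>\d[^-]*)(-(?P<build>\d[^-]*))?"
    r"-(?P<pyver>[^-]+)-(?P<abi>[^-]+)-(?P<plat>[^-]+)\.whl$"
)


def default_lib_dir(wheel_fname: str, lib_sdir: str = ".libs") -> str:
    """Mimic 'distribution name + suffix' for the bundled-libraries directory."""
    match = WHEEL_NAME_RE.match(wheel_fname)
    if match is None:
        raise ValueError(f"not a wheel filename: {wheel_fname!r}")
    return match.group("name") + lib_sdir


# Hypothetical filename; only the 'libucx_cu12' prefix matters here.
print(default_lib_dir("libucx_cu12-1.17.0-py3-none-manylinux_2_28_x86_64.whl"))
# libucx_cu12.libs
```

Changing `lib_sdir` only changes the suffix, which mirrors the limitation described above: the `libucx_cu12` prefix always comes from the filename.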

We want the directory in site-packages/ to always be libucx/ (no CUDA suffix) regardless of whether libucx-cu11 or libucx-cu12 was installed, so downstream users like ucxx can unconditionally do something like this:

import libucx
libucx.load_library()

(ucxx code link)

That's customized here:

https://github.com/rapidsai/ucx-wheels/blob/ff1946193342bdc75c082c3d6aa153d8cdd400a9/python/libucx/setup.py#L33

https://github.com/rapidsai/ucx-wheels/blob/ff1946193342bdc75c082c3d6aa153d8cdd400a9/python/libucx/setup.py#L48-L52

We have, for example, wheels here called libucx-cu12 (normalized to libucx_cu12 in site-packages/) which populate site-packages/libucx when installed.
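
The mismatch is easier to see by expanding `$ORIGIN` by hand. The sketch below is a purely lexical approximation of what the dynamic loader does (it ignores symlinks and other loader rules), and it assumes a hypothetical `/site-packages` prefix and that the UCX libraries end up in `libucx/lib`; `resolve_rpath` is a made-up helper name.

```python
import posixpath


def resolve_rpath(binary_path: str, rpath: str) -> list[str]:
    """Expand $ORIGIN in each RPATH entry relative to the binary's directory."""
    origin = posixpath.dirname(binary_path)
    resolved = []
    for entry in rpath.split(":"):
        if not entry:
            continue
        expanded = entry.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
        # Lexical normalization only; the real loader does not call normpath.
        resolved.append(posixpath.normpath(expanded))
    return resolved


# Hypothetical install prefix, for illustration.
binary = "/site-packages/libucx/bin/ucx_perftest"

# What auditwheel wrote: points at a directory named after the distribution.
print(resolve_rpath(binary, "$ORIGIN/../../libucx_cu12.libs"))
# ['/site-packages/libucx_cu12.libs']

# What would point at the package's own lib/ directory instead.
print(resolve_rpath(binary, "$ORIGIN/../lib"))
# ['/site-packages/libucx/lib']
```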

I tried patching that installed CLI after the fact, but that did not work.

patchelf --print-rpath "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# $ORIGIN/../../libucx_cu12.libs

patchelf --remove-rpath "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
patchelf --print-rpath "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# (empty)

patchelf --force-rpath --set-rpath '$ORIGIN/../lib' "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# Assertion failed: splitIndex != -1 (patchelf.cc: shiftFile: 504)
# Aborted (core dumped)

patchelf --set-rpath '$ORIGIN/../lib' "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# Assertion failed: splitIndex != -1 (patchelf.cc: shiftFile: 504)
# Aborted (core dumped)

patchelf --add-rpath '$ORIGIN/../../lib' "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# Assertion failed: splitIndex != -1 (patchelf.cc: shiftFile: 504)
# Aborted (core dumped)

And that's where I'm stuck right now. ucx_perftest absolutely is an ELF-format binary, so I'm not sure how even patchelf is crashing on it 😬

file  "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
/pyenv/versions/3.11.10/lib/python3.11/site-packages/libucx/bin/ucx_perftest:
ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, 
BuildID[sha1]=95a1fecd0e621296d4ab577c75fd66c34c8138d5,
for GNU/Linux 3.2.0, with debug_info, not stripped

I've attached the full auditwheel logs here (as a file attachment, because they're large):

auditwheel-logs.txt