Open jameslamb opened 3 days ago
Alright, I don't have a full explanation and suggested fix yet, but stopping to put up some notes.
So I can see here, interactively, that the linking information looks correct at the end of the wheel build and ucx_perftest appears to be doing what we want. Something auditwheel repair is doing is leaving it in a bad state (just as @pentschev found on #11).
mkdir -p ./unzipped-contents
unzip \
./dist/libucx*.whl \
-d ./unzipped-contents
ldd ./unzipped-contents/libucx/bin/ucx_perftest
objdump -x ./unzipped-contents/libucx/bin/ucx_perftest | grep PATH
# RUNPATH /tmp/ucx-wheels/python/libucx/build/lib/libucx/lib
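For scripting that same check, here's a minimal Python sketch that pulls the RPATH / RUNPATH entries out of objdump -x output. The helper names (dynamic_search_paths, read_search_paths) are made up for illustration, and it assumes the usual objdump layout where each dynamic-section entry is a tag followed by its value.

```python
import subprocess


def dynamic_search_paths(objdump_output: str) -> dict:
    """Collect RPATH / RUNPATH entries from the text printed by `objdump -x`."""
    found = {"RPATH": [], "RUNPATH": []}
    for line in objdump_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in found:
            # Entries are colon-separated, like LD_LIBRARY_PATH.
            found[parts[0]].extend(parts[1].strip().split(":"))
    return found


def read_search_paths(binary: str) -> dict:
    """Run objdump on a binary (assumes binutils is installed and on PATH)."""
    out = subprocess.run(
        ["objdump", "-x", binary], check=True, capture_output=True, text=True
    ).stdout
    return dynamic_search_paths(out)
```

Calling read_search_paths() on the unzipped and on the installed ucx_perftest would make the before/after difference easy to diff in CI.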
And ucx_perftest appears to load and run successfully.
./unzipped-contents/libucx/bin/ucx_perftest
# [1729282637.710255] [ea8c4a832eab:23962:0] perftest.c:793 UCX WARN CPU affinity is not set (bound to 80 cpus).
# Performance may be impacted.
# Waiting for connection...
I installed auditwheel in editable mode so I could fiddle around with it (dropping into a debugger, adding print statements, etc.). I patched it to print out every system command it's running.
Then ran it just as it's run in CI, but redirecting the output to a file.
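This is not the patch I used, just a sketch of one way to get a similar log without editing auditwheel's source: temporarily wrap subprocess.Popen so every command started from Python is printed first. The name log_commands is hypothetical, and it assumes commands are passed as a string or a sequence of arguments.

```python
import contextlib
import shlex
import subprocess
import sys


@contextlib.contextmanager
def log_commands(stream=sys.stderr):
    """Print every command started through `subprocess` while the block is active."""
    original_init = subprocess.Popen.__init__

    def logging_init(self, args, *rest, **kwargs):
        # `args` is either a single string (shell=True) or a sequence of arguments.
        if isinstance(args, (str, bytes)):
            printable = args
        else:
            printable = shlex.join(map(str, args))
        print(f"+ {printable}", file=stream)
        original_init(self, args, *rest, **kwargs)

    subprocess.Popen.__init__ = logging_init
    try:
        yield
    finally:
        # Always restore the original constructor, even if the body raised.
        subprocess.Popen.__init__ = original_init
```

Because check_call(), check_output(), and run() all go through Popen, this catches them regardless of how the calling module imported them. It only sees commands launched from the same Python process, so it would have to wrap a call into auditwheel's Python entry point rather than the auditwheel executable.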
From that, I see that auditwheel repair is running the following:
patchelf --set-soname libgomp-24e2ab19.so.1.0.0 libucx_cu12.libs/libgomp-24e2ab19.so.1.0.0
patchelf --replace-needed libgomp.so.1 libgomp-24e2ab19.so.1.0.0 libucx/bin/ucx_perftest
patchelf --remove-rpath /tmp/tmp8v5ujsmi/libucx/bin/ucx_perftest
patchelf --force-rpath --set-rpath $ORIGIN/../../libucx_cu12.libs /tmp/tmp8v5ujsmi/libucx/bin/ucx_perftest
Which then leaves ucx_perftest looking like this:
pip install ./final_dist/*.whl
SITE_PACKAGES=$(python -c "import site; print(site.getsitepackages()[0])")
objdump -x "${SITE_PACKAGES}/libucx/bin/ucx_perftest" | grep PATH
# RPATH $ORIGIN/../../libucx_cu12.libs
Notice that libucx_cu12.libs? That's a problem... that directory doesn't exist.
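To make that concrete, here's a small sketch of how $ORIGIN expands: the dynamic loader substitutes the directory containing the binary. The function name and the /venv/... install location are made up for illustration, and the path handling is purely lexical (it ignores symlinks).

```python
import posixpath


def expand_origin(binary_path: str, rpath_entry: str) -> str:
    """Expand $ORIGIN to the directory containing the binary, then normalize."""
    origin = posixpath.dirname(binary_path)
    expanded = rpath_entry.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
    return posixpath.normpath(expanded)


# Hypothetical install location, for illustration only.
binary = "/venv/lib/python3.11/site-packages/libucx/bin/ucx_perftest"

# What auditwheel repair wrote: points at a directory the wheel never creates.
print(expand_origin(binary, "$ORIGIN/../../libucx_cu12.libs"))
# /venv/lib/python3.11/site-packages/libucx_cu12.libs

# Where the bundled libraries actually live.
print(expand_origin(binary, "$ORIGIN/../lib"))
# /venv/lib/python3.11/site-packages/libucx/lib
```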
That libucx_cu12.libs name is a default from auditwheel. The default settings for auditwheel repair assume that all the shared libraries included in the wheel will be at {distribution_name}.libs.
It's possible to change the .libs part to something else via the -L / --lib-sdir argument, but not the {distribution_name}... that's parsed directly from the wheel's filename.
Like this:
match = WHEEL_INFO_RE(wheel_fname)
dest_dir = match.group("name") + lib_sdir
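Here's a standalone sketch of that logic, to show where the libucx_cu12 prefix comes from. WHEEL_NAME_RE is a simplified stand-in I wrote for this example, not auditwheel's actual WHEEL_INFO_RE (which is stricter and also handles build tags), and the wheel filename below is hypothetical.

```python
import re

# Simplified stand-in for auditwheel's WHEEL_INFO_RE (assumption: no build tag).
WHEEL_NAME_RE = re.compile(
    r"^(?P<name>.+?)-(?P<ver>\d[^-]*)"
    r"-(?P<pyver>[^-]+)-(?P<abi>[^-]+)-(?P<plat>[^-]+)\.whl$"
).match


def libs_dir_for(wheel_fname: str, lib_sdir: str = ".libs") -> str:
    """Build the bundled-library directory name the same way the snippet above does."""
    match = WHEEL_NAME_RE(wheel_fname)
    if match is None:
        raise ValueError(f"not a wheel filename: {wheel_fname!r}")
    # Only the suffix is configurable; the prefix always comes from the filename.
    return match.group("name") + lib_sdir


# Hypothetical filename; the version and platform tags are made up.
print(libs_dir_for("libucx_cu12-1.15.0-py3-none-manylinux_2_28_x86_64.whl"))
# libucx_cu12.libs
```

So no choice of lib_sdir alone can produce a directory under libucx/, because the distribution name is always prepended.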
We want the directory in site-packages/ to always be libucx/ (no CUDA suffix) regardless of whether libucx-cu11 or libucx-cu12 was installed, so downstream users like ucxx can unconditionally do something like this:
import libucx
libucx.load_library()
That's customized here:
We have, for example, wheels here called libucx-cu12 (normalized to libucx_cu12 in site-packages/) which populate site-packages/libucx when installed.
I tried patching that installed CLI after the fact... did not work.
patchelf --print-rpath "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# $ORIGIN/../../libucx_cu12.libs
patchelf --remove-rpath "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
patchelf --print-rpath "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# (empty)
patchelf --force-rpath --set-rpath '$ORIGIN/../lib' "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# Assertion failed: splitIndex != -1 (patchelf.cc: shiftFile: 504)
# Aborted (core dumped)
patchelf --set-rpath '$ORIGIN/../lib' "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# Assertion failed: splitIndex != -1 (patchelf.cc: shiftFile: 504)
# Aborted (core dumped)
patchelf --add-rpath '$ORIGIN/../../lib' "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
# Assertion failed: splitIndex != -1 (patchelf.cc: shiftFile: 504)
# Aborted (core dumped)
And that's where I'm stuck right now. ucx_perftest absolutely is an ELF-format binary, so I'm not sure how even patchelf is crashing on it 😬
file "${SITE_PACKAGES}/libucx/bin/ucx_perftest"
/pyenv/versions/3.11.10/lib/python3.11/site-packages/libucx/bin/ucx_perftest:
ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2,
BuildID[sha1]=95a1fecd0e621296d4ab577c75fd66c34c8138d5,
for GNU/Linux 3.2.0, with debug_info, not stripped
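The same sanity check can be done without the file utility by reading the ELF identification bytes directly. This is only a sketch (describe_elf is a made-up name) and it looks at nothing beyond the magic number, class, and byte order, so it can't say anything about why patchelf fails.

```python
ELF_MAGIC = b"\x7fELF"


def describe_elf(path: str) -> str:
    """Return a short description based on the first 6 bytes of the file."""
    with open(path, "rb") as f:
        ident = f.read(6)
    if len(ident) < 6 or ident[:4] != ELF_MAGIC:
        return "not ELF"
    # Byte 4 is EI_CLASS (1 = 32-bit, 2 = 64-bit).
    bits = {1: "32-bit", 2: "64-bit"}.get(ident[4], "unknown class")
    # Byte 5 is EI_DATA (1 = little-endian, 2 = big-endian).
    order = {1: "LSB", 2: "MSB"}.get(ident[5], "unknown byte order")
    return f"ELF {bits} {order}"
```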
I've attached the full auditwheel logs here (as a file attachment, because it's large):
Description
UCX provides a CLI, ucx_perftest, for running performance tests (example from UCX docs).
While investigating https://github.com/rapidsai/ucx-py/issues/1072, @pentschev attempted to use that tool bundled in the wheels produced here, and found that it segfaulted immediately. The root cause looked to be missing linking information.
In #11, removing this invocation of auditwheel repair appeared to leave that linking in place: https://github.com/rapidsai/ucx-wheels/blob/ff1946193342bdc75c082c3d6aa153d8cdd400a9/ci/build_wheel.sh#L15
And that change alone allowed ucx_perftest to execute successfully 🎉
That should be investigated, and changes might be required for the build here.
Reproducible Example
On an x86_64 system with CUDA 12.2
Notes
Some relevant notes in the OpenUCX docs: