rapidsai / ucx-py

Python bindings for UCX
https://ucx-py.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

implicit libnuma.so.1 dependency added in new ucx-1.11.1 in conda package #790

Open pseudotensor opened 2 years ago

pseudotensor commented 2 years ago

ldd shows:

python/lib/python3.8/site-packages/ucp/_libs/ucx_api.cpython-38-x86_64-linux-gnu.so:
        libnuma.so.1 => not found

for the

https://conda.anaconda.org/rapidsai/linux-64/ucx-1.11.1+gc58db6b-cuda11.2_0.tar.bz2
https://conda.anaconda.org/rapidsai/linux-64/ucx-proc-1.0.0-gpu.tar.bz2
https://conda.anaconda.org/rapidsai/linux-64/ucx-py-0.21.0-py38_gc58db6b_0.tar.bz2

after building a non-conflicting conda solution.

This dependency is new: it was not present until 3 days ago, when the new ucx package was uploaded to https://anaconda.org/rapidsai/ucx/files

I expect it is a mistake that this dependency was forced, since no corresponding package dependency installs libnuma.
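A quick way to confirm the same thing from Python, without importing ucp, is to ask the dynamic loader directly. This is only a minimal sketch: the helper name `can_load_shared_library` is made up for illustration, and it relies solely on the standard-library `ctypes.CDLL`, which raises `OSError` when a shared object cannot be opened.

```python
import ctypes


def can_load_shared_library(soname: str) -> bool:
    """Return True if the dynamic loader can open `soname`.

    Hypothetical helper for illustration; not part of ucx-py.
    """
    try:
        # CDLL asks the system loader to open the library by name,
        # using the same search path that `import ucp` would rely on.
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the shared object cannot be found or loaded.
        return False
    return True
```

Calling `can_load_shared_library("libnuma.so.1")` in the affected environment should return `False` for the same reason `ldd` reports `not found`.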

So one now gets things like:

ImportError while importing test module '/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/tests/test_balanced_cut.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/h2oai/dai/python/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/__init__.py:14: in <module>
    from cugraph.community import (
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/community/__init__.py:14: in <module>
    from cugraph.community.louvain import louvain
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/community/louvain.py:14: in <module>
    from cugraph.community import louvain_wrapper
cugraph/community/louvain_wrapper.pyx:21: in init cugraph.community.louvain_wrapper
    ???
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/structure/__init__.py:14: in <module>
    from cugraph.structure.graph_classes import (Graph,
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/structure/graph_classes.py:15: in <module>
    from .graph_implementation import (simpleGraphImpl,
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/structure/graph_implementation/__init__.py:14: in <module>
    from .simpleGraph import simpleGraphImpl
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/structure/graph_implementation/simpleGraph.py:14: in <module>
    from cugraph.structure import graph_primtypes_wrapper
cugraph/structure/graph_primtypes_wrapper.pyx:29: in init cugraph.structure.graph_primtypes_wrapper
    ???
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/comms/comms.py:14: in <module>
    from cugraph.raft.dask.common.comms import Comms as raftComms
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/raft/dask/__init__.py:16: in <module>
    from .common.comms import Comms
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/raft/dask/common/__init__.py:16: in <module>
    from .comms import Comms
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/raft/dask/common/comms.py:17: in <module>
    from .ucx import UCX
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/raft/dask/common/ucx.py:16: in <module>
    import ucp
/opt/h2oai/dai/python/lib/python3.8/site-packages/ucp/__init__.py:10: in <module>
    from .core import *  # noqa
/opt/h2oai/dai/python/lib/python3.8/site-packages/ucp/core.py:16: in <module>
    from . import comm
/opt/h2oai/dai/python/lib/python3.8/site-packages/ucp/comm.py:7: in <module>
    from ._libs import arr, ucx_api
E   ImportError: libnuma.so.1: cannot open shared object file: No such file or directory

pentschev commented 2 years ago

It was always the intent to depend on libnuma, see https://github.com/rapidsai/ucx-split-feedstock/blob/master/recipe/install_ucx.sh#L19 , and our users were always instructed to install it from their OS package manager. However, a bug was recently discovered and fixed in https://github.com/openucx/ucx/pull/6782 that prevented NUMA from being enabled even when requested explicitly, as we do in our conda recipes.

In previous UCX 1.9 packages we didn't have that dependency due to the UCX bug above, and in new UCX 1.11 packages we do, as it's important for UCX in certain systems. Apparently, we could depend on https://anaconda.org/conda-forge/numactl-libs-cos7-x86_64 to resolve that dependency in conda directly, but because it's specifically targeted at CentOS 7, I'm not sure whether it's a reliable package for the general user, any thoughts here @jakirkham @raydouglass ?

One thing I noticed is that RAPIDS 21.08 wasn't pinned to UCX 1.9, which causes a new environment to pick UCX 1.11 (which wasn't supported back then). So if we still want to support RAPIDS <= 21.08, we must pin UCX 1.9 or instruct users to specify ucx=1.9. What do you think, @raydouglass @quasiben?

pseudotensor commented 2 years ago

@pentschev Ok, that's good to know. So rapids <=21.08 shouldn't be used with ucx1.11 then, I should go back to ucx1.9? I was also hit by this then, since the new conda solution upgraded ucx to 1.11 and I just assumed this was ok and was trying to resolve the libnuma issue to make that work.

pentschev commented 2 years ago

> Ok, that's good to know. So rapids <=21.08 shouldn't be used with ucx1.11 then, I should go back to ucx1.9?

That's right.

> I was also hit by this then, since the new conda solution upgraded ucx to 1.11 and I just assumed this was ok and was trying to resolve the libnuma issue to make that work.

No, this wasn't anticipated. Recently we started pinning some libraries to a maximum version; I believe we should do the same with UCX.

jakirkham commented 2 years ago

> In previous UCX 1.9 packages we didn't have that dependency due to the UCX bug above, and in new UCX 1.11 packages we do, as it's important for UCX in certain systems. Apparently, we could depend on https://anaconda.org/conda-forge/numactl-libs-cos7-x86_64 to resolve that dependency in conda directly, but because it's specifically targeted at CentOS 7, I'm not sure whether it's a reliable package for the general user, any thoughts here @jakirkham @raydouglass ?

No, that's a CDT, which just uses a vendored package from CentOS 7. It's only used at build time to make our build tooling happy; I would not use it at runtime. At present, users should continue to install libnuma from their OS package manager.

pseudotensor commented 2 years ago

@jakirkham But that means the ucx package is not consistent with anything else in conda land. For no other package do I have to install something natively on the OS separately, except NVIDIA drivers. This is quite a big, awkward change. It means the conda setup is not self-contained like it should be.

jakirkham commented 2 years ago

I know it is not ideal.

Unfortunately, libnuma is one of the easier dependencies to install. The others (like MOFED) rely on someone having already installed the right libraries, drivers, etc. on the system and configured everything correctly.

We are discussing internally to see if there are ways to improve the situation to make this less of a pain to deploy.

Edit: We also previously raised this issue ( https://github.com/openucx/ucx/issues/4570 ) to discuss making libnuma optional at runtime.