google-deepmind / launchpad

Apache License 2.0
309 stars 35 forks source link

Feature request: Support Python installed in non-system locations natively. #25

Open ethanluoyc opened 2 years ago

ethanluoyc commented 2 years ago

Hi,

Thank you so much for providing this awesome library. I have been using it extensively with Acme and Reverb and found it great for my daily research.

Right now when using launchpad and reverb in Python not installed via the OS's package manager (for example, using Conda), both Launchpad and Reverb will throw an error. Here is an example of the stack trace tested with launchpad 0.5.0.

python -c "import launchpad"
# Traceback (most recent call last):
#  File "<string>", line 1, in <module>
#  File "/opt/conda/lib/python3.8/site-packages/launchpad/__init__.py", line 36, in <module>
#   from launchpad.nodes.courier.node import CourierHandle
#  File "/opt/conda/lib/python3.8/site-packages/launchpad/nodes/courier/node.py", line 21, in <module>
#    import courier
#  File "/opt/conda/lib/python3.8/site-packages/courier/__init__.py", line 26, in <module>
#    from courier.python.client import Client  # pytype: disable=import-error
#  File "/opt/conda/lib/python3.8/site-packages/courier/python/client.py", line 30, in <module>
#    from courier.python import py_client
# ImportError: libpython3.8.so.1.0: cannot open shared object file: No such file or directory

This happens because the location where Conda installs a copy of libpython3.8.so.1.0 is in /opt/conda/lib which is not in the linker's default path for searching the shared libraries. Tweaking the LD_LIBRARY_PATH to be /opt/conda/lib would fix the problem, but this can create a problem for new users who want to install launchpad to start using it right away. This issue has been reported previously https://github.com/tensorflow/agents/issues/724 https://github.com/deepmind/acme/issues/47 and the solution is to tweak the LD_LIBRARY_PATH.

Looking closer at the shared library produced by bazel, it seems that the rpath does not include a relative directory to the lib directory of Conda. Looking into the shared library created

readelf --dynamic /opt/conda/lib/python3.8/site-packages/courier/python/py_client.so
# 0x000000000000001d (RUNPATH)            Library runpath: [$ORIGIN/../../_solib_k8/_U_S_Scourier_Ccourier_Uservice_Ucc_Uproto___Ucourier:$ORIGIN/../../_solib_k8/_U_S_Scourier_Sserialization_Cserialization_Ucc_Uproto___Ucourier_Sserialization:$ORIGIN/../../_solib_k8/_U@tensorflow_Usolib_S_S_Cframework_Ulib___Utensorflow_Usolib_Stensorflow_Usolib:$ORIGIN/../../_solib_k8/_U@python_Uincludes_S_S_Cpython_Uincludes___Upython_Uincludes:$ORIGIN/:$ORIGIN/..]

Tweaking the LD_LIBRARY_PATH is perhaps not really ideal. I have been looking into if it is possible to update the BUILD files to fix this problem, but my limited knowledge of Bazel does not get me very far. The closest place which I believe would need update is https://github.com/deepmind/launchpad/blob/3a7d71d32a4f961b5835cc4256ea461880a6bc10/launchpad/build_defs.bzl#L427 and https://github.com/deepmind/launchpad/blob/3a7d71d32a4f961b5835cc4256ea461880a6bc10/launchpad/build_defs.bzl#L499

Perhaps including an additional level up in generating the rpath should fix the problem, but I am to sure if doing so would have an adverse effect. In contrast, I have to encountered this issue using either tensorflow or jax (by pip installing them into my conda environment instead of using the conda package manager) which also uses pybind so I suspect that they must have done something to solve this. I have also cross-referenced with https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorflow.bzl but I couldn't figure out what they did differently. I found the way TF sets up pybind extensions is very similar to what's done in Launchpad so I don't know why there is a difference in behavior. JAX uses TF's build defs for the pybind extensions so I assume the solution to fix this problem should exist in TF's Bazel configuration.

It would be great if both Launchpad and Reverb can support this use case by default. If you can point me in the right direction I am also happy to create a PR to resolve this.

Thanks!