Bug: Existence of system-wide version of a shared library causes `undefined symbol` error

Garbaz commented 1 month ago

To reproduce (assuming you have libnvjitlink12 installed system-wide, and in a different version):

library(reticulate)

venv_name <- "deleteme_5267"
virtualenv_create(venv_name)
use_virtualenv(venv_name)

py_install("torch", pip = TRUE)

pytorch  <- import("torch")

The final line gives me this error:

Error in py_module_import(module, convert = convert) : 
  ImportError: /home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12

Checking `nm -gDC ~/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12 | grep nvJitLinkAddData`, I get:

0000000000262eb0 T nvJitLinkAddData@@libnvJitLink.so.12
0000000000263070 T __nvJitLinkAddData_12_0@@libnvJitLink.so.12
0000000000263080 T __nvJitLinkAddData_12_1@@libnvJitLink.so.12
0000000000263090 T __nvJitLinkAddData_12_2@@libnvJitLink.so.12
00000000002630a0 T __nvJitLinkAddData_12_3@@libnvJitLink.so.12
00000000002630b0 T __nvJitLinkAddData_12_4@@libnvJitLink.so.12
00000000002630c0 T __nvJitLinkAddData_12_5@@libnvJitLink.so.12
00000000002630d0 T __nvJitLinkAddData_12_6@@libnvJitLink.so.12

So the version of libnvJitLink.so.12 in the virtualenv has the symbol. And if I activate the virtualenv normally in a shell and import torch from inside a normal Python REPL I don't get any errors. So it's not the fault of libcusparse.so.12.

The thing is though, the library libnvJitLink.so.12 is also installed system-wide, but in a different version. Checking there with nm -gDC /usr/lib/x86_64-linux-gnu/libnvJitLink.so.12 | grep nvJitLinkAddData, I get only:

0000000000226bd0 T __nvJitLinkAddData_12_0@@libnvJitLink.so.12

And when I remove the system-wide version of the library with

sudo apt remove libnvjitlink12:amd64

the error no longer occurs.

It appears that if there is a system-wide version of a shared library, it is preferred over the local version in the virtualenv. This is not how things should be!
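The comparison of the two `nm` listings above can be automated. The following is a minimal sketch, not part of the original report: the helper name `exported_symbols` is hypothetical, and the sample text is a subset of the `nm -gDC` output quoted in this comment. It assumes plain C symbol names without spaces, which holds for the lines shown here.

```python
def exported_symbols(nm_output):
    """Collect defined text ('T') symbol names from `nm -gDC` output.

    The '@@version' suffix is dropped so that only the symbol name is compared.
    """
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Expected shape: <address> <type> <name>[@@version]
        if len(parts) == 3 and parts[1] == "T":
            symbols.add(parts[2].split("@")[0])
    return symbols


# Subset of the virtualenv listing quoted in the report.
venv_nm = """\
0000000000262eb0 T nvJitLinkAddData@@libnvJitLink.so.12
0000000000263070 T __nvJitLinkAddData_12_0@@libnvJitLink.so.12
0000000000263080 T __nvJitLinkAddData_12_1@@libnvJitLink.so.12
0000000000263090 T __nvJitLinkAddData_12_2@@libnvJitLink.so.12
"""

# The system-wide listing quoted in the report.
system_nm = "0000000000226bd0 T __nvJitLinkAddData_12_0@@libnvJitLink.so.12\n"

# Symbols the virtualenv copy exports that the system-wide copy lacks.
missing = sorted(exported_symbols(venv_nm) - exported_symbols(system_nm))
print(missing)
```

The symbol named in the `ImportError`, `__nvJitLinkAddData_12_1`, shows up in `missing`, which is consistent with the system-wide copy being the one that gets resolved.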

R version is 4.4.1 (2024-06-14) and reticulate version is reticulate_1.38.0.

Garbaz commented 1 month ago

To be clear, sudo apt remove libnvjitlink12:amd64 is not really a solution to this problem.

t-kalinowski commented 1 month ago

Thanks for reporting!

Are you using the RStudio IDE? Does this happen only in the RStudio IDE, or outside the IDE too?

Garbaz commented 1 month ago

Ah, I should have added that I'm using RStudio Server. And I should have tested running the repro code directly in R.

I don't have access to a machine at the moment where I can test running the code in normal RStudio Desktop, so I can't check whether it's an RStudio Server-specific issue. But running source("repro.R"), where repro.R contains the repro code:

library(reticulate)

venv_name <- "deleteme_5267"
virtualenv_create(venv_name)
use_virtualenv(venv_name)

py_install("torch", pip = TRUE)

pytorch  <- import("torch")

~~I do not get the error. And running e.g. pytorch$cuda$is_available() works as expected.~~

~~So it appears to be an interaction between RStudio (Server) and reticulate that is the issue.~~

Garbaz commented 1 month ago

Wait, scratch that, I forgot I uninstalled libnvjitlink12 to temporarily fix the issue. Reinstalling it, I get the same error in plain R!

So it has nothing to do with R Studio (Server) in particular.

t-kalinowski commented 1 month ago

I don't think reticulate is modifying the order of loaded libs.

If this occurs with reticulate::import("torch") in R, but not in a terminal with ~/.virtualenvs/r-torch/bin/python -c 'import torch', then it's likely that something in the R session is either:

- Modifying LD_LIBRARY_PATH
- Pre-loading the "wrong" libnvjitlink12 for some reason.

Can you please double-check the value of Sys.getenv("LD_LIBRARY_PATH") in R, and also, inspect other R startup files for code that might be causing this (.Rprofile, .Renviron, etc.)?
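To check the pre-loading hypothesis directly, one option is to list which files matching a name are already mapped into the process before `torch` is imported. This is a rough sketch that assumes Linux, where `/proc/self/maps` lists every mapped file. The function name `mapped_libraries` is hypothetical and not part of reticulate.

```python
import re
from pathlib import Path


def mapped_libraries(pattern, maps_text=None):
    """Return the distinct mapped file paths whose name matches `pattern`.

    If `maps_text` is None, the current process is inspected (Linux only).
    """
    if maps_text is None:
        maps_text = Path("/proc/self/maps").read_text()
    regex = re.compile(pattern, re.IGNORECASE)
    paths = []
    for line in maps_text.splitlines():
        # Line format: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) < 6:
            continue  # anonymous mapping without a path
        path = fields[5].strip()
        if regex.search(path) and path not in paths:
            paths.append(path)
    return paths


if Path("/proc/self/maps").exists():
    # An empty list means no matching library is loaded yet.
    print(mapped_libraries("nvjitlink"))
```

Run inside the R session's Python (for example through `reticulate::py_run_string()`) before the import, a non-empty result would suggest that something has already loaded a copy of the library. An empty result would point toward the search path instead.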

Garbaz commented 1 month ago

Both Sys.getenv("LD_LIBRARY_PATH") and os <- import("os"); os$environ["LD_LIBRARY_PATH"] give:

"/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server"

What I do find weird is that there is no mention of the virtualenv, even in os$environ["LD_LIBRARY_PATH"], even though, evidently, the libraries from the virtualenv are found.
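One possible explanation, offered as an assumption rather than something confirmed in this thread: if the virtualenv's libraries locate their dependencies through a `DT_RUNPATH` entry, the glibc dynamic loader searches the directories in `LD_LIBRARY_PATH` before those `DT_RUNPATH` directories. The R session's `LD_LIBRARY_PATH` shown above contains `/usr/lib/x86_64-linux-gnu`, so a system-wide copy there would be found first. The sketch below only simulates that ordering with temporary directories. `first_match` is a hypothetical helper, not a loader API.

```python
import os
import tempfile


def first_match(lib_name, ld_library_path, runpath_dirs):
    """Return the first directory containing lib_name.

    Mimics the relative order glibc uses for these two sources:
    LD_LIBRARY_PATH entries first, then DT_RUNPATH entries.
    """
    search = [d for d in ld_library_path.split(":") if d] + list(runpath_dirs)
    for directory in search:
        if os.path.exists(os.path.join(directory, lib_name)):
            return directory
    return None


with tempfile.TemporaryDirectory() as root:
    # Stand-ins for /usr/lib/x86_64-linux-gnu and the virtualenv's lib directory.
    system_dir = os.path.join(root, "system")
    venv_dir = os.path.join(root, "venv")
    for d in (system_dir, venv_dir):
        os.makedirs(d)
        open(os.path.join(d, "libnvJitLink.so.12"), "w").close()

    # R session: LD_LIBRARY_PATH includes the system directory.
    in_r = first_match("libnvJitLink.so.12", "/nonexistent-r-lib:" + system_dir, [venv_dir])
    # Plain shell: LD_LIBRARY_PATH is unset, so only the runpath is consulted.
    in_shell = first_match("libnvJitLink.so.12", "", [venv_dir])

    print("R session picks system copy:", in_r == system_dir)
    print("Plain shell picks venv copy:", in_shell == venv_dir)
```

Under that assumption, the virtualenv does not need to appear in `LD_LIBRARY_PATH` for its libraries to be found, but any directory that does appear there takes precedence.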

Garbaz commented 1 month ago

Okay, it appears Python does not simply use the LD_LIBRARY_PATH environment variable. At least when I run os.environ["LD_LIBRARY_PATH"] in the normal python REPL (from the virtualenv), I get a key error.

However, Python does use an environment variable PYTHONPATH. Running os <- import("os"); os$environ["PYTHONPATH"] in R I get:

"/usr/local/lib/R/site-library/reticulate/config:/usr/lib/python312.zip:/usr/lib/python3.12:/usr/lib/python3.12/lib-dynload:/home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages:/usr/local/lib/R/site-library/reticulate/python"

I will investigate whether I can fix the issue by messing with PYTHONPATH.

Update: I have experimented with both LD_LIBRARY_PATH and PYTHONPATH and could not get the issue to go away. I will continue trying to figure this out later this week.
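A workaround that may be worth trying, sketched under the assumption that the loader reuses an already-loaded library with the same soname: load the virtualenv's copy of `libnvJitLink` explicitly before importing `torch`. The relative path `nvidia/nvjitlink/lib` is taken from the `nm` command earlier in this thread. The function names are hypothetical, and this has not been verified against this setup.

```python
import ctypes
import glob
import os
import sysconfig


def find_venv_nvjitlink(site_packages=None):
    """Return the path of libnvJitLink inside site-packages, or None if absent."""
    if site_packages is None:
        site_packages = sysconfig.get_paths()["purelib"]
    pattern = os.path.join(
        site_packages, "nvidia", "nvjitlink", "lib", "libnvJitLink.so.*"
    )
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


def preload_nvjitlink(site_packages=None):
    """Load the virtualenv's libnvJitLink with global symbol visibility.

    Returns the loaded path, or None if no copy was found.
    """
    path = find_venv_nvjitlink(site_packages)
    if path is not None:
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    return path
```

Calling `preload_nvjitlink()` in the embedded Python session and then running `import torch` would show whether forcing the virtualenv copy avoids the `undefined symbol` error.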

Garbaz commented 1 month ago

By the way, py_last_error() gives:

--- Python Exception Message
Traceback (most recent call last):
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 122, in _find_and_load_hook
    return _run_hook(name, _hook)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 96, in _run_hook
    module = hook()
             ^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 120, in _hook
    return _find_and_load(name, import_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/torch/__init__.py", line 290, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 122, in _find_and_load_hook
    return _run_hook(name, _hook)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 96, in _run_hook
    module = hook()
             ^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 120, in _hook
    return _find_and_load(name, import_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: /home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12
--- R Traceback
    ▆
 1. └─reticulate::import("torch")
 2.   └─reticulate:::py_module_import(module, convert = convert)
See `reticulate::py_last_error()$r_trace$full_call` for more details.

In case that's of any help.

t-kalinowski commented 1 month ago

I am unable to reproduce locally.

Note that PyTorch can be installed a few different ways, depending on your environment. You may want to consult https://pytorch.org/get-started/locally/ and see if there is something that will work better for you than a bare pip install torch (e.g., pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124).

Garbaz commented 1 month ago

I do not think this has anything to do with torch in particular. The reason torch recommends using their own package index has to do with driver/CUDA version incompatibilities and shouldn't change anything about the issue here. I will, however, try this for completeness.

rstudio / reticulate

Bug: Existence of system-wide version of a shared library causes `undefined symbol` error #1640