nv-legate / legate.core

The Foundation for All Legate Libraries
https://docs.nvidia.com/legate/24.06/
Apache License 2.0

Allow runtime control of which libpython to use #945

Closed tylerjereddy closed 3 days ago

tylerjereddy commented 4 months ago

Working on branch branch-24.03 and commit 90944d721. The docs at https://nv-legate.github.io/legate.core/BUILD.html#building-through-install-py indicate:

> The Legate Core repository comes with a helper install.py script in the top-level directory, that will build the C++ parts of the library and install the C++ and Python components under the currently active Python environment.

I'm not sure that the part about the "currently active Python environment" is true--install.py appears to ignore my "currently active" conda environment because of the shebang at the top of the file.

For example, if I'm in a Python 3.11 conda environment, it will error out if I do ./install.py with:

```
Traceback (most recent call last):
  File "/home/treddy/github_projects/legate.core/./install.py", line 25, in <module>
    from distutils import sysconfig
ModuleNotFoundError: No module named 'distutils'
```

That happened because env python3 is version 3.12 on my Linux box, and is completely separate from the conda env in use.
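A quick way to see this kind of mismatch is to compare the interpreter that a `#!/usr/bin/env python3` shebang would pick up with the one currently running. The following is a minimal sketch using only the standard library; the function name `shebang_vs_current` is made up for illustration, and `shutil.which` only approximates the `$PATH` lookup that `env` performs.

```python
import shutil
import sys


def shebang_vs_current(name="python3"):
    """Report the first `name` on $PATH next to the running interpreter.

    `name` defaults to python3 because that is what the shebang asks for.
    """
    # shutil.which searches $PATH in order, much like `/usr/bin/env` does.
    first_on_path = shutil.which(name)
    return {
        "first_on_path": first_on_path,  # None if nothing is found
        "running": sys.executable,
    }


if __name__ == "__main__":
    info = shebang_vs_current()
    print("env python3 would resolve to:", info["first_on_path"])
    print("currently running interpreter:", info["running"])
```

Running it as `python check.py` inside the activated environment and seeing two different paths would suggest that `./install.py` and `python install.py` use different interpreters.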

To make matters worse, if I try forcing it to use my conda version of Python it claims to do so and succeeds at building from source:

```
(cunumeric) treddy@gp160:~/github_projects/legate.core$ python install.py
Verbose build is off
Using python lib and version: /home/treddy/miniforge3/envs/cunumeric/lib/libpython3.11.so, 3.11.0
```

but then if I try to get an interpreter from legate, it goes back to using the shebang (or some other) version despite the successful build (notice version 3.12 in the .so):

```
[0 - 7feeabff2c00] 0.017116 {6}{python}: libpython not loaded, dlerror: /home/treddy/lib/libpython3.12.so: cannot open shared object file: No such file or directory
Signal 6 received by node 0, process 2385905 (thread 7feeabff2c00) - obtaining backtrace
Signal 6 received by process 2385905 (thread 7feeabff2c00) at: stack trace: 10 frames
  [0] = pthread_kill@@GLIBC_2.34 at ./nptl/pthread_kill.c:44 [00007fefb4a969fc]
  [1] = raise at ../sysdeps/posix/raise.c:26 [00007fefb4a42475]
  [2] = abort at ./stdlib/abort.c:79 [00007fefb4a287f2]
  [3] = Realm::PythonInterpreter::PythonInterpreter() at unknown file:0 [00007fefb534164c]
  [4] = Realm::LocalPythonProcessor::create_interpreter() at unknown file:0 [00007fefb5341e48]
  [5] = Realm::PythonThreadTaskScheduler::python_scheduler_loop() at unknown file:0 [00007fefb534409b]
  [6] = Realm::KernelThread::pthread_entry(void*) at unknown file:0 [00007fefb532c303]
  [7] = start_thread at ./nptl/pthread_create.c:442 [00007fefb4a94ac2]
  [8] = __clone3 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 [00007fefb4b26a3f]
  [9] = unknown symbol at unknown file:0 [ffffffffffffffff]
```

If I print sys.version from /home/treddy/miniforge3/envs/cunumeric/bin/legate I nonetheless see the one I want to use: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0], which makes me wonder why the legate launcher is using a separate CPython from the one reported in the traceback. The CMake cache file with Python version seems to correctly point to the 3.11 version from my conda env when I try to force use of that version.

This is a bit confusing. It might be helpful to get improved fidelity to the CPython version used in the enclosing env and/or to be able to respect a calling version of Python for the install.

manopapad commented 4 months ago

> install.py appears to ignore my "currently active" conda environment because of the shebang at the top of the file.

> That happened because env python3 is version 3.12 on my Linux box, and is completely separate from the conda env in use.

I don't believe I've ever seen this behavior in the past. I just tested various scenarios on my local machine, and the behavior appears to be consistent: no matter what I do, the correct CPython executable is invoked.

() iblis:~> cat a.py
#!/usr/bin/env python3
import sys
print(sys.version)

before I load conda:

() iblis:~> which python
/usr/bin/python
() iblis:~> which python3
/usr/bin/python3
() iblis:~> ./a.py
3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
() iblis:~> python a.py
3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
() iblis:~> python3 a.py
3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

after I load conda, before I activate an environment:

() iblis:~> eval "$("$DEV/mambaforge/bin/conda" shell.bash hook)"
(base) iblis:~> which python
/home/mpapadakis/mambaforge/bin/python
(base) iblis:~> which python3
/home/mpapadakis/mambaforge/bin/python3
(base) iblis:~> ./a.py
3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
(base) iblis:~> python a.py
3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
(base) iblis:~> python3 a.py
3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]

after I activate an environment with python 3.12:

(test) iblis:~> which python
/home/mpapadakis/mambaforge/envs/test/bin/python
(test) iblis:~> which python3
/home/mpapadakis/mambaforge/envs/test/bin/python3
(test) iblis:~> ./a.py
3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]
(test) iblis:~> python a.py
3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]
(test) iblis:~> python3 a.py
3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]

Could you perhaps try the same and point out where the behavior diverges?

Or perhaps we're not following best practices by using the #!/usr/bin/env python3 shebang; @bryevdv do you know?

tylerjereddy commented 4 months ago

Before conda:

treddy@gp160:~/rough_work/cunumeric$ which python
treddy@gp160:~/rough_work/cunumeric$ which python3
/home/treddy/bin/python3
treddy@gp160:~/rough_work/cunumeric$ ./a.py 
3.12.0b4 (main, Jul 25 2023, 17:20:14) [GCC 11.3.0]
treddy@gp160:~/rough_work/cunumeric$ python a.py
Command 'python' not found, did you mean:
  command 'python3' from deb python3
  command 'python' from deb python-is-python3
treddy@gp160:~/rough_work/cunumeric$ python3 a.py
3.12.0b4 (main, Jul 25 2023, 17:20:14) [GCC 11.3.0]

Conda base:

(base) treddy@gp160:~/rough_work/cunumeric$ which python
/home/treddy/miniforge3/bin/python
(base) treddy@gp160:~/rough_work/cunumeric$ which python3
/home/treddy/bin/python3
(base) treddy@gp160:~/rough_work/cunumeric$ ./a.py 
3.12.0b4 (main, Jul 25 2023, 17:20:14) [GCC 11.3.0]
(base) treddy@gp160:~/rough_work/cunumeric$ python a.py 
3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
(base) treddy@gp160:~/rough_work/cunumeric$ python3 a.py
3.12.0b4 (main, Jul 25 2023, 17:20:14) [GCC 11.3.0]

conda 24.3.0

I strongly prefer to use venvs for developing. For my standard venv I see:

treddy@gp160:~$ source ~/python_venvs/py_311_scipy_dev/bin/activate
(py_311_scipy_dev) treddy@gp160:~$ which python
/home/treddy/python_venvs/py_311_scipy_dev/bin/python
(py_311_scipy_dev) treddy@gp160:~$ which python3
/home/treddy/python_venvs/py_311_scipy_dev/bin/python3

Anyway, regardless of whether my system settings are "asking for trouble," it seems like it might be helpful to have a workaround.

manopapad commented 2 months ago

We have confirmed that the installed legate script simply uses the python binary from the conda environment it is in. The real issue is that the Realm python module is getting confused by the presence of /home/treddy/bin/python3. That module tries to find the appropriate libpython.so so that it can start an embedded interpreter, and the detection code hardcodes which executable to try: https://github.com/StanfordLegion/legion/blob/stable/runtime/realm/python/python_module.cc#L136. It essentially uses whichever python3 executable is first in the $PATH and queries it about the location of libpython.so. On @tylerjereddy's machine that returns /home/treddy/lib/libpython3.12.so, which does not exist.
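The detection strategy described above can be approximated in a few lines of Python. This is only an illustrative sketch, not Realm's actual code: the helper name `guess_libpython` and the use of the `LIBDIR` and `LDLIBRARY` sysconfig variables are assumptions made for the example.

```python
import os
import shutil
import subprocess
import sys

# One-liner run inside the *target* interpreter; it prints where that
# interpreter believes its libpython lives (assumed query, not Realm's).
QUERY = (
    "import os, sysconfig; "
    "print(os.path.join(sysconfig.get_config_var('LIBDIR') or '', "
    "sysconfig.get_config_var('LDLIBRARY') or ''))"
)


def guess_libpython(executable="python3"):
    """Ask the first `executable` on $PATH for its libpython location."""
    exe = shutil.which(executable)
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-c", QUERY], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for name in ("python3", sys.executable):
        path = guess_libpython(name)
        # The reported path is not guaranteed to exist on disk.
        exists = bool(path) and os.path.exists(path)
        print(f"{name}: {path} (exists: {exists})")
```

If the path reported for `python3` differs from the one reported for the environment's own interpreter, or does not exist, that would match the failure mode seen in the backtrace.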

https://gitlab.com/StanfordLegion/legion/-/merge_requests/1360 proposes adding a backdoor, to allow the user to override this setting through an envvar, for cases where the detection goes wrong.

This will also be solved automatically after the planned move away from legion_python.

tylerjereddy commented 2 months ago

@manopapad thank you!

manopapad commented 2 months ago

https://gitlab.com/StanfordLegion/legion/-/merge_requests/1360 was merged, but it will take a few days for the change to percolate down to the cuNumeric builds. I would expect this to be available in ~2 weeks on a weekly cuNumeric build (we're still getting set up with weekly package builds).

manopapad commented 3 days ago

The fix for this has been included in the 24.06.01 patch release. You can now use REALM_PYTHON_LIB to set the location of the libpython.so.
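As a rough sketch of how the override might be used, the snippet below builds an environment in which `REALM_PYTHON_LIB` points at the libpython of the interpreter running the script. Only the variable name comes from this thread; the helper `env_with_realm_python_lib` and the choice of the `INSTSONAME`/`LDLIBRARY` sysconfig variables are assumptions for illustration.

```python
import os
import sysconfig


def env_with_realm_python_lib(base_env=None):
    """Return a copy of the environment with REALM_PYTHON_LIB set, if possible."""
    env = dict(os.environ if base_env is None else base_env)
    libdir = sysconfig.get_config_var("LIBDIR")
    # INSTSONAME is usually the shared library name on POSIX builds.
    soname = sysconfig.get_config_var("INSTSONAME") or sysconfig.get_config_var(
        "LDLIBRARY"
    )
    if libdir and soname and not soname.endswith(".a"):
        candidate = os.path.join(libdir, soname)
        # Only set the variable when the shared library really exists.
        if os.path.exists(candidate):
            env["REALM_PYTHON_LIB"] = candidate
    return env


if __name__ == "__main__":
    env = env_with_realm_python_lib()
    print("REALM_PYTHON_LIB =", env.get("REALM_PYTHON_LIB", "<not set>"))
```

The resulting dictionary could then be passed as `env=` to `subprocess.run` when starting the legate launcher, or the printed path exported in the shell before launching.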