pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0

`pex_binary` fails to locate `.so` files inside pex, `python_source` works #20205

Open tgolsson opened 11 months ago

tgolsson commented 11 months ago

Describe the bug

Users using pex_binary together with torch>=1.13 may encounter weird _dl_open errors in relation to CUDA libraries. The same lockfile, same code, works when using python_source instead. This indicates that there is some difference in how the pex and/or venv is built between these two cases that causes one to fail unexpectedly.

Example errors:

OSError: libnvJitLink.so.12: cannot open shared object file: No such file or directory

ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory

Workarounds

Pants version 2.18

OS Can only occur on Linux.

Additional info

Runnable repro here: https://github.com/tgolsson/pants-repros/tree/main/torch-2

Slack threads:

tgolsson commented 11 months ago

@jsirois this seems like a pex issue, based on importing the nvidia namespace package solving the issue. Do you have any insights to add? I'm going to guess the relevant files don't get extracted until something imports the nvidia package?

tgolsson commented 11 months ago

Unrelated to the primary issue here, anything pulling in these native libs also fails to run due to zipapp size limitations -- adding another fun layer of debugging why it can't find the __main__.py that's in the files. So execution_mode="venv" is also required.

jsirois commented 11 months ago

Using --venv is a well worn road at this point. Did you try that? I think this is execution_mode on pex_binary.

jsirois commented 11 months ago

Oh, you figured that out. So you answered your own question here it seems?

jsirois commented 11 months ago

@tgolsson in short, there are a lot of projects that should be using namespace packages but don't (correctly). The whole issue of namespace packages is sidestepped in a venv where there is effectively 1 sys.path entry. Since most users of most packages in the wild use venvs (or system packages with similar layout), these issues are often hidden until you use something like a zipapp execution mode PEX that uses many sys.path entries (1 per ~wheel). I assume this is the problem space here.

tgolsson commented 11 months ago

execution_mode="venv" on its own is not enough; the package also has to be imported. execution_mode="venv" is required due to the size, but it doesn't help find the .so.

jsirois commented 11 months ago

Unrelated to the primary issue here, anything pulling in these native libs also fails to run due to zipapp size limitations -- adding another fun layer of debugging why it can't find the __main__.py that's in the files. So execution_mode="venv" is also required.

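Two things here:

  1. Latest Pex will detect the size issue at build time and warn you (https://github.com/pantsbuild/pex/releases/tag/v2.1.148). Pants may be getting in the way here and hiding the warning or may not be using latest Pex or both.
  2. The execution mode should have 0 to do with size limits. Only the layout could influence issues there. Execution mode and layout are orthogonal axes that can be exercised fully independently.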

jsirois commented 11 months ago

@tgolsson did you try execution_mode="venv" + https://www.pantsbuild.org/docs/reference-pex_binary#codevenv_site_packages_copiescode?

tgolsson commented 11 months ago

Latest Pex will detect the size issue at build time and warn you (https://github.com/pantsbuild/pex/releases/tag/v2.1.148). Pants may be getting in the way here and hiding the warning or may not be using latest Pex or both.

That's great, we'll see how pants behaves. Not sure which version is in 2.18 at the moment.

The execution mode should have 0 to do with size limits. Only the layout could influence issues there. Execution mode and layout are orthogonal axes that can be exercised fully independently.

Ah, great. I tricked myself into thinking it'd have the size limits imposed by zipapp since that's what Pants calls the other option.

@tgolsson did you try execution_mode="venv" + https://www.pantsbuild.org/docs/reference-pex_binary#codevenv_site_packages_copiescode?

That also works, thanks. I don't quite understand why. It definitely finds libcufft at least once correctly even without either solution, but then fails. So I'm wondering if one method uses sys.path and another uses heuristics, and one of those fails.

jsirois commented 11 months ago

This is something I really hate about Pants. It hides everything. The best way to see what's going on is to use Pex directly with all these options and inspect the PEX_ROOT to see the layout differences. Packed & venv with symlinks are both hacks added to help Pants deal with its poorly performing sandboxing and space leaky behavior.

jsirois commented 11 months ago

So @tgolsson can this be closed as an answered question?

tgolsson commented 11 months ago

Gotcha. I can replicate the issue with the following command:

pex --venv \
    --layout $layout \
    -vvv \
    --python python3.9 \
    --lock locks/torch2.pexlock \
    --no-compress 'torch>=2' \
    -o cuda.pex \
    -D cmd/ \
    -m cuda

Interestingly, once I've run that once with --venv-site-packages-copies it keeps working even if I delete and recreate the cuda.pex directory. Before that, nothing works. Deleting ~/.pex breaks it again. Also, the import nvidia hack doesn't seem to affect anything in this situation. Furthermore, loose works immediately, and packed then also works afterwards. Running packed first and then loose means both are broken (and remain so). So I don't think I quite understand what happens.

In both situations (loose vs packed first) the .so's are in the same location in cuda.pex, and look to be real files.

tgolsson commented 11 months ago

I.e,

But

jsirois commented 11 months ago

@tgolsson you're not using the equivalent --venv-site-packages-copies. I can't remember the default. In short though, nothing except --venv --venv-site-packages-copies should be expected to work with finicky libs since that combo, and only that combo, emulates a normal venv ~exactly.

tgolsson commented 11 months ago

As I noted a bit above using that setting from both pants and pex CLI seems to work, so not quite sure what you mean -- I'll check a bit more tomorrow and see if I'm missing something in the docs.

I think I confused both myself and this thread because I thought I was observing various fixes and non-fixes because of the "stickiness" of --layout packed and --layout loose when running the generated pex. I'd expect independent invocations to have independent outcomes, but it seems like I have to clean ~/.pex between switching them -- or use --venv-site-packages-copies. Maybe I'm looking for a link in the output when I should be investigating the layout of the cache and how that reacts to --layout?

jsirois commented 11 months ago

Yeah, the layout (build time structure) should be a completely independent variable from the --venv related flags. Those flags dictate the runtime structure in the ~/.pex/ cache.

tgolsson commented 11 months ago

Great, thanks for your patience. I probably understand even less of Pex's actual functionality than I thought.

So what I'm seeing is that two different layouts end up in the same cache position, but they behave differently. I think that gives me a much better idea of how to investigate this further. On reading back, I realize I misread "PEX_ROOT" as the root of the pex, whereas it is the "cache directory" it runs from.

jsirois commented 11 months ago

Ok. When you do get around to sussing out the details, just keep in mind that it is expected that the nvidia packages only work when both --venv and --venv-site-packages-copies are specified. For any other arrangement, if things work at all, it's complete luck; so that is not really fixable. The only answer is to use both of those settings because nvidia packages things wrong here and presents multiple distributions with a top-level nvidia/__init__.py that is empty. That ensures the 1st one imported wins and now owns the nvidia package. No other dist's subpackages of nvidia will be seen. Instead, nvidia should either be using PEP-420 implicit namespace packages (no nvidia/__init__.py in each dist, just nvidia/<subpackage>/...) or else a non-empty nvidia/__init__.py that has the magic namespace declaration lines.
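The "1st imported wins" behavior described above can be reproduced without any nvidia wheels. The following is a minimal sketch, not anything Pex itself runs: it builds two fake unpacked dists in a temp directory, puts each on its own sys.path entry, and reports which subpackages are importable. The top-level name `fakevendor` and the helper names `make_dist` and `visible_subpackages` are hypothetical stand-ins chosen for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

TOP_LEVEL = "fakevendor"  # hypothetical stand-in for the real "nvidia" package


def make_dist(root: Path, name: str, subpackage: str, with_init: bool) -> Path:
    """Create a fake unpacked dist at root/name containing TOP_LEVEL/<subpackage>/."""
    dist = root / name
    pkg = dist / TOP_LEVEL / subpackage
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    if with_init:
        # Empty top-level __init__.py, mirroring what the thread found in the wheels.
        (dist / TOP_LEVEL / "__init__.py").write_text("")
    return dist


def visible_subpackages(with_init: bool) -> list:
    """Place two fake dists on separate sys.path entries; return importable subpackages."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        dists = [
            make_dist(root, "dist_a", "cuda_cupti", with_init),
            make_dist(root, "dist_b", "cuda_nvrtc", with_init),
        ]
        saved_path = list(sys.path)
        sys.path[:0] = [str(d) for d in dists]
        importlib.invalidate_caches()
        found = []
        try:
            for sub in ("cuda_cupti", "cuda_nvrtc"):
                try:
                    importlib.import_module(f"{TOP_LEVEL}.{sub}")
                    found.append(sub)
                except ImportError:
                    pass
        finally:
            # Restore interpreter state so repeated calls are independent.
            sys.path[:] = saved_path
            for mod in [m for m in sys.modules
                        if m == TOP_LEVEL or m.startswith(TOP_LEVEL + ".")]:
                del sys.modules[mod]
            importlib.invalidate_caches()
        return found


if __name__ == "__main__":
    print("empty __init__.py in each dist:", visible_subpackages(with_init=True))
    print("PEP 420 (no __init__.py):     ", visible_subpackages(with_init=False))
```

With the empty `__init__.py` present in both dists, only the subpackage from the first sys.path entry imports; with it removed, both do. Copying both dists into a single directory, which is roughly what a venv with site-packages copies amounts to, sidesteps the problem because there is only one `fakevendor/` directory left.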

jsirois commented 11 months ago

@tgolsson I'm still a bit confused where this issue stands. Let me know if anything is unclear or you have a more explicit failing repro with full pex command lines and step sequence ordering.

tgolsson commented 11 months ago

I haven't had time to investigate this more. I think in the interest of improving torch support I want to act on it somehow, but I don't quite know where to start - diagnostics, special handling, ...? No tool has solved it well so far, and the broken Nvidia packages now add fuel to that fire.

I'm not sure if there's anything Pex can do beyond maybe detecting and barfing on malformed namespaces? On the other hand, I'm also not sure how pex + dlopen could ever work without --venv at a minimum. I guess for most .so's that occur in a Pex, the native component gets loaded correctly by the Python import machinery.

jsirois commented 11 months ago

I'm not sure if there's anything Pex can do beyond maybe detecting and barfing on malformed namespaces?

I'm not sure. I already bend over backwards for pkg_resources-style namespace packages with no setuptools dep by providing that dep and warning. Let me see if anything could be done for the nvidia case and report back.

On the other hand, I'm also not sure how pex + dlopen could ever work without --venv at a minimum.

Even when a PEX is fully (default) or partially (--layout packed) zipped at build time, at runtime, the only code that ever runs out of the zip / through a zipimporter is the PEX .bootstrap/. That code immediately unzips the PEX under the PEX_ROOT if it hasn't already and re-execs from there (possibly re-execing a second time if it needs to create a --venv under the PEX_ROOT). As such, user code and third party code are always run unzipped no matter the PEX build time layout or run time layout. The easiest way to see this is to perhaps run a PEX with PEX_VERBOSE=1. The modified sys.path will be printed towards the end of the boot process. For example:

# Using Pex latest:
$ pex -V
2.1.153

# Setup an app with 1st and 3rd party code:
$ cat src/exe.py
import cowsay; cowsay.tux("Moo!")
$ pex -Dsrc -mexe cowsay -oexample.pex
$ pex -Dsrc -mexe cowsay --layout packed -oexample-packed.pex
$ pex -Dsrc -mexe cowsay --layout loose -oexample-loose.pex
$ pex -Dsrc -mexe cowsay --venv -oexample-venv.pex
$ pex -Dsrc -mexe cowsay --layout packed --venv -oexample-packed-venv.pex
$ pex -Dsrc -mexe cowsay --layout loose --venv -oexample-loose-venv.pex

# Observe re-execs and final runtime sys.path layout - all entries of interest loose (not zipped):
$ PEX_VERBOSE=1 ./example.pex
pex: Laying out PEX zipfile /tmp/example/example.pex: 24.4ms
pex:   Installing /tmp/example/example.pex to /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1: 22.8ms
pex: Executing installed PEX for /tmp/example/./example.pex at /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1
pex:   Testing /tmp/example/example.venv/bin/python can resolve PEX at /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1: 2.3ms
pex: Re-executing: cmdline='/usr/bin/python3.10 /tmp/example/./example.pex', sys.executable='/tmp/example/example.venv/bin/python3.10', PEX_PYTHON=None, PEX_PYTHON_PATH=None, interpreter_constraints=InterpreterConstraints(constraints=())
pex: Laying out PEX zipfile /tmp/example/example.pex: 0.1ms
pex: Executing installed PEX for /tmp/example/./example.pex at /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1
...
pex: Activating PEX virtual environment from /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1: 2.0ms
pex: Bootstrap complete, performing final sys.path modifications...
pex: PYTHONPATH contains:
pex:     /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1
pex:   * /usr/lib/python310.zip
pex:     /usr/lib/python3.10
pex:     /usr/lib/python3.10/lib-dynload
pex:     /home/jsirois/.pex/installed_wheels/274b1e6fc1b966d53976333eb90ac94cb07a450a700b455af9fbdf882244b30a/cowsay-6.1-py3-none-any.whl
pex:     /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1/.bootstrap
pex:   * - paths that do not exist or will be imported via zipimport
  ____
| Moo! |
  ====
         \
          \
           \
            .--.
           |o_o |
           |:_/ |
          //   \ \
         (|     | )
        /'\_   _/`\
        \___)=(___/
$ ls -l /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1
total 24
-rw-r--r-- 1 jsirois jsirois  758 Nov 24 07:40 PEX-INFO
-rw-r--r-- 1 jsirois jsirois    6 Nov 24 07:40 PEX-LAYOUT
-rwxr-xr-x 1 jsirois jsirois 3774 Nov 24 07:40 __main__.py
lrwxrwxrwx 1 jsirois jsirois   64 Nov 24 07:40 __pex__ -> ../../user_code/72ac3fc418bf3e1ef42332dc4511aa806cef3c09/__pex__
drwxr-xr-x 2 jsirois jsirois 4096 Nov 24 07:40 __pycache__
lrwxrwxrwx 1 jsirois jsirois   63 Nov 24 07:40 exe.py -> ../../user_code/72ac3fc418bf3e1ef42332dc4511aa806cef3c09/exe.py
$ ls -l /home/jsirois/.pex/installed_wheels/274b1e6fc1b966d53976333eb90ac94cb07a450a700b455af9fbdf882244b30a/cowsay-6.1-py3-none-any.whl
total 8
drwxr-xr-x 4 jsirois jsirois 4096 Nov 24 07:40 cowsay
drwxr-xr-x 2 jsirois jsirois 4096 Nov 24 07:38 cowsay-6.1.dist-info

# And notice a packed build time layout ends up executing in the exact same runtime layout / sys.path configuration.
$ PEX_VERBOSE=1 ./example-packed.pex/__main__.py
pex: Laying out Spread PEX directory /tmp/example/example-packed.pex: 0.1ms
pex: Executing installed PEX for /tmp/example/./example-packed.pex at /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1
pex:   Testing /tmp/example/example.venv/bin/python can resolve PEX at /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1: 1.8ms
pex: Re-executing: cmdline='/usr/bin/python3.10 /tmp/example/./example-packed.pex', sys.executable='/tmp/example/example.venv/bin/python3.10', PEX_PYTHON=None, PEX_PYTHON_PATH=None, interpreter_constraints=InterpreterConstraints(constraints=())
pex: Laying out Spread PEX directory /tmp/example/example-packed.pex: 0.1ms
pex: Executing installed PEX for /tmp/example/./example-packed.pex at /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1
...
pex: Activating PEX virtual environment from /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1: 1.9ms
pex: Bootstrap complete, performing final sys.path modifications...
pex: PYTHONPATH contains:
pex:     /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1
pex:   * /usr/lib/python310.zip
pex:     /usr/lib/python3.10
pex:     /usr/lib/python3.10/lib-dynload
pex:     /home/jsirois/.pex/installed_wheels/274b1e6fc1b966d53976333eb90ac94cb07a450a700b455af9fbdf882244b30a/cowsay-6.1-py3-none-any.whl
pex:     /home/jsirois/.pex/unzipped_pexes/f8289db88a810cde8b5194be3fcb7f65c674cea1/.bootstrap
pex:   * - paths that do not exist or will be imported via zipimport
  ____
| Moo! |
  ====
         \
          \
           \
            .--.
           |o_o |
           |:_/ |
          //   \ \
         (|     | )
        /'\_   _/`\
        \___)=(___/
jsirois commented 11 months ago

So, using your example repro repo lock file I downloaded two of the nvidia wheels and found:

$ zipinfo nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl | head -4
Archive:  nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl
Zip file size: 14109015 bytes, number of entries: 45
-rw-r--r--  2.0 unx        0 b- defN 23-Apr-04 01:04 nvidia/__init__.py
-rw-r--r--  2.0 unx        0 b- defN 23-Apr-04 01:04 nvidia/cuda_cupti/__init__.py
$ zipinfo nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl | tail -7
-rw-r--r--  2.0 unx   912728 b- defN 23-Apr-04 01:04 nvidia/cuda_cupti/lib/libpcsamplingutil.so
-rw-r--r--  2.0 unx    59262 b- defN 23-Apr-04 01:04 nvidia_cuda_cupti_cu12-12.1.105.dist-info/License.txt
-rw-r--r--  2.0 unx     1553 b- defN 23-Apr-04 01:04 nvidia_cuda_cupti_cu12-12.1.105.dist-info/METADATA
-rw-r--r--  2.0 unx      106 b- defN 23-Apr-04 01:04 nvidia_cuda_cupti_cu12-12.1.105.dist-info/WHEEL
-rw-r--r--  2.0 unx        7 b- defN 23-Apr-04 01:04 nvidia_cuda_cupti_cu12-12.1.105.dist-info/top_level.txt
?rw-rw-r--  2.0 unx     4508 b- defN 23-Apr-04 01:04 nvidia_cuda_cupti_cu12-12.1.105.dist-info/RECORD
45 files, 45439517 bytes uncompressed, 14101613 bytes compressed:  69.0%

$ zipinfo nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl | head -4
Archive:  nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl
Zip file size: 23671734 bytes, number of entries: 12
-rw-r--r--  2.0 unx        0 b- defN 23-Apr-04 04:06 nvidia/__init__.py
-rw-r--r--  2.0 unx        0 b- defN 23-Apr-04 04:06 nvidia/cuda_nvrtc/__init__.py
$ zipinfo nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl | tail -7
-rw-r--r--  2.0 unx 56875328 b- defN 23-Apr-04 04:06 nvidia/cuda_nvrtc/lib/libnvrtc.so.12
-rw-r--r--  2.0 unx    59262 b- defN 23-Apr-04 04:06 nvidia_cuda_nvrtc_cu12-12.1.105.dist-info/License.txt
-rw-r--r--  2.0 unx     1507 b- defN 23-Apr-04 04:06 nvidia_cuda_nvrtc_cu12-12.1.105.dist-info/METADATA
-rw-r--r--  2.0 unx      106 b- defN 23-Apr-04 04:06 nvidia_cuda_nvrtc_cu12-12.1.105.dist-info/WHEEL
-rw-r--r--  2.0 unx        7 b- defN 23-Apr-04 04:06 nvidia_cuda_nvrtc_cu12-12.1.105.dist-info/top_level.txt
?rw-rw-r--  2.0 unx     1109 b- defN 23-Apr-04 04:06 nvidia_cuda_nvrtc_cu12-12.1.105.dist-info/RECORD
12 files, 63813767 bytes uncompressed, 23669828 bytes compressed:  62.9%

So neither declares namespace_packages.txt but both claim nvidia/__init__.py with a 0-byte file. As such, in isolation, they are perfectly fine and normal packages. You can only detect the issue when you have the final list of packages being used together. If these two are in the list and housed in separate sys.path entries, then, and only then, can you say there will be an issue. I think that means the PEX runtime would have to do something like:

  1. Perform layout and sys.path setup as it does today during PEX boot.
  2. Just before handing off control to the user entry point, scan both the user code and 3rdparty sys.path entries for packages, categorizing each as one of: A) no __init__.py, B) __init__.py with pkg_resources namespace decl + possibly other content, C) __init__.py with pkgutil namespace decl + possibly other content, D) __init__.py with no namespace decl
  3. Fail fast for any package with multiple entries if any one of them is type D, or if the entries do not all match as all A, all B or all C.
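The scan in steps 2 and 3 could look roughly like the sketch below. This is an illustration under stated assumptions, not Pex code: it only inspects top-level directories of each path entry, and it detects the pkg_resources and pkgutil namespace declarations with simple regexes (`declare_namespace(` and `extend_path(`), which is a heuristic. The names `classify` and `find_conflicts` are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Heuristic markers for the two legacy namespace declaration styles (an assumption,
# not how Pex detects them).
PKG_RESOURCES_DECL = re.compile(r"declare_namespace\s*\(")
PKGUTIL_DECL = re.compile(r"extend_path\s*\(")


def classify(package_dir: Path) -> str:
    """Return the category letter (A, B, C or D) from the list above."""
    init = package_dir / "__init__.py"
    if not init.is_file():
        return "A"  # no __init__.py: PEP 420 implicit namespace portion
    text = init.read_text(encoding="utf-8", errors="replace")
    if PKG_RESOURCES_DECL.search(text):
        return "B"  # pkg_resources-style namespace declaration
    if PKGUTIL_DECL.search(text):
        return "C"  # pkgutil-style namespace declaration
    return "D"  # regular package with no namespace declaration


def find_conflicts(path_entries):
    """Map each top-level package split across entries to its (entry, category) hits."""
    seen = defaultdict(list)
    for entry in path_entries:
        root = Path(entry)
        if not root.is_dir():
            continue  # skip zip files and paths that do not exist
        for child in sorted(root.iterdir()):
            if child.is_dir() and child.name.isidentifier():
                seen[child.name].append((str(root), classify(child)))
    conflicts = {}
    for name, hits in seen.items():
        kinds = {kind for _, kind in hits}
        # Conflict: present in several entries and either some entry is type D
        # or the entries disagree on the namespace style.
        if len(hits) > 1 and ("D" in kinds or len(kinds) > 1):
            conflicts[name] = hits
    return conflicts
```

Calling `find_conflicts(sys.path)` just before handing off to the entry point would give the fail-fast input for step 3. Whether this can be made free of false positives is exactly the open question raised in the comment.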

It's not immediately clear if I can do this robustly, but it seems possible - i.e. I will never fail fast due to a false positive, which would be horrible.

This is at runtime though, and so the failure would have happened anyhow a handful of milliseconds later, just less scrutably. I know of no great way to do this analysis at build time. It really requires knowing which interpreter will run the PEX and thus being able to pre-calc the subset of activated dists in the PEX (PEXes are fundamentally multiplatform and, so far, I do not special case single platform, single interpreter PEXes). So it really is a question of whether the time tradeoff + complexity + nicheness of adding this runtime check is all worth it. Clearly the best answer is fixing Nvidia packages. Even better is fixing PyPA such that these packaging errors are not possible. Neither is likely to happen. So you're left with personal knowledge / debugging which I always favor as a 1st line of defense and finally having an individual tool like Pex try to impart that personal knowledge in a canned way. Since using a non-venv mode PEX itself is niche - you only ever want to do this if 1st run cold startup time is important - and the nvidia bad behavior is a compounding layer of niche bad behavior and you need both layers to be here - my inclination is to not prioritize adding this runtime checking to Pex. I'm not opposed to adding support for it as an option though if you or someone else is willing to invest the time and effort.

kaos commented 11 months ago

I'm currently pondering on a feature for pants to deduce module mappings based off of a lockfile, and as such I think it would be possible to analyze this scenario and warn about packages with conflicting non-namespace modules. (for the current platform at least.)

jsirois commented 11 months ago

@kaos have you determined how to avoid false positives? I.E.: the case where the artifacts that could be a problem never are because they are never actually used together? I think Pants tends to favor warning falsely over the annoyance to folks that know what they're doing; so perhaps it's not a concern. It is a concern for me with Pex.

kaos commented 11 months ago

@jsirois ah, good point, I'll keep that in mind. I certainly don't want pants to warn unless it's a real thing. I also haven't thought much about this particular use case as I just now made the connection here on this issue. But as pants will be aware of the set of artifacts used together, I think we could get this down pretty well, and if false positives are a problem to get to grips with, one alternative could be to offer it in a kind of "run diagnostics mode" where it doesn't warn, merely points out potential issues for the user to consider when explicitly asked to do so.

jsirois commented 11 months ago

Yeah, an explicit diagnostic mode makes sense. Knowing which artifacts are used together though is hard. You really do have to perform a runtime equivalent evaluation of environment markers and requires Python clauses (transitively) to get the actual runtime set that will be used together. And you will have to do that for every possible runtime set even when there is no local representative.

kaos commented 11 months ago

Knowing which artifacts are used together though is hard.

Right. I completely overlooked transitive dependencies.

huonw commented 7 months ago

(I've updated the original description with the specific errors, including an example of something similar from a recent thread https://pantsbuild.slack.com/archives/C046T6T9U/p1710186504531719, as well as referencing the --venv --venv-site-packages-copies work around.)

tgolsson commented 7 months ago

Thanks @huonw!