pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0
3.34k stars 640 forks source link

Python import parser does not support subprocess_environment or some other way to leak env vars. #16565

Open jsirois opened 2 years ago

jsirois commented 2 years ago

The code has never used a PexProcess or VenvPexProcess or manually plumbed SubprocessEnvironment or else its own bespoke env var leaking option: https://github.com/pantsbuild/pants/blob/37bf6071b9dea6074176cbc67e97cdb70fccb71d/src/python/pants/backend/python/dependency_inference/parse_python_dependencies.py#L88-L106

This results in situations where other Pants rules can use a given Python interpreter but the dependency inference rules cannot when those other rules have the benefit of --subprocess-environment-env-vars='+["LD_LIBRARY_PATH"]' - for example. This particular example comes up when trying to use GH actions provided Pythons where they are not statically linked Python and use a .so and that .so is not in the default linker cache / linker search path.

jsirois commented 2 years ago

The more interesting thing here @stuhood is that a data dependency (on a binary) has an implicit data dependency on the env vars (or at least 1 of them, LD_LIBRARY_PATH) exposed to the process that found the binary. Without that LD_LIBRARY_PATH satisfied, the downstream recipient of the found Python binary cannot use it. As things stand, humans carefully or un-knowingly arrange for the implicit dependency to be satisfied unlike most other things engine. Not sure if there is anything clever we can do here. If we engineered a mechanism to propagate env vars in a growing set down to dependent subprocesses, that would be too broad a brush in general although it would prevent this sort of issue in general. Perhaps we're stuck with being careful.

stuhood commented 2 years ago

The more interesting thing here @stuhood is that a data dependency (on a binary) has an implicit data dependency on the env vars (or at least 1 of them, LD_LIBRARY_PATH) exposed to the process that found the binary. Without that LD_LIBRARY_PATH satisfied, the downstream recipient of the found Python binary cannot use it. As things stand, humans carefully or un-knowingly arrange for the implicit dependency to be satisfied unlike most other things engine. Not sure if there is anything clever we can do here.

Yea, this is tricky. It's definitely not implicit that if Process 1 has consumed something from an environment variable, that Process 2 downstream will also need to consume it.

This results in situations where other Pants rules can use a given Python interpreter but the dependency inference rules cannot when those other rules have the benefit of --subprocess-environment-env-vars='+["LD_LIBRARY_PATH"]' - for example.

The tack that I have been trying to encourage for the "options which declare environment variables which need to be leaked" pattern is to attempt to make them use-case or tool specific: "all subprocesses" is just too broad.

This ticket represents a really good example of that I think: in this case, the tool that requires a leaked environment variable is the Python interpreter. So as a strawimpl, you could imagine having --python-interpreter-env-vars, which would be used in any situation where we expected to be able to use (a particular...?) Python interpreter.

And this relates to #12203 as well. Similar to a PATH entry, the filesystem content of the value behind the env-var value is relevant. So the LD_LIBRARY_PATH variable would either need to be:

  1. deeply fingerprinted in an oblivous way
  2. explicitly supported in some way (i.e.: we install logic which fingerprints the LD_LIBRARY_PATH in a "particular library version-aware way"... essentially a declaration that no other data behind that env var is relevant)
huonw commented 1 year ago

The original problem (with python import parsing) might be obsoleted by the new Rust-based parser https://github.com/pantsbuild/pants/pull/18854, once that fully replaces the Python-based one. Once that's the case, the docs in #18900 referencing this issue could probably be removed too.

That does nothing for the broader problem of propagating env var dependencies, of course.

kaos commented 1 year ago

Would it perhaps make sense to also capture which env vars to leak in the BinaryPath result? https://github.com/pantsbuild/pants/blob/a4436e3b427c91c6981a220a20b565b5598744e6/src/python/pants/core/util_rules/system_binaries.py#L46-L48

If the request indicates which env vars are potentially relevant, those could be leaked into the result if defined. Then a call site requesting a search for a Python binary could include LD_LIBRARY_PATH to be included in the resulting BinaryPath which may then be used in subsequent process requests for that binary. That way it's configurable per tool which env vars to plumb through for..

thejcannon commented 1 year ago

I think, for this case, what the root problem is is that plugin authors can ask for a Python executable, and in most (every?) case they shouldn't be using it. With that in mind maybe we make it harder/impossible to do so?

thejcannon commented 1 year ago

FYI I just fixed this for Django dep inference as well: https://github.com/pantsbuild/pants/pull/19008