tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

TF 2.15 fails to build with error "env: 'python3': No such file or directory" from bazel py_strict_library. #62497

Open trevor-m opened 11 months ago

trevor-m commented 11 months ago

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

2.15

Custom code

No

OS platform and distribution

Manylinux 2.28 (AlmaLinux 8) quay.io/pypa/manylinux_2_28

Mobile device

No response

Python version

3.10

Bazel version

6.1.0

GCC/compiler version

11.2.1

CUDA/cuDNN version

12.3

GPU model and memory

No response

Current behavior?

Compiling TF 2.15 from source fails with the error env: 'python3': No such file or directory coming from all instances of py_strict_library or pytype_strict_library.

Based on the commands that bazel prints with --verbose_failures, bazel uses env - to clear the environment, so python3 can no longer be found because it is no longer on the PATH. For example:

[root@2ac299be24a6 tensorflow]# exec env - python3
env: 'python3': No such file or directory
[root@1aa528794d75 tensorflow]# env - bash -c 'which python3'
which: no python3 in ((null))
[root@1aa528794d75 tensorflow]# bash -c 'which python3'
/opt/python/v/bin/python3
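
The same effect can be checked from Python. This is a minimal sketch and not part of the original report: `path_in_cleared_env` is a hypothetical helper that starts a child interpreter with an empty environment (roughly what `env -` does) and reports which PATH the child sees.

```python
import subprocess
import sys


def path_in_cleared_env() -> str:
    """Return the PATH seen by a child process started with an empty environment."""
    # env={} mimics `env -`: the child inherits no variables from this process.
    # sys.executable is normally an absolute path, so launching it needs no PATH lookup.
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('PATH', '<unset>'))"],
        env={},
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(path_in_cleared_env())
```

If the child reports that PATH is unset, any later lookup of a bare `python3` has to rely on the system's fallback search path rather than on the interpreter selected for the build.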

Standalone code to reproduce the issue

It can be reproduced by building TF from source. I'm using the container `quay.io/pypa/manylinux_2_28`.

Relevant log output

ERROR: /opt/tensorflow/tensorflow-source/tensorflow/python/util/BUILD:383:18: Extracting tensorflow APIs for //tensorflow/python/util:tf_decorator to bazel-out/k8-opt/bin/tensorflow/python/util/tf_decorator_extracted_tensorflow_api.json. failed: (Exit 127): main failed: error executing command (from target //tensorflow/python/util:tf_decorator)
  (cd /root/.cache/bazel/_bazel_root/a8fc6d0749b4f3c43761726a36e8ec4c/execroot/org_tensorflow && \
  exec env - \
  bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/python/tools/api/generator2/extractor/main --output bazel-out/k8-opt/bin/tensorflow/python/util/tf_decorator_extracted_tensorflow_api.json --decorator tensorflow.python.util.tf_export.tf_export --api_name tensorflow tensorflow/python/util/tf_contextlib.py tensorflow/python/util/tf_decorator.py tensorflow/python/util/tf_inspect.py)
# Configuration: f8e9df02b24a37687b60048a360df004e0c5cb673a184a2d96618507db49ca2c
# Execution platform: @local_execution_config_platform//:platform
env: 'python3': No such file or directory
ERROR: /opt/tensorflow/tensorflow-source/tensorflow/python/distribute/BUILD:214:18: Extracting tensorflow APIs for //tensorflow/python/distribute:distribute_config to bazel-out/k8-opt/bin/tensorflow/python/distribute/distribute_config_extracted_tensorflow_api.json. failed: (Exit 127): main failed: error executing command (from target //tensorflow/python/distribute:distribute_config)
  (cd /root/.cache/bazel/_bazel_root/a8fc6d0749b4f3c43761726a36e8ec4c/execroot/org_tensorflow && \
  exec env - \
  bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/python/tools/api/generator2/extractor/main --output bazel-out/k8-opt/bin/tensorflow/python/distribute/distribute_config_extracted_tensorflow_api.json --decorator tensorflow.python.util.tf_export.tf_export --api_name tensorflow tensorflow/python/distribute/distribute_config.py)
# Configuration: f8e9df02b24a37687b60048a360df004e0c5cb673a184a2d96618507db49ca2c
# Execution platform: @local_execution_config_platform//:platform
env: 'python3': No such file or directory
Target //tensorflow/tools/pip_package:build_pip_package failed to build
ERROR: /opt/tensorflow/tensorflow-source/tensorflow/tools/pip_package/BUILD:255:10 Middleman _middlemen/tensorflow_Stools_Spip_Upackage_Sbuild_Upip_Upackage-runfiles failed: (Exit 127): main failed: error executing command (from target //tensorflow/python/util:tf_decorator)
  (cd /root/.cache/bazel/_bazel_root/a8fc6d0749b4f3c43761726a36e8ec4c/execroot/org_tensorflow && \
  exec env - \
  bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/python/tools/api/generator2/extractor/main --output bazel-out/k8-opt/bin/tensorflow/python/util/tf_decorator_extracted_tensorflow_api.json --decorator tensorflow.python.util.tf_export.tf_export --api_name tensorflow tensorflow/python/util/tf_contextlib.py tensorflow/python/util/tf_decorator.py tensorflow/python/util/tf_inspect.py)
# Configuration: f8e9df02b24a37687b60048a360df004e0c5cb673a184a2d96618507db49ca2c
# Execution platform: @local_execution_config_platform//:platform
INFO: Elapsed time: 2272.714s, Critical Path: 496.44s
INFO: 9446 processes: 242 internal, 9204 local.
FAILED: Build did NOT complete successfully

trevor-m commented 11 months ago

@angerson It seems like this might be related to hermetic python. Any thoughts?

haampie commented 11 months ago

Yeah, this is probably due to unsetting PATH while using a (generated?) shebang of the form #!/usr/bin/env python3.

It breaks build isolation since if it works it picks up system /bin/python3 or /usr/bin/python3 on Linux instead of the Python Tensorflow was instructed to use, see https://linux.die.net/man/3/execl:

> The file is sought in the colon-separated list of directory pathnames specified in the PATH environment variable. If this variable isn't defined, the path list defaults to the current directory followed by the list of directories returned by confstr(_CS_PATH). (This confstr(3) call typically returns the value "/bin:/usr/bin".)

Unsetting PATH may be fine but then execute env - <absolute path to python interpreter> ./script.py instead of env - ./script.py, or use the absolute path in the shebang (but note that has downsides too since the relevant executable may be in a long path, and Linux has a shebang line limit).
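
A rough Python sketch of the first suggestion, assuming the build knows the absolute path of its interpreter (here `sys.executable` stands in for it). `run_with_explicit_interpreter` is a hypothetical helper, not something from TensorFlow or bazel. It runs a script in an empty environment and never consults the script's shebang:

```python
import os
import subprocess
import sys
import tempfile


def run_with_explicit_interpreter(source: str) -> str:
    """Run Python source in an empty environment without relying on its shebang."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "script.py")
        with open(script, "w", encoding="utf-8") as f:
            # The shebang is written only to show that it is never consulted here.
            f.write("#!/usr/bin/env python3\n" + source)
        # Equivalent of: env - <absolute path to python interpreter> ./script.py
        result = subprocess.run(
            [sys.executable, script],
            env={},
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout.strip()


if __name__ == "__main__":
    print(run_with_explicit_interpreter("print('ran without PATH')\n"))
```

Because the interpreter is named explicitly, the run does not depend on whether a python3 exists in /bin or /usr/bin.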

angerson commented 11 months ago

What commands are you using to start a build? It's been working fine in our nightly and continuous tests on the tensorflow/build containers.

trevor-m commented 11 months ago

Hi @angerson, I'm using ./configure && bazel build -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=1 --java_runtime_version=remotejdk_11 tensorflow/tools/pip_package:build_pip_package.

This error only occurs with our manylinux build container, which does not have a python3 in the "unset PATH" directories that @haampie mentioned (/bin:/usr/bin/:/usr/local/bin). I believe the tensorflow/build containers are Ubuntu-based and have a system python3 in one of those directories, which these build rules are currently using inadvertently.
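
To see whether a given container would hit this, one can check the fallback directories for a python3. The sketch below is illustrative only: `python3_in_fallback_path` is a hypothetical helper, and the hard-coded "/bin:/usr/bin" default is an assumption taken from the man page quoted earlier in this thread.

```python
import os
import shutil
from typing import Optional


def python3_in_fallback_path(fallback: Optional[str] = None) -> Optional[str]:
    """Return the python3 found on the fallback search path, or None."""
    if fallback is None:
        try:
            # POSIX default search path, used when PATH is not set.
            fallback = os.confstr("CS_PATH")
        except (AttributeError, ValueError, OSError):
            fallback = None
        # Assumption: use the directories named in the man page if confstr fails.
        fallback = fallback or "/bin:/usr/bin"
    return shutil.which("python3", path=fallback)


if __name__ == "__main__":
    found = python3_in_fallback_path()
    print(found if found else "no python3 in the fallback directories")
```

A result of None would match the manylinux container described here, while a path such as /usr/bin/python3 would match the case where a system interpreter is silently picked up.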

> It breaks build isolation since if it works it picks up system /bin/python3 or /usr/bin/python3 on Linux instead of the Python Tensorflow was instructed to use, see https://linux.die.net/man/3/execl:

Yes, this appears to be exactly what's happening.

> Unsetting PATH may be fine but then execute env - <absolute path to python interpreter> ./script.py instead of env - ./script.py, or use the absolute path in the shebang (but note that has downsides too since the relevant executable may be in a long path, and Linux has a shebang line limit).

This makes sense. I think this change needs to be made in bazel? It sounds like py_binary should be setting up the command to use the hermetic python environment, and it is not.

haampie commented 11 months ago

I bisected it to 539673ead2b66a9c2dce3fb90e3767efda5deef5

539673ead2b66a9c2dce3fb90e3767efda5deef5 is the first bad commit
commit 539673ead2b66a9c2dce3fb90e3767efda5deef5
Author: Marc Fisher II <fisherii@google.com>
Date:   Fri Sep 8 09:27:41 2023 -0700

    Switch to using new API generation.

 ci/official/wheel_test/test_import_api_packages.py |  1 +
 tensorflow/BUILD                                   | 58 ++++++++--------------
 .../python/tools/api/generator2/generate_api.bzl   | 52 +++++++++++++++++--
 3 files changed, 69 insertions(+), 42 deletions(-)

Ping @DrMarcII

I don't know bazel well enough to quickly see how to solve it, let's leave that to googlers ;p

haampie commented 11 months ago

Revert of 539673ead2b66a9c2dce3fb90e3767efda5deef5 applies cleanly to 2.15, but then the build fails with

ImportError: _pywrap_tensorflow_internal.so: cannot open shared object file: No such file or directory

so more is necessary. If someone could take over to fix it that'd be great.

If you want to reproduce, run mv /usr/bin/python3 /usr/bin/python3.tmp and do an ordinary build with a different python.

BrianWieder commented 11 months ago

I think that this may be related to https://github.com/bazelbuild/rules_python/issues/691. 539673ead2b66a9c2dce3fb90e3767efda5deef5 added an aspect that runs a py_binary on each py_library. It looks like the py_binary bootstrap script currently has an implicit dependency on a system interpreter being installed.
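
One way to check this on a failing build is to look at the first line of the generated launcher (for example the extractor/main file named in the log above). The sketch below is an assumption-laden illustration: `read_shebang` and `relies_on_path_lookup` are hypothetical helpers, and this thread does not confirm what the launcher actually contains.

```python
import os
from typing import Optional


def read_shebang(path: str) -> Optional[str]:
    """Return the interpreter line of a script, or None if it has no shebang."""
    with open(path, "rb") as f:
        first_line = f.readline()
    if not first_line.startswith(b"#!"):
        return None
    return first_line[2:].decode("utf-8", errors="replace").strip()


def relies_on_path_lookup(shebang: Optional[str]) -> bool:
    """Return True if the shebang resolves its interpreter through env."""
    if not shebang:
        return False
    # "#!/usr/bin/env python3" splits into ["/usr/bin/env", "python3"].
    program = shebang.split()[0]
    return os.path.basename(program) == "env"
```

A launcher for which `relies_on_path_lookup` returns True needs a python3 that env can find, which is exactly what fails once the environment has been cleared.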

haampie commented 11 months ago

Can you set the shebang line to PYTHON_BIN_PATH? The linked issue mentions stubs for shebangs.

Or, if possible, invoke the script through the interpreter directly: $PYTHON_BIN_PATH script.py. This is more robust as it allows for longer paths to the python executable.

See https://www.in-ulm.de/~mascheck/various/shebang/#issues for reference.

> the length of the #! is much smaller than the maximum path length
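
A small sketch of the length concern. `shebang_fits` is a hypothetical helper, and the 127-byte default is an assumption based on the figure commonly cited for Linux; the real limit depends on the kernel version.

```python
def shebang_fits(interpreter: str, limit: int = 127) -> bool:
    """Check whether '#!' plus the interpreter path fits within the assumed limit."""
    line = "#!" + interpreter
    # The kernel limit is measured in bytes, so measure the encoded length.
    return len(line.encode("utf-8")) <= limit


if __name__ == "__main__":
    print(shebang_fits("/opt/python/v/bin/python3"))
    print(shebang_fits("/" + "very-long-directory-name/" * 10 + "bin/python3"))
```

Interpreters installed under deeply nested prefixes can exceed the limit, which is why passing the interpreter on the command line is the more robust option.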

adamjstewart commented 10 months ago

Any updates on this? Would be great to be able to build TF without assuming that /usr/bin/python3 exists.

haampie commented 9 months ago

This affects multiple package managers that don't have a /usr/bin/python3:

and possibly others. Can someone have a look at it?