tensorflow / java

Java bindings for TensorFlow
Apache License 2.0

arm64 and x86_64 linux: TF java full native builds are failing to find the native headers #544

Open snadampal opened 3 weeks ago

snadampal commented 3 weeks ago

System information

Describe the problem

TensorFlow Java source builds fail on an aarch64 Linux system because of missing native headers. Please let me know how it is built for the x86_64 Linux platform.

Based on my debugging so far, the dependency appears to come from this commit, which added a C API extension for custom gradient functions. It introduced the headers and .cc files listed below, which require several third_party libraries from native TensorFlow, but none of those Bazel workspaces are cloned.


tensorflow-core/tensorflow-core-native/src/main/native/org/tensorflow/internal/c_api$ 
tfj_gradients.h  tfj_gradients_impl.cc  tfj_graph.h  tfj_graph_impl.cc  tfj_scope.h  tfj_scope_impl.cc

I tried to manually clone the missing workspaces into the Bazel cache, but the cycle never ends: it is missing tsl, Eigen, ml_dtypes, absl, protobuf, and now the compiled headers for protobuf...

Provide the exact sequence of commands / steps that you executed before running into the problem

sudo apt-get install pkg-config ccache clang ant python3-pip swig git file wget unzip tar bzip2 gzip patch autoconf-archive autogen automake make cmake libtool bison flex perl nasm curl gfortran libasound2-dev freeglut3-dev libgtk2.0-dev libusb-dev zlib1g libffi-dev libbz2-dev zlib1g-dev

sudo apt install maven default-jdk

cd $HOME
mkdir bazel
cd bazel
wget https://github.com/bazelbuild/bazel/releases/download/6.5.0/bazel-6.5.0-linux-arm64
mv bazel-6.5.0-linux-arm64 bazel
chmod a+x bazel
export PATH=/home/ubuntu/bazel/:$PATH

# Build and install javacpp-presets.
# Clone the following forked repo to exclude the libraries that are not supported and not required
git clone https://github.com/snadampal/javacpp-presets.git
cd javacpp-presets
git checkout tfjava_aarch64
mvn install -Djavacpp.platform=linux-arm64 -Dmaven.javadoc.skip=true -X -T 16

# Build and install tensorflow java bindings
git clone https://github.com/tensorflow/java.git
cd java
git checkout v1.0.0-rc.1
mvn install -P native-build -Dbazel.build.flags='--verbose_failures -s --config=mkl_aarch64_threadpool' -X

snadampal commented 3 weeks ago

The issue is not specific to arm64. I see the same missing-headers issue on other platforms; I have reproduced it on linux-x86_64 as well, with Ubuntu 22.04. From the code, it looks like it happens on every platform. I have root-caused it to the dist_download step being skipped for the native build, even though dist_download is the step that sets up all the native headers required for the JavaCPP build. The non-native build works because the dist_download step executes there.


            <!--
              Download TensorFlow native libraries
                This will download the official Python distribution for the active platform, and extract the `tensorflow_cc` library
                from it so that we can generate the JavaCPP API bindings and distribute it as a JAR. This will be executed only
                when not building a full native build.
            -->
            <id>dist-download</id>
            <phase>initialize</phase>
            <goals>
              <goal>exec</goal>
            </goals>
            <configuration>
              <skip>${dist.download.skip}</skip> <!-- skipped when full native build is enabled -->
              <executable>bash</executable>
              <arguments>
                <argument>scripts/dist_download.sh</argument>
                <argument>${dist.download.folder}</argument>
              </arguments>
              <environmentVariables>
                <PLATFORM>${native.classifier}</PLATFORM>
              </environmentVariables>
              <workingDirectory>${project.basedir}</workingDirectory>
            </configuration>
          </execution>
        </executions>
      </plugin>

The backtrace:

[INFO] g++ -I/home/ubuntu/java/tensorflow-core/tensorflow-core-native/src/main/native/org/tensorflow/internal/c_api -I/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow -I/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow/third_party/xla/third_party/tsl -I/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/execroot/tensorflow_java/bazel-out/k8-opt/bin/external/org_tensorflow -I/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/com_google_protobuf/src -I/usr/lib/jvm/java-11-openjdk-amd64/include -I/usr/lib/jvm/java-11-openjdk-amd64/include/linux /home/ubuntu/java/tensorflow-core/tensorflow-core-native/target/native/org/tensorflow/internal/c_api/linux-x86_64/jnitensorflow.cpp /home/ubuntu/java/tensorflow-core/tensorflow-core-native/target/native/org/tensorflow/internal/c_api/linux-x86_64/jnijavacpp.cpp -march=x86-64 -m64 -O3 -s -std=c++17 -Wl,-rpath,$ORIGIN/ -Wl,-z,noexecstack -Wl,-Bsymbolic -Wall -fPIC -pthread -shared -o libjnitensorflow.so -L/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/execroot/tensorflow_java/bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow -Wl,-rpath,/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/execroot/tensorflow_java/bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow -ltensorflow_framework -ltensorflow_cc 
In file included from /home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow/third_party/xla/third_party/tsl/tsl/c/tsl_status_internal.h:19,
                 from /home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow/tensorflow/c/tf_status_internal.h:19,
                 from /home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow/tensorflow/c/c_api_internal.h:32,
                 from /home/ubuntu/java/tensorflow-core/tensorflow-core-native/src/main/native/org/tensorflow/internal/c_api/tfj_graph_impl.cc:18,
                 from /home/ubuntu/java/tensorflow-core/tensorflow-core-native/src/main/native/org/tensorflow/internal/c_api/tfj_graph.h:31,
                 from /home/ubuntu/java/tensorflow-core/tensorflow-core-native/target/native/org/tensorflow/internal/c_api/linux-x86_64/jnitensorflow.cpp:115:
/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow/third_party/xla/third_party/tsl/tsl/platform/status.h:28:10: fatal error: absl/base/attributes.h: No such file or directory
   28 | #include "absl/base/attributes.h"

Craigacp commented 3 weeks ago

We modified where it's looking for the headers just before the rc1 release to fix this kind of issue. I tested it on macOS, and I thought I had tested it on a few Linuxes as well. I'll rerun the Linux build to see what's going on.

Craigacp commented 3 weeks ago

So it looks like the problem is that we used to get the absl headers from Bazel, but something has changed in the TF build process so it's not putting the absl repo in the bazel-tensorflow-core-native folder like it used to. We'd missed this because the clean is inconsistent between bazel & non-bazel builds.

snadampal commented 3 weeks ago

Hi @Craigacp, it's not just absl; several other packages are missing too, such as Eigen, ml_dtypes, and protobuf. They exist in the repo, but the workspaces are not cloned.

Craigacp commented 3 weeks ago

I can replicate this, but we couldn't replicate it on Karl's machine, even after a clean of Bazel. Both machines are running macOS 14.5 with the latest Xcode and the same version of Bazel, so I'm pretty confused as to what's causing the issue.

snadampal commented 3 weeks ago

I'm surprised that it works at all in that case: where is it getting all the absl/Eigen/ml_dtypes headers from? Checking the include paths used for the libjnitensorflow.cpp compilation might give some clue. By the way, it fails consistently on Linux.
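
One way to compare the working and failing machines is to take the logged g++ command line and check which of its -I directories actually contain a given header. The following is only a rough sketch: `include_dirs_from_cmd` and `find_header` are hypothetical helper names, not part of the TensorFlow Java build, and the parsing assumes the include paths contain no spaces.

```shell
# Hypothetical helpers for inspecting a logged compiler command line.

# Print every -I directory found in a command line, one per line.
# Assumes that paths do not contain spaces.
include_dirs_from_cmd() {
  printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^-I\(..*\)$/\1/p'
}

# Report which of the given include directories contain a header.
# Usage: find_header <header-relative-path> <include-dir>...
# Returns 0 if the header was found at least once, 1 otherwise.
find_header() {
  local header="$1"
  shift
  local dir
  local found=1
  for dir in "$@"; do
    if [ -f "$dir/$header" ]; then
      echo "found: $dir/$header"
      found=0
    fi
  done
  if [ "$found" -ne 0 ]; then
    echo "missing: $header (searched $# directories)"
  fi
  return "$found"
}

# Example (paste the logged g++ line into a variable first):
#   find_header absl/base/attributes.h $(include_dirs_from_cmd "$gpp_cmd")
```

Running the same check on a machine where the build succeeds should show which include directory supplies the header there.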

Craigacp commented 3 weeks ago

No, in some cases the external folder in bazel-tensorflow-core-native contains extra folders linking to the dependencies whose headers we need, and we add those to the include path in the pom. I'm not sure why Bazel only puts them there some of the time. We haven't yet ruled out leftover state on the machine that works.
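
To see whether a given machine has those extra folders, a small check of the external folder might help. This is a sketch under assumptions: `check_external_repos` is a hypothetical helper, and the repository names in the example are guesses (only com_google_protobuf appears in the log above), so they should be adjusted to whatever the pom actually references.

```shell
# Hypothetical helper: report which dependency repositories are present
# under a Bazel "external" folder.
# Usage: check_external_repos <external-dir> <repo-name>...
# Returns 0 if all repositories are present, 1 if any is absent.
check_external_repos() {
  local external_dir="$1"
  shift
  local repo
  local status=0
  for repo in "$@"; do
    # -e follows symlinks, so a dangling link also counts as absent.
    if [ -e "$external_dir/$repo" ]; then
      echo "present: $repo"
    else
      echo "absent: $repo"
      status=1
    fi
  done
  return "$status"
}

# Example (repository names other than com_google_protobuf are assumptions):
#   check_external_repos bazel-tensorflow-core-native/external \
#     com_google_absl eigen_archive ml_dtypes com_google_protobuf
```

Comparing the output on the working and failing machines would show whether the difference is in what Bazel links into that folder.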