tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.
Apache License 2.0

FULL_COMPILATION=0 does not build all necessary targets #178

Closed mowoe closed 1 year ago

mowoe commented 1 year ago

Hi! I am trying to build an ARM wheel, which is a lot more challenging than I originally thought. Currently I'm building the wheel as part of a Docker image build (see below for the Dockerfile). This Dockerfile builds the wheel just fine, but the resulting package seems to be missing some modules:

$ python3 -c "import tensorflow_decision_forests as tfdf; print('Found TF-DF v' + tfdf.__version__)"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/tensorflow_decision_forests/__init__.py", line 64, in <module>
    from tensorflow_decision_forests import keras
  File "/usr/local/lib/python3.10/site-packages/tensorflow_decision_forests/keras/__init__.py", line 53, in <module>
    from tensorflow_decision_forests.keras import core
  File "/usr/local/lib/python3.10/site-packages/tensorflow_decision_forests/keras/core.py", line 59, in <module>
    from tensorflow_decision_forests.component.inspector import inspector as inspector_lib
  File "/usr/local/lib/python3.10/site-packages/tensorflow_decision_forests/component/inspector/inspector.py", line 64, in <module>
    from tensorflow_decision_forests.component import py_tree
  File "/usr/local/lib/python3.10/site-packages/tensorflow_decision_forests/component/py_tree/__init__.py", line 20, in <module>
    from tensorflow_decision_forests.component.py_tree import condition
  File "/usr/local/lib/python3.10/site-packages/tensorflow_decision_forests/component/py_tree/condition.py", line 26, in <module>
    from tensorflow_decision_forests.component.py_tree import dataspec as dataspec_lib
  File "/usr/local/lib/python3.10/site-packages/tensorflow_decision_forests/component/py_tree/dataspec.py", line 24, in <module>
    from yggdrasil_decision_forests.dataset import data_spec_pb2
ModuleNotFoundError: No module named 'yggdrasil_decision_forests.dataset'

I suspect the build rules defined in test-bazel.sh#167 are not correct, but I am not experienced enough with Bazel to work out the correct ones.


FROM python:3.10-buster
RUN apt update
RUN apt install -y git

# Installing JAX
RUN git clone -b jaxlib-v0.4.10 https://github.com/google/jax
WORKDIR /jax
RUN pip install numpy wheel
RUN git clone https://github.com/openxla/xla.git /xla
RUN python build/build.py --bazel_options=--override_repository=xla=/xla
RUN pip install dist/*.whl
RUN pip install -e .

# Installing bazel
WORKDIR /
RUN wget https://github.com/bazelbuild/bazel/releases/download/6.2.0/bazel-6.2.0-linux-arm64
RUN chmod +x /bazel-6.2.0-linux-arm64
RUN ln -s /bazel-6.2.0-linux-arm64 /usr/bin/bazel

WORKDIR /
# Use fork while PR #176 is not merged yet
RUN git clone https://github.com/mowoe/decision-forests
WORKDIR /decision-forests

COPY ./test_bazel.patch /decision-forests/test_bazel.patch
COPY ./build_pip_package.patch /decision-forests/build_pip_package.patch
COPY ./patched_gcc /patched_gcc
RUN chmod +x /patched_gcc
RUN rm /usr/bin/gcc
RUN ln -s /patched_gcc /usr/bin/gcc
RUN git apply build_pip_package.patch
RUN git apply test_bazel.patch

ENV TF_VERSION=2.13.0-rc0
ENV PY_VERSION=3.10
ENV FULL_COMPILATION=0
ENV TF_NEED_CUDA=0
RUN ./tools/test_bazel.sh

RUN apt update && apt install -y patchelf
RUN ./tools/build_pip_package.sh python3.10

Due to multiple issues, some pretty ugly patching is necessary:

patched_gcc (AVX is not available in the ARM Docker image, so -mavx is stripped):

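#!/bin/bash
# Wrapper around gcc-8 that drops the x86-only -mavx flag.

args=()
for arg in "$@"; do
  if [[ $arg != "-mavx" ]]; then
    args+=("$arg")
  fi
done

# Forward the remaining arguments unchanged.
exec gcc-8 "${args[@]}"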

test_bazel.patch (removes every line mentioning cuda from TensorFlow's .bazelrc):

diff --git a/tools/test_bazel.sh b/tools/test_bazel.sh
index 98af492..ae961bf 100755
--- a/tools/test_bazel.sh
+++ b/tools/test_bazel.sh
@@ -68,6 +68,15 @@ sed -i'.bak' -e "s/sha256 = \"${prev_shasum}\",//" WORKSPACE
 # Get build configuration for chosen version.
 TENSORFLOW_BAZELRC="tensorflow_bazelrc"
 curl https://raw.githubusercontent.com/tensorflow/tensorflow/${commit_sha}/.bazelrc -o ${TENSORFLOW_BAZELRC}
+tempfile=$(mktemp)
+
+while read line; do
+  if [[ $line != *"cuda"* ]]; then
+    echo "$line" >> "$tempfile"
+  fi
+done < "$TENSORFLOW_BAZELRC"
+
+mv "$tempfile" "$TENSORFLOW_BAZELRC"

 # Force a compiler
 # export CC=gcc-8
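
The filtering loop in the patch above could also be written with grep. The following is only a sketch of that alternative, not what the fork actually uses; the function name strip_cuda_lines is hypothetical. Unlike `while read line`, grep leaves leading whitespace and backslashes in the kept lines untouched.

```shell
#!/bin/sh
# Rewrite a file in place, dropping every line that contains "cuda".
# $1: path to the file (for example the downloaded tensorflow_bazelrc).
strip_cuda_lines() {
  scl_file="$1"
  scl_tmp="$(mktemp)" || return 1
  grep -v -F 'cuda' "$scl_file" > "$scl_tmp"
  scl_status=$?
  # grep exits 1 when no line survives the filter; only >1 is a real error.
  if [ "$scl_status" -gt 1 ]; then
    rm -f "$scl_tmp"
    return "$scl_status"
  fi
  mv "$scl_tmp" "$scl_file"
}

# Hypothetical use inside test_bazel.sh:
# strip_cuda_lines "${TENSORFLOW_BAZELRC}"
```

Like the original loop, the match is a case-sensitive substring test, so lines mentioning CUDA in upper case are kept.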

build_pip_package.patch:

diff --git a/tools/build_pip_package.sh b/tools/build_pip_package.sh
index dbef740..7e92567 100755
--- a/tools/build_pip_package.sh
+++ b/tools/build_pip_package.sh
@@ -154,35 +154,34 @@ function test_package() {
   if is_macos; then
     PACKAGEPATH="dist/tensorflow_decision_forests-*-cp${PACKAGE}-cp${PACKAGE}*-*.whl"
   else
-    PACKAGEPATH="dist/tensorflow_decision_forests-*-cp${PACKAGE}-cp${PACKAGE}*.manylinux2014_x86_64.whl"
+    PACKAGEPATH="dist/tensorflow_decision_forests-*-cp${PACKAGE}-cp${PACKAGE}*manylinux_2_28_aarch64.whl"
   fi
   ${PIP} install ${PACKAGEPATH}

-
   ${PIP} list
   ${PIP} show tensorflow_decision_forests -f

@@ -199,9 +198,9 @@ function e2e_native() {
     PACKAGEPATH="dist/tensorflow_decision_forests-*-cp${PACKAGE}-cp${PACKAGE}*-*.whl"
   else
     check_auditwheel ${PYTHON}
-    PACKAGEPATH="dist/tensorflow_decision_forests-*-cp${PACKAGE}-cp${PACKAGE}*-linux_x86_64.whl"
+    PACKAGEPATH="dist/tensorflow_decision_forests-*-cp${PACKAGE}-cp${PACKAGE}*-linux_aarch64.whl"
     TF_DYNAMIC_FILENAME="libtensorflow_framework.so.2"
-    ${PYTHON} -m auditwheel repair --plat manylinux2014_x86_64 -w dist --exclude ${TF_DYNAMIC_FILENAME} ${PACKAGEPATH}
+    ${PYTHON} -m auditwheel repair --plat manylinux_2_28_aarch64 -w dist --exclude ${TF_DYNAMIC_FILENAME} ${PACKAGEPATH}
   fi

   test_package ${PYTHON} ${PACKAGE}
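
The patch above hardcodes the aarch64 tags in place of the x86_64 ones. As a rough sketch of how one script could serve both, the platform tag might be derived from the machine architecture. The function name auditwheel_platform_for is hypothetical, and the two mappings are simply the tags that appear in the patch.

```shell
#!/bin/sh
# Map a machine architecture (as printed by `uname -m`) to the auditwheel
# platform tag. The two mappings come from the patch above; anything else
# is reported as unsupported.
auditwheel_platform_for() {
  case "$1" in
    x86_64)  echo "manylinux2014_x86_64" ;;
    aarch64) echo "manylinux_2_28_aarch64" ;;
    *)       echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Hypothetical use inside build_pip_package.sh:
# PLATFORM="$(auditwheel_platform_for "$(uname -m)")" || exit 1
# ${PYTHON} -m auditwheel repair --plat "${PLATFORM}" -w dist --exclude ${TF_DYNAMIC_FILENAME} ${PACKAGEPATH}
```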

Sidenote: Setting FULL_COMPILATION=1 causes the build to fail because of some unrelated tensorflow issues and shouldnt be necessary to build the library. As far as I can see it is the same issue described here. In any case, the error is in upstream tensorflow.

rstz commented 1 year ago

Hi, just chiming in briefly. I wasn't fully able to debug this problem, but I'll share what I know.

I don't believe test-bazel.sh#167 is the issue - the targets are ok. There seems to be an issue with the package structure. Can you upload the wheel you produced somewhere so I can inspect it?
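
One way to check the package structure locally is to list the files inside the wheel and look for the directory named in the traceback. This is a minimal sketch; the helper name listing_has_dir is hypothetical, and it assumes Info-ZIP's unzip is installed for the listing step.

```shell
#!/bin/sh
# Succeeds when the file listing on stdin contains at least one entry that
# starts with the given prefix, e.g. "yggdrasil_decision_forests/dataset/".
listing_has_dir() {
  awk -v prefix="$1" 'index($0, prefix) == 1 { found = 1 } END { exit !found }'
}

# Example (a wheel is a zip archive; `unzip -Z1` prints one file name per line):
# unzip -Z1 dist/tensorflow_decision_forests-1.3.0-cp310-cp310-linux_aarch64.whl \
#   | listing_has_dir yggdrasil_decision_forests/dataset/ \
#   || echo "yggdrasil_decision_forests/dataset/ is missing from the wheel"
```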

mowoe commented 1 year ago

Hi @rstz,

Thanks for your reply. Here are the wheels I built (GitHub only supports zips):
tensorflow_decision_forests-1.3.0-cp310-cp310-linux_aarch64.whl.zip
tensorflow_decision_forests-1.3.0-cp310-cp310-manylinux_2_28_aarch64.whl.zip

rstz commented 1 year ago

Thanks! It looks like build_pip_package.sh is either not compiling or not properly copying over the YDF files. The relevant lines are L119-L127 of build_pip_package.sh. Could you please check whether the necessary files are there before copying (in particular the Python files like data_spec_pb2.py in bazel-bin/external/ydf/yggdrasil_decision_forests/dataset/)?
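
That check could be scripted roughly as follows. This is a sketch only: the helper name report_missing_files is hypothetical, and the only path known from this thread is the data_spec_pb2.py one mentioned above.

```shell
#!/bin/sh
# Report which expected files are missing under a root directory.
# $1: root directory; remaining arguments: paths relative to that root.
# Prints each missing path and returns 1 if any of them is missing.
report_missing_files() {
  rmf_root="$1"
  shift
  rmf_status=0
  for rmf_path in "$@"; do
    if [ ! -f "$rmf_root/$rmf_path" ]; then
      echo "missing: $rmf_root/$rmf_path"
      rmf_status=1
    fi
  done
  return "$rmf_status"
}

# Example, run from the repository root after the bazel build:
# report_missing_files bazel-bin/external/ydf/yggdrasil_decision_forests \
#   dataset/data_spec_pb2.py
```

If the file is present under bazel-bin but absent from the wheel, the copy step rather than the compilation is the likely culprit.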

mowoe commented 1 year ago

Thank you so much for the hint, @rstz! The actual problem turned out to be that my minimal Debian image did not include rsync (as expected), and the build script did not fail but simply didn't execute the commands. Now that I have added rsync, I was able to build the ARM wheel successfully: tensorflow_decision_forests-1.3.0-cp310-cp310-manylinux_2_28_aarch64.whl.zip Sorry for wasting your time!
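
A small guard at the top of a build script could turn this kind of silent failure into an explicit error. The sketch below is not part of the TF-DF scripts; the function name missing_tools is hypothetical, and the tool list in the example only names tools that come up in this thread.

```shell
#!/bin/sh
# Print every tool from the argument list that is not on PATH.
# Returns 0 when all tools are present, 1 otherwise.
missing_tools() {
  mt_status=0
  for mt_tool in "$@"; do
    if ! command -v "$mt_tool" >/dev/null 2>&1; then
      echo "$mt_tool"
      mt_status=1
    fi
  done
  return "$mt_status"
}

# Example gate before packaging (the tool list is an assumption):
# missing_tools rsync patchelf git || { echo "install the tools listed above" >&2; exit 1; }
```

In the Dockerfile from the first comment, the corresponding fix is to add rsync to one of the apt install lines.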

rstz commented 1 year ago

yay! 🥳

If you're building this wheel for a specific project that you can share (either publicly or via email to me), feel free to do so. We're happy to know what people are working on with TF-DF 😄

mowoe commented 1 year ago

@rstz actually I built the ARM wheel for a specific project, but it might be a bit underwhelming: 😉

I use TF-DF for my numer.ai model. Currently I have to run an IPython notebook every other day, which is a bit tedious, but Numerai supports calling a webhook when submissions are due. Currently only AWS SageMaker FaaS stuff is documented, which I did not want to use for a number of reasons. Instead I wanted to use Fission, which is an open-source k8s FaaS framework. As I am doing all of this for fun and no profit, I didn't feel like spending any money on managed k8s like GKE or EKS. Oracle Cloud Infrastructure supports a three-node managed k8s cluster in its forever-free tier, which I have used for other projects. This has one major drawback, though: the compute instances are ARM instances. This is why I needed an ARM wheel.

TL;DR I automated a ~2min task by spending hours trying to build the ARM wheel for a made-up problem

rstz commented 1 year ago

I automated a ~2min task by spending hours trying to build the ARM wheel for a made-up problem

I love it 😅

Thank you for reporting back and sharing the wheel you built, and good luck in the competition!