larq / compute-engine

Highly optimized inference engine for Binarized Neural Networks
https://docs.larq.dev/compute-engine
Apache License 2.0
240 stars 33 forks source link

:arrow_up: tensorflow@2.8.0 #722

Closed CNugteren closed 2 years ago

CNugteren commented 2 years ago

What do these changes do?

This PR updates LCE to TensorFlow 2.8.0. Quite a few changes have been made, and most likely more are needed. I'll highlight some of the more controversial changes below in the form of a comment.

How Has This Been Tested?

All unittests have been ran locally. Fixes have been made as compilation, linking, or testing errors showed-up. However, some of them do not succeed yet, most notably the MLIR tests. This PR is therefore still a draft.

Benchmark Results

Latency has not been verified yet.

Related issue number

N/A.

Tasks

Open tasks before being able to turn the draft PR in an actual PR:

CNugteren commented 2 years ago

I've removed the old make-builds, and replaced it with the new CMake build system. I've only done something quite basic, building the library and the example and benchmark binaries. I tested the benchmark binary locally and that seems to work. So I think we are ready for a last round of reviews now.

lgeiger commented 2 years ago

I've removed the old make-builds, and replaced it with the new CMake build system. I've only done something quite basic, building the library and the example and benchmark binaries. I tested the benchmark binary locally and that seems to work. So I think we are ready for a last round of reviews now.

Awesome! 🚀

Just for reference, so we don't forget. Once this is merged and we've released a new version we will also need to update build docs.

lgeiger commented 2 years ago

I triggered a test release, let's see how this goes.

Tombana commented 2 years ago

I triggered a test release, let's see how this goes.

The windows looks like it succeeded within 12 minutes but actually the build failed with

The target you are compiling requires Visual C++ build tools. 
Bazel couldn't find a valid Visual C++ build tools installation on your machine.

The linux builds fail because the manylinux toolchain bazel files can't be found; they've probably been updated to newer versions.

lgeiger commented 2 years ago

The windows looks like it succeeded within 12 minutes but actually the build failed with

I pinned the Windows VM version to 2019 in 5bf5748f3d3b4258a3efce19f0d0d7ab9837d85f, let's see if this fixes the issue. Alternatively we could also change the BAZEL_VC path to make it work with Windows 2022, but not sure if we would run into other issues then.

The linux builds fail because the manylinux toolchain bazel files can't be found; they've probably been updated to newer versions.

I pinned the docker image to 2.8 in 3923692ac56690a87439536f82d960cbcfa02a29 which didn't seem to fix things.

It seems like the TensorFlow repo removed their toolchains from the repo. I copied the toolchain from the TF Addons repo in bdb6f2009d72cee0fa17a550e507782ccd8e0718. They use the same docker container to build the wheels as we do, so I expect this to work.

I triggered a new build to test these changes.

CNugteren commented 2 years ago

Two builds failed:

Update: the linux build fails with:

ERROR: /root/.cache/bazel/_bazel_root/db801a7aca49c33ba961fa7cde3e7bf7/external/org_tensorflow/tensorflow/compiler/mlir/tensorflow/BUILD:573:11: Compiling tensorflow/compiler/mlir/tensorflow/ir/tf_ops.cc failed: (Exit 4): crosstool_wrapper_driver_is_not_gcc failed: error executing command 
(...)
gcc: internal compiler error: Killed (program cc1plus)

But in the latest build it is the 3.8 version of the linux build that fails and the 3.9 passes, so that seems random: https://github.com/larq/compute-engine/runs/6036809834?check_suite_focus=true

CNugteren commented 2 years ago

The timeout increase did not work, it still times out after 6 hours: https://github.com/larq/compute-engine/runs/6036809795?check_suite_focus=true. Perhaps that is some global setting or a Github Actions maximum? In any case, this time two out of the 3 Windows builds passed since they finished in five and a half hours, only one timed out. This behaviour was also seen earlier without TF 2.8, it was always tight with this 6 hours maximum. So I do not consider this related to this branch.

My proposal is to merge this in and then investigate making the releases more stable in a separate PR (Linux builds random failures and Windows builds time outs)? I can open an issue for that.

Tombana commented 2 years ago

I agree, let's merge it.

I think I've seen this before and in that particular case the system had run out of memory. That could be fixed by adding swap space or changing the number of jobs that bazel runs in parallel. But maybe this problem is different.

gcc: internal compiler error: Killed (program cc1plus)