Use of tensorflow_decision_forests

mattomatic commented 1 year ago

Hi,

Firstly, thank you for your great work making the tensorflow library accessible within the java ecosystem.

I would like some guidance or info about how to incorporate the third-party library tensorflow_decision_forests (https://www.tensorflow.org/decision_forests) into tensorflow-java. In particular, I would like to take a trained model and run inference with it from a Java app.

I've given this a number of tries on my own. I am using the 0.5.0-SNAPSHOT that targets TF 2.9.1, along with TFDF version 0.2.7, which also targets TF 2.9.1. Without any modifications to tensorflow-java or any extra library loading, the error that arises is this:

2022-12-06 08:16:16.811242: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:301] SavedModel load for tags { serve }; Status: fail: NOT_FOUND: Op type not registered 'SimpleMLCreateModelResource' in binary running on localhost. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.. Took 16415 microseconds.

This error is expected, as I'm aware the library introduces custom TensorFlow operations which are probably needed to run inference on the saved model. I attempted to raid the shared library .so files from the library distribution and load them using TensorFlow.loadLibrary from within Docker containers, but this did not work for me:

TensorFlow.loadLibrary("/usr/local/anaconda3/lib/python3.8/site-packages/tensorflow_decision_forests/tensorflow/ops/inference/inference.so");
TensorFlow.loadLibrary("/usr/local/anaconda3/lib/python3.8/site-packages/tensorflow_decision_forests/tensorflow/ops/training/training.so");

I'm a little out of my depth as to how to get this to work properly, and I would appreciate any helpful info, resources or direction.

I note that the most similar looking issue recently was https://github.com/tensorflow/java/pull/468 -- which also involves loading shared libraries for a 3rd party library.

Craigacp commented 1 year ago

Did you get any logs or errors out of the TensorFlow.loadLibrary call? We have tried to make TFDF work before, but there was a bug in it which meant it wasn't compatible with the TF C API we use to interact with TF. I believe this was fixed (see https://github.com/tensorflow/decision-forests/issues/81).

mattomatic commented 1 year ago

Thanks again. I followed the instructions on https://github.com/tensorflow/decision-forests/issues/81 but did not get as far as the commenter there, because I get an error immediately on the call to TensorFlow.loadLibrary:

java.lang.UnsatisfiedLinkError: /tmp/inference.so: undefined symbol: _ZNK10tensorflow8OpKernel11TraceStringB5cxx11ERKNS_15OpKernelContextEb

    at org.tensorflow.TensorFlow.loadLibrary(TensorFlow.java:103)

Following the instructions there, I downloaded this Python wheel, extracted inference.so and placed it in /tmp/ on a remote CentOS 7 Linux machine, then tried to run a unit test there that calls loadLibrary:

    @Test
    public void testLoadLibrary() {
        OpList myInferenceOps = TensorFlow.loadLibrary("/tmp/inference.so");
        assertThat(myInferenceOps.getOpCount()).isGreaterThan(0);
    }

I also tried grabbing a few different versions of shared libraries from tensorflow_decision_forests, such as the newest ones, and tried on some different docker containers, but got the same error, or similar errors on calls to load library.
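
For anyone debugging the same failure, the missing symbol can be pulled out of the UnsatisfiedLinkError message programmatically, which makes it easier to compare against the symbols the TF Java binaries export. The following is only a minimal sketch: the class name LinkErrorInspector is hypothetical, and it assumes the message has the "undefined symbol: <name>" shape shown in the error above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of tensorflow-java): extracts the missing symbol from a link error. */
final class LinkErrorInspector {
    // Assumes the "undefined symbol: <name>" fragment seen in the error quoted in this thread.
    private static final Pattern UNDEFINED_SYMBOL = Pattern.compile("undefined symbol: (\\S+)");

    private LinkErrorInspector() {}

    /** Returns the symbol name if the message reports an undefined symbol, otherwise empty. */
    static Optional<String> undefinedSymbol(String message) {
        if (message == null) {
            return Optional.empty();
        }
        Matcher m = UNDEFINED_SYMBOL.matcher(message);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Message text copied from the error reported above.
        String message = "/tmp/inference.so: undefined symbol: "
                + "_ZNK10tensorflow8OpKernel11TraceStringB5cxx11ERKNS_15OpKernelContextEb";
        System.out.println(undefinedSymbol(message).orElse("<no undefined symbol reported>"));
    }
}
```

In a real test one would catch the UnsatisfiedLinkError thrown by TensorFlow.loadLibrary and pass its getMessage() value to undefinedSymbol.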

Craigacp commented 1 year ago

Hmm. Well that symbol has been in TF for years, so it's not new enough that we're missing it. Maybe it's not exported from our build of the native libraries? @karllessard is that something we can easily control?

karllessard commented 1 year ago

If you look at this patch, I think that is where we can define additional symbols to be exported when we compile TF.

But looking at the exported symbols in the TensorFlow binaries we are building, this symbol is present, except that it is missing the "B5cxx11" part:

                 U __ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb
                 I __ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb (indirect for __ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb)

So maybe we are missing a compiler option instead?

karllessard commented 1 year ago

Yeah, if I look at the Python distribution, the "B5cxx11" part is there. I think it is related to the value of the _GLIBCXX_USE_CXX11_ABI macro. @saudet might know?

saudet commented 1 year ago

Right, that library was built for the new ABI:

$ c++filt _ZNK10tensorflow8OpKernel11TraceStringB5cxx11ERKNS_15OpKernelContextEb
tensorflow::OpKernel::TraceString[abi:cxx11](tensorflow::OpKernelContext const&, bool) const

Which isn't compatible with CentOS 7...
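
To make the difference between the two symbols concrete: in Itanium C++ name mangling an ABI tag is encoded as 'B' followed by a length-prefixed name, so [abi:cxx11] appears as the literal "B5cxx11". The sketch below checks whether two mangled names differ only by that tag. The class AbiTag and its methods are hypothetical helpers, the check is a plain substring heuristic rather than a real demangler, and the old-ABI symbol is written with a single leading underscore (an assumption about how it would look on Linux).

```java
/** Hypothetical helper: compares Itanium-mangled names across the old and new libstdc++ ABI. */
final class AbiTag {
    // [abi:cxx11] is mangled as 'B' + <length><name>, i.e. the literal "B5cxx11".
    private static final String CXX11_TAG = "B5cxx11";

    private AbiTag() {}

    /** Heuristic: true if the mangled name contains the cxx11 ABI tag. */
    static boolean hasCxx11Tag(String mangled) {
        return mangled.contains(CXX11_TAG);
    }

    /** Removes every occurrence of the cxx11 ABI tag from the mangled name. */
    static String stripCxx11Tag(String mangled) {
        return mangled.replace(CXX11_TAG, "");
    }

    /** True when the names are different but become equal once the tag is stripped. */
    static boolean differOnlyByCxx11Tag(String a, String b) {
        return !a.equals(b) && stripCxx11Tag(a).equals(stripCxx11Tag(b));
    }

    public static void main(String[] args) {
        String wanted = "_ZNK10tensorflow8OpKernel11TraceStringB5cxx11ERKNS_15OpKernelContextEb";
        String exported = "_ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb";
        System.out.println("ABI mismatch only: " + differOnlyByCxx11Tag(wanted, exported));
    }
}
```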

karllessard commented 1 year ago

Ok, that's a real problem then... Will that work on other Linux distributions, like Ubuntu and Debian? I would like to avoid building multiple binaries for different distributions, but if only CentOS prevents us from supporting the loading of libraries like TFDF, it is something we might have to consider...

... and that would also mean that Tensorflow >= 2.9.0 is not supported on CentOS even when using Python?

saudet commented 1 year ago

glibc and libstdc++ are fully backward compatible, so that's not a problem, that's why using the old ABI works, and "manylinux2014" in the case of Python is essentially CentOS 7: https://peps.python.org/pep-0599/

They are migrating away from that by explicitly stating the glibc version instead: https://github.com/pypa/manylinux
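
As a rough illustration of the two tagging schemes: the legacy tags imply a fixed minimum glibc (manylinux1 is 2.5, manylinux2010 is 2.12, manylinux2014 is 2.17), while the newer manylinux_X_Y tags state it directly. The sketch below maps a wheel platform tag to that minimum glibc version. The class ManylinuxTag is a hypothetical helper written for this thread, not part of any packaging tool.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: maps a manylinux platform tag to the minimum glibc it implies. */
final class ManylinuxTag {
    // Newer style: manylinux_<major>_<minor>_<arch>, e.g. manylinux_2_17_x86_64.
    private static final Pattern EXPLICIT_GLIBC = Pattern.compile("^manylinux_(\\d+)_(\\d+)_.+$");

    private ManylinuxTag() {}

    /** Returns the minimum glibc version as "major.minor", or empty for non-manylinux tags. */
    static Optional<String> minimumGlibc(String platformTag) {
        Matcher m = EXPLICIT_GLIBC.matcher(platformTag);
        if (m.matches()) {
            return Optional.of(m.group(1) + "." + m.group(2));
        }
        // Legacy aliases with a fixed glibc baseline.
        if (platformTag.startsWith("manylinux2014_")) return Optional.of("2.17");
        if (platformTag.startsWith("manylinux2010_")) return Optional.of("2.12");
        if (platformTag.startsWith("manylinux1_")) return Optional.of("2.5");
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"manylinux2014_x86_64", "manylinux_2_17_x86_64"}) {
            System.out.println(tag + " requires glibc >= " + minimumGlibc(tag).orElse("unknown"));
        }
    }
}
```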

It's going to be a mess that TF Core is apparently already prepared to embrace. Since Ubuntu is essentially becoming the most widely used distribution, I'm thinking of just moving to that when CentOS 7 reaches EOL (June 2024). Obviously, Oracle is going to disagree, so we might as well start figuring out what we want to do about that here before 2024 arrives.

karllessard commented 1 year ago

Current builds fail only for users who want to load TF extensions, like TFDF, that were built on top of TF >= 2.9.0. I don't know what proportion of users that is, but building with the new ABI on Ubuntu will prevent all users on CentOS 7 from running TensorFlow with Java.

To summarize the possible solutions again (I'm not a fan of any of them):

1. Build TFJava Linux binaries on Ubuntu with the new ABI enabled (hence no longer working on CentOS 7)
2. Distribute two versions of TFJava Linux binaries, one with the new ABI and one without, where only the former will support TF extension libraries
3. Compile and distribute TF extension libraries ourselves, built on top of TF with the new ABI disabled

The last option is great in the sense that it could allow us to add extensions to TensorFlow Java by just adding a JAR, instead of having users download the Python wheel and extract the libraries themselves. But I don't know how much work that would be.
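
To give an idea of what the JAR-based approach could look like on the consumer side, here is a minimal sketch that copies a bundled native library to a temporary file and hands its path to a loader. Everything named here is an assumption for illustration: the class BundledLibraryLoader, the "tf-extension-" file prefix, and the idea of passing the loader as a Consumer<String> so that TensorFlow::loadLibrary could be plugged in.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Consumer;

/** Hypothetical helper: extracts a native library shipped inside a JAR and loads it. */
final class BundledLibraryLoader {
    private BundledLibraryLoader() {}

    /** Copies the stream to a fresh temp file with the given suffix and returns its path. */
    static Path extractToTempFile(InputStream library, String suffix) {
        try {
            Path target = Files.createTempFile("tf-extension-", suffix);
            target.toFile().deleteOnExit();
            // createTempFile already created the file, so overwrite it.
            Files.copy(library, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Extracts the library, then passes its absolute path to the supplied loader. */
    static Path extractAndLoad(InputStream library, String suffix, Consumer<String> loader) {
        Path extracted = extractToTempFile(library, suffix);
        loader.accept(extracted.toAbsolutePath().toString());
        return extracted;
    }

    public static void main(String[] args) {
        // Stand-in bytes; a real caller would read the .so from a classpath resource.
        InputStream fake = new ByteArrayInputStream(new byte[] {0x7f, 'E', 'L', 'F'});
        extractAndLoad(fake, ".so", path -> System.out.println("would load " + path));
    }
}
```

With tensorflow-java on the classpath, the loader argument could be TensorFlow::loadLibrary and the stream could come from getResourceAsStream on a resource path chosen by whoever packages the JAR.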

karllessard commented 1 year ago

So... I think the last option will end up being too much work for what our small team can contribute (unless someone new is willing to give it a try).

We've already deprecated usage of Java 8 by only maintaining the 0.4.x branch for these users (no new features, only bug and CVE fixes). We can redirect CentOS 7 users to that branch as well, and switch to Ubuntu on the 0.5.0 branch. Anyway, this branch already supports a version of TensorFlow that is officially not available for CentOS 7 users, regardless of their language bindings. So we can hardly do better than what TensorFlow/Google has decided to do.

Thoughts?

Craigacp commented 1 year ago

Fine by me, CentOS 7 is 8 years old at this point. The next step would be CentOS 8, but after Red Hat killed it, it hasn't seen a lot of market uptake. It's roughly the same glibc version as Ubuntu 18.04, but that's also coming to the end of its lifespan. Ubuntu 20.04 is probably a reasonable target, but I could see a good argument for 18.04 as well because that would catch anyone still on a RHEL 8 variant (e.g. Rocky Linux).

It's worth noting that Google have stopped making patches for TF 2.7 so our 0.4 branch will stop getting those fixes now.

saudet commented 1 year ago

This is odd, the packages on PyPI are still "manylinux2014": https://pypi.org/project/tensorflow/#files

That means they need to be compatible with CentOS 7, so what's going on here?

mattomatic commented 1 year ago

Hey I wanted to give you an update. The discussion went towards how to generically solve this problem with new vs old ABI. We took the path described above of compiling TFDF using the old ABI so that it would be compatible with tensorflow-java. This created binaries that I distributed in an internal JAR. I was able to successfully load and use this for inference. Huzzah :)

Sadly, while this solves the issue for me specifically, other users are still faced with the same core problem.

karllessard commented 1 year ago

Thanks for your update on this @mattomatic, and happy to hear that you are now unblocked.

I think the plan to fix that permanently is to start reusing in Java the TF binaries that are built for the Python wheels. Hopefully TF 2.12 will make this easier for us, as it distributes the C++ libraries we depend on separately from the Python wrappers.

I'll leave this issue open as a reminder to test it out when/if it is done.

ryanlakritz commented 1 year ago

@mattomatic I've actually been running into this exact issue as well. Thanks for starting this discussion and updating on your solution. Do you have any additional details on the steps you took to compile TFDF using the old ABI? I'm a bit unsure of how to proceed with that. Thanks in advance for your help!

tensorflow / java

Use of tensorflow_decision_forests #480