tensorflow / java

Java bindings for TensorFlow
Apache License 2.0

Unary VariantDecodeFn for type_name: tensorflow::data::WrappedDatasetVariant already registered #226

Open akiou opened 3 years ago

akiou commented 3 years ago



Describe the current behavior

I ran Python TensorFlow v2.3.1 and TensorFlow Java in the same Python process, with TensorFlow Java invoked through JNI via pyjnius. The following error was shown and the Python process aborted.

2021-02-25 07:57:21.005274: F external/org_tensorflow/tensorflow/core/framework/variant_op_registry.cc:46] Check failed: existing == nullptr (0x56489dc7a258 vs. nullptr)Unary VariantDecodeFn for type_name: tensorflow::data::WrappedDatasetVariant already registered
Aborted

More details:

  1. I prepared a TensorFlow model pre-trained with Python TensorFlow. The model is stored in a .pb file.
  2. I ran a Python process that:
    1. imports the necessary Python libraries, such as tensorflow and pyjnius.
    2. invokes JNI via pyjnius to load the stored model with SavedModelBundle.load().
  3. Step 2.2 raised the error described above.
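
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked repository: the function name `reproduce`, its parameters, and the `"serve"` tag are assumptions, and it presumes that tensorflow, pyjnius, and the TF Java jars are installed.

```python
def reproduce(model_dir: str, classpath: str) -> None:
    """Import Python TensorFlow first, then load a SavedModel through TF Java."""
    # Step 2.1: this loads Python TensorFlow's native library into the process.
    import tensorflow as tf  # noqa: F401

    # pyjnius needs the classpath before the JVM starts, i.e. before `jnius` is imported.
    import jnius_config
    jnius_config.set_classpath(classpath)
    from jnius import autoclass

    # Step 2.2: TF Java loads its own native library here; this is the call
    # that was reported to abort the process.
    SavedModelBundle = autoclass("org.tensorflow.SavedModelBundle")
    SavedModelBundle.load(model_dir, "serve")  # "serve" is an assumed tag
```

Calling `reproduce("/path/to/saved_model", "/path/to/jars/*")` with real paths would be expected to hit the abort shown above. Removing the `import tensorflow` line should correspond to the case that works.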

Describe the expected behavior

The user should be able to execute both (python and java) from a Python process without any conflict exceptions.

Code to reproduce the issue

I pushed sample code to https://github.com/akiou/tf_conflict. You can reproduce this error by using the sample code in that repo.

Other info / logs

The error message depends on the platform OS. The message in this issue was observed inside a Linux Docker container. If you run the sample code on macOS, the following error message is shown instead:

2021-02-25 17:32:28.532107: E tensorflow/core/lib/monitoring/collection_registry.cc:77] Cannot register 2 metrics with the same name: /tensorflow/core/eager_context_created
2021-02-25 17:32:28.532275: F tensorflow/core/framework/op.cc:62] Non-OK-status: RegisterAlreadyLocked(op_data_factory) status: Already exists: Op with name XlaLaunch
[1]    70088 abort      python test/conflict.py

Craigacp commented 3 years ago

I think we probably can't have the TF native library loaded twice into the same process. TF-Java and TF-Python will by necessity use slightly different builds of the native library (as it's compiled and released separately). What's your use case for having both of them loaded in the same process?
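
One way to check whether two copies of the native library really end up in the process is to list the shared objects mapped into it. The sketch below is a Linux-only diagnostic that parses `/proc/self/maps`. The function names are hypothetical, and matching on the substring "tensorflow" in the file name is a heuristic assumption.

```python
from __future__ import annotations

from pathlib import Path


def tensorflow_libs(maps_text: str) -> list[str]:
    """Return the distinct mapped files whose file name mentions 'tensorflow'."""
    paths = set()
    for line in maps_text.splitlines():
        # A maps line has five fixed fields, optionally followed by a path.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if "tensorflow" in Path(path).name.lower():
                paths.add(path)
    return sorted(paths)


def tensorflow_libs_in_this_process() -> list[str]:
    """Inspect the current process; returns [] where /proc is unavailable."""
    maps = Path("/proc/self/maps")
    return tensorflow_libs(maps.read_text()) if maps.exists() else []
```

Calling `tensorflow_libs_in_this_process()` after `import tensorflow`, and again after the JVM has loaded TF Java (if the process survives that long), shows which libraries were mapped. Two different TensorFlow libraries in the list would be consistent with the explanation above, although it does not prove that this is the cause of the abort.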

akiou commented 3 years ago

Can't we have the TF native library loaded twice in the same process even if both sides use the same TF version, such as Java 0.2.0 and Python 2.3.1 as described here?

About the use case:

  1. I have a project written in Java that depends on TF Java to execute a prebuilt model.
  2. I have another project written in Python:
    1. The Python project depends on the Java project to execute the model,
    2. and we also use the Python project to train models.

Craigacp commented 3 years ago

They are compiled separately and potentially with different options (e.g. MKL, GPUs, x86_64 features), leading to a conflict. It might be possible to persuade TF-Java to only load the JNI binding and not libtensorflow, but if you did hit native library issues we wouldn't support such a configuration. @saudet are there some flags on the JavaCPP loader which will enable this?

With 0.2.0 and the upcoming 0.3.0 release you should be able to train models in TF-Java (though not all operations have gradients available yet as they aren't all present in the native layer).

saudet commented 3 years ago

TF Core keeps a global state. The only way to get this working is by compiling a version of TF Core that supports both Java and Python APIs. I did that for TF 1.x and it works, but we would need to port this to TF 2.x and then we could maintain this here: https://groups.google.com/a/tensorflow.org/g/jvm/c/T964efemgek/m/OUe0uxV6DAAJ

akiou commented 3 years ago

Thank you for your replies and the reference link!

@saudet You mean that you will compile a version of TF Core that supports both Java and Python APIs for TF 2.x in the near future, right? According to the bytedeco example pom file in the google group thread, the TF java artifact for TF 1.x that works in the same process simultaneously is available. But the java artifact for TF 2.x is not available now, right?

saudet commented 3 years ago

I don't plan to do it myself, no, because we have no build machines that have enough power to complete the Python build anyway. If you have questions though, feel free to ask and I will help!

akiou commented 3 years ago

Then, are there any ways to build the java artifact for TF 2.x? I need the artifact that works in the same python process simultaneously

karllessard commented 3 years ago

@akiou, fwiw, I have a hybrid setup where I train in Python but do all the pre/post-processing and inference in Java, and I did not have the kind of issues you are mentioning here.

The way I do it is when the Python script starts for training, I use JPype to launch the JVM and load the Java library doing all of the stuff I've mentioned before, and invoke the Java classes directly from Python.

It really worked great in my case. If you are interested, I can share more details to help you set it up that way.
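
A setup like the one described might start the JVM along these lines. This is only a sketch under assumptions: the helper name, the jar paths, and the Java class name are hypothetical, and it requires the third-party JPype1 package.

```python
def start_jvm_and_get_class(jar_paths, class_name):
    """Start an in-process JVM with JPype (once) and return a proxy for a Java class."""
    import jpype  # third-party package, installed as JPype1

    if not jpype.isJVMStarted():
        # classpath accepts a list of jar files or directories.
        jpype.startJVM(classpath=list(jar_paths))
    return jpype.JClass(class_name)
```

For example, `start_jvm_and_get_class(["/path/to/app.jar"], "com.example.Preprocessor")` would return a Python proxy whose static methods and constructors call straight into the JVM. Both the path and the class name here are placeholders.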

akiou commented 3 years ago

@karllessard Perhaps there has been a misunderstanding.

In my case too, I do not get an exception when I merely execute a model in TF Java invoked from a Python process. The problem described in this issue occurs when TF Java executes the model from a Python process after that same process has imported TF 2.x. I need both TF Python 2.x and TF Java.

saudet commented 3 years ago

> Then, are there any ways to build the java artifact for TF 2.x? I need the artifact that works in the same python process simultaneously

You will need to perform the build as per this script with EXTENSION=-python set: https://github.com/bytedeco/javacpp-presets/blob/master/tensorflow/cppbuild.sh

akiou commented 3 years ago

@saudet I'd like to confirm one thing about the build with extension=-python. The build enables us to run a Python script from a Java process, as in https://github.com/bytedeco/javacpp-presets/blob/master/tensorflow/samples/KerasMNIST.java#L32-L49, right? If so, I think the javacpp build with extension=-python does not achieve what I want to do.

What I want to do is not to invoke TF Python 2.x from Java, but to invoke TF Java from a Python process that has already imported TF 2.x. This is the use case described in https://github.com/tensorflow/java/issues/226#issuecomment-786746188.

saudet commented 3 years ago

Sure, that's possible too. You'll need a way to use JNI from Python, but that's easily doable with tools like jpype, pyjnius, etc.

akiou commented 3 years ago

Thank you! I cloned the javacpp-presets repository, checked out tag 1.5.4, and executed the following command:

$ mvn clean install --projects .,tensorflow -Djavacpp.platform.extension=-python

Then some jars were generated in the tensorflow/target/ directory. So I should use the generated jars instead of tensorflow/java, right? But the jars do not include classes from tensorflow-java such as org.tensorflow.SavedModelBundle, so I suppose that the replacement would cause a compilation error.

Additionally, I tried to replace the TensorFlow version with 2.3.1 in pom.xml and cppbuild.sh, but the Maven command failed. I think I need to set the TensorFlow version to 2.3.1, so how can I specify it?

saudet commented 3 years ago

> Thank you! I cloned the javacpp-presets repository, checked out tag 1.5.4, and executed the following command:
>
> $ mvn clean install --projects .,tensorflow -Djavacpp.platform.extension=-python
>
> Then some jars were generated in the tensorflow/target/ directory. So I should use the generated jars instead of tensorflow/java, right? But the jars do not include classes from tensorflow-java such as org.tensorflow.SavedModelBundle, so I suppose that the replacement would cause a compilation error.

Yes, the same class is available in that JAR file: http://bytedeco.org/javacpp-presets/tensorflow/apidocs/org/tensorflow/SavedModelBundle.html

> Additionally, I tried to replace the TensorFlow version with 2.3.1 in pom.xml and cppbuild.sh, but the Maven command failed. I think I need to set the TensorFlow version to 2.3.1, so how can I specify it?

That's what I keep telling you: Someone will need to work on updating that for TF 2.x.

akiou commented 3 years ago

> That's what I keep telling you: Someone will need to work on updating that for TF 2.x.

I see. So, if I understand correctly, a jar for TF 2.x that can be invoked from a Python process that has already imported TF 2.x will not be available until someone updates the build for TF 2.x.

akiou commented 3 years ago

Additionally, I have another question (sorry for so many questions).

> Yes, the same class is available in that JAR file: http://bytedeco.org/javacpp-presets/tensorflow/apidocs/org/tensorflow/SavedModelBundle.html

I tried to insert the statement SavedModelBundle savedModel = SavedModelBundle.load(...); just after this line, but the following error occurred. Can I really use SavedModelBundle in the same way as shown in the apidocs? I used the pom.xml as it is, and I specified the path of a model trained with TF 1.x.

java.lang.UnsatisfiedLinkError: org.tensorflow.SavedModelBundle.load(Ljava/lang/String;[Ljava/lang/String;[B[B)Lorg/tensorflow/SavedModelBundle;
    at org.tensorflow.SavedModelBundle.load (Native Method)
    at org.tensorflow.SavedModelBundle.access$000 (SavedModelBundle.java:27)
    at org.tensorflow.SavedModelBundle$Loader.load (SavedModelBundle.java:32)
    at org.tensorflow.SavedModelBundle.load (SavedModelBundle.java:95)
    at KerasMNIST.main (KerasMNIST.java:21)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
    at java.lang.Thread.run (Thread.java:748)

saudet commented 3 years ago

Make sure that Loader.load(tensorflow.class) has been called: https://github.com/bytedeco/javacpp-presets/tree/1.5.4/tensorflow#documentation

akiou commented 3 years ago

I was able to load the model by using SavedModelBundle.load()! Thank you! But of course, a model trained with TF 2.x still cannot be executed in Java. So I'm waiting for someone from the TensorFlow team to look into this issue and build a Java artifact for TF 2.x that can be invoked from a Python process that has already imported tensorflow.

saudet commented 3 years ago

@akiou Since building TF Core is quite a challenge in itself, I've been experimenting with linking and loading the _pywrap_tensorflow_internal.so file that comes with the binary distributions of TensorFlow on PyPI, instead of libtensorflow_cc.so.2 that gets built by default for the target of the C++ API, and it works! The tests pass and everything. If you are comfortable hacking with shared libraries, please feel free to do that.

@karllessard @Craigacp That would also be one way to circumvent our build issues. It would add a dependency on CPython, but it works, and it would make it possible to use both the Java and Python APIs in the same process. (To be clear, we don't need to use Python, we just need to link with the native CPython library to satisfy the undefined symbols.)
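
For anyone who wants to try this, the library first has to be located inside an installed TensorFlow package. The sketch below searches the package directory recursively instead of assuming a fixed location. The function names are hypothetical, and the file-name pattern is taken from the name mentioned above.

```python
from __future__ import annotations

import importlib.util
from pathlib import Path


def find_pywrap_libs(package_dir: Path) -> list[Path]:
    """Recursively find files named _pywrap_tensorflow_internal.* under package_dir."""
    pattern = "_pywrap_tensorflow_internal.*"
    return sorted(p for p in package_dir.rglob(pattern) if p.is_file())


def installed_pywrap_libs() -> list[Path]:
    """Locate the library in the installed tensorflow package without importing it."""
    spec = importlib.util.find_spec("tensorflow")
    if spec is None or not spec.submodule_search_locations:
        return []  # tensorflow is not installed in this environment
    found: list[Path] = []
    for location in spec.submodule_search_locations:
        found.extend(find_pywrap_libs(Path(location)))
    return found
```

`installed_pywrap_libs()` returns an empty list when TensorFlow is not installed. Linking the Java bindings against the file it finds is the manual step described above and is not covered by this sketch.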

JiangWork commented 3 years ago

I have run into the same issue. Is there any plan to upgrade to TF 2.x so that it can be called from Python and Java in the same process?