tensorflow / java

Java bindings for TensorFlow

The Java Tensorflow library does not seem to be using GPU #140

Open tmichniewski opened 3 years ago

tmichniewski commented 3 years ago

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Here is the result of the capture script: tf_env.txt

Describe the current behavior We tested the new TensorFlow Java API (not the legacy one), the brand new version released in October 2020. We tested it on several machines, including Azure Databricks NC6_v3 and Azure Virtual Machines (the capture log is from the virtual machine). I noticed that when no GPU is available the library falls back to the CPU, and this is fine. However, we also measured the time of some example processing (a few vector operations), and we see no significant difference between processing time on the GPU and on the CPU. It looks as if the GPU is not being used even when one is present (we tried two graphics cards: a Tesla K80 with compute capability 3.7 and a Tesla V100 with compute capability 7.0). In both cases we do not see any difference in processing time.

Describe the expected behavior The expected behaviour is significantly shorter execution times when the program is executed on a machine with a GPU present.

Code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem.

We used the following Java program:

HelloTensorFlow_java.txt pom_xml.txt

The source was compiled to a class file and run with the following command: java -classpath protobuf-java-3.8.0.jar:ndarray-0.2.0.jar:javacpp-1.5.4.jar:javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.2.0.jar:tensorflow-core-api-0.2.0-linux-x86_64-gpu.jar:tensorflow-core-platform-gpu-0.2.0.jar:. HelloTensorFlow

The listed libraries were downloaded from https://oss.sonatype.org/.
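The attached HelloTensorFlow_java.txt is only linked above, so for readers of this thread here is a minimal, hypothetical sketch of this kind of benchmark against the new API. It is written from memory of the tensorflow-core-api 0.2.0 names (Ops, tf.constant(float[]), Session.Runner), which may differ slightly between releases; the class name and the exact operations are illustrative only:

```java
import java.util.Arrays;

import org.tensorflow.Graph;
import org.tensorflow.Operand;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.op.Ops;
import org.tensorflow.types.TFloat32;

public class VectorBenchmark {
  public static void main(String[] args) {
    float[] data = new float[10_000_000];   // large input vector
    Arrays.fill(data, 1.5f);

    try (Graph graph = new Graph()) {
      Ops tf = Ops.create(graph);
      Operand<TFloat32> x = tf.constant(data);
      // A few element-wise vector operations, similar in spirit to the reported test.
      Operand<TFloat32> y = tf.math.mul(tf.math.add(x, x), x);

      try (Session session = new Session(graph)) {
        long start = System.nanoTime();
        session.runner().fetch(y).run().forEach(Tensor::close);
        System.out.printf("Run took %.3f s%n", (System.nanoTime() - start) / 1e9);
      }
    }
  }
}
```

Running this with the GPU artifacts on the classpath (as in the command above) and then without them should make any placement difference visible in the timings.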

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

The enclosed program produces the following log: log.txt

From the log you can see that the GPU was present and recognized. However, the execution time did not differ between runs with and without the GPU.

karllessard commented 3 years ago

Thanks a lot for the investigation @zaleslaw, that is very appreciated, and these results are closer to what we can expect from a GPU with TF Java. So I guess the reason we were not observing this delta previously was just the example that we've been using? My concern, though, is why the same example shows a performance improvement in Python but not in Java; there is still something that we are missing here.

tmichniewski commented 3 years ago

Hello Alexey @zaleslaw, could you perform an additional test on your machine, but this time with our example, the one Karl mentioned? It is a plain vector computation rather than a sophisticated machine learning algorithm.

zaleslaw commented 3 years ago

@tmichniewski will try to reproduce and share results

zaleslaw commented 3 years ago

@tmichniewski @karllessard There are a lot of different code snippets; I used the one from here.

My results:

GPU:

Time [s.] 2.402464902
Bye bye
test_gpu_loop/Enter: (Enter): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/Merge: (Merge): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/cond/Less: (Less): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/LoopCond: (LoopCond): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/Switch: (Switch): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Mul_1: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Div: (Div): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Add_1: (Add): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/NextIteration: (NextIteration): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/Exit: (Exit): /job:localhost/replica:0/task:0/device:GPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/cond/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_2: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_3: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_4: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_5: (Const): /job:localhost/replica:0/task:0/device:GPU:0

Process finished with exit code 0

CPU:

Time [s.] 1.63166006
Bye bye
test_gpu_loop/Enter: (Enter): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/Merge: (Merge): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/cond/Less: (Less): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/LoopCond: (LoopCond): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/Switch: (Switch): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Mul: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Mul_1: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Div: (Div): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Add: (Add): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Add_1: (Add): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/NextIteration: (NextIteration): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/Exit: (Exit): /job:localhost/replica:0/task:0/device:CPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/cond/Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_3: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_4: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_5: (Const): /job:localhost/replica:0/task:0/device:CPU:0

Looks like the GPU is slower than the CPU, but the task is quite small and the overhead is significant.

I could test something else on my NVIDIA card if you need.

tmichniewski commented 3 years ago

Hello @zaleslaw, if the task seems too simple, it could be increased to a vector of 100 million elements. In Python we observed that the GPU version was much faster than the CPU one, on the order of seconds versus minutes. Why has no one observed this with the Java API? What is wrong in the Java API?

Craigacp commented 3 years ago

Expanding the vector size doesn't stress the right places. If (as it appears given the information we have) there's something going wrong with the initial setup and copy, then increasing the vector size increases that cost as well as the computational workload.

We're not sure what's happening yet; clearly something is wrong, but we haven't found the root cause.
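One way to narrow this down (hypothetical names, and assuming the tensorflow-core-api 0.2.0 Session/Tensor API) is to time the phases separately: session creation, a first run that pays for device initialization and host-to-device copies, and a second run that is mostly compute. Reusing a graph and output operand like the ones in the sketch earlier in this thread:

```java
import java.util.List;

import org.tensorflow.Graph;
import org.tensorflow.Operand;
import org.tensorflow.Session;
import org.tensorflow.Tensor;

final class PhaseTiming {
  /** Times session creation, a warm-up run and a steady-state run separately. */
  static void timePhases(Graph graph, Operand<?> output) {
    long t0 = System.nanoTime();
    try (Session session = new Session(graph)) {
      long t1 = System.nanoTime();
      runOnce(session, output);   // first run: device initialization, kernel selection, copies
      long t2 = System.nanoTime();
      runOnce(session, output);   // second run: mostly the actual computation
      long t3 = System.nanoTime();
      System.out.printf("session init %.3f s, first run %.3f s, second run %.3f s%n",
          (t1 - t0) / 1e9, (t2 - t1) / 1e9, (t3 - t2) / 1e9);
    }
  }

  private static void runOnce(Session session, Operand<?> output) {
    List<Tensor<?>> results = session.runner().fetch(output).run();
    results.forEach(Tensor::close);
  }
}
```

If the first run dominates on the GPU but the second run is fast, the overhead is in setup and data transfer rather than in the compute itself.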

tmichniewski commented 3 years ago

Is there anyone who would be able to debug it and check what the cause of the problem is? Why is it so difficult?

karllessard commented 3 years ago

Hi @tmichniewski, I think what makes it a bit difficult for us to debug is that:

  1. GPUs are handled entirely by the TensorFlow C++ library and normally Java should have nothing to do with it, so we need to dig deeper into the TensorFlow runtime and Python client to understand what is going wrong with your test application, while our expertise is more on the Java bindings themselves.

  2. Not all of us, myself included, have access to a local GPU machine.

  3. This is a community-driven project and all collaborators work on it of their own free will and in their own time.

So yes, I think we are all interested in understanding the problem better, and in the meantime, if you can please continue your investigations, the project would surely benefit from it.

Thanks, Karl

tmichniewski commented 3 years ago

Hello Karl @karllessard,

Regarding 1: have you looked at the Python TF client to see how it uses the C++ library and what the differences are? In fact, the Java TF API should mimic what the Python client is doing, so it is worth analysing even independently of my issue report.

Regarding 2: sure, if you provide me with any source code (recommended improvements, configuration or logging), I can give it a go. But as I understand it, we tested all the improvements you recommended, and they only confirmed that something is wrong without giving a precise explanation. We can still repeat these tests, but I would need some source code input from you.

Regarding 3: finally, maybe it is worth trying to compare with version 1.15.0 or 2.3.1 of the Java TF API, i.e. what is now the legacy version (but only as of last month). Maybe these versions worked somehow differently? At least I got messages that, to make it work on my GPU with compute capability 7.0, I would have to build this API from sources, as there was some issue with PTX files or something like that, and only the most recent compute capabilities were supported by default. So maybe this is a clue?

karllessard commented 3 years ago

Have you looked at the Python TF client to see how it uses the C++ library and what the differences are?

One major difference is that the TensorFlow runtime has a thin ABI layer called the "C API", which should be used by language binding libraries like TF Java to guarantee stability between TF releases, but we all know that Python bypasses this layer in many cases and accesses the internal C++ classes directly. So maybe the C API is creating the overhead we observe, or maybe it is not exposing everything we need to enable GPU computation properly.

Finally, maybe it is worth trying to compare with version 1.15.0 or 2.3.1 of the Java TF API

I completely agree, that would be an interesting test to do. The legacy version of TF Java also calls exclusively the C API, but the linkage is done differently (the legacy version has custom JNI code to bind to the runtime library, while the new version uses JNI bindings auto-generated with JavaCPP).

I don't expect the performance to be that different between the two versions, but if it is, it will be much simpler to drill down to the actual cause of the problem.
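For what it's worth, a comparison against the legacy API could be sketched roughly as below. This assumes the 1.15.0 artifacts (libtensorflow plus libtensorflow_jni_gpu) on the classpath and uses the legacy opBuilder-based graph construction; the class name and the exact ops are illustrative only:

```java
import java.util.Arrays;

import org.tensorflow.DataType;
import org.tensorflow.Graph;
import org.tensorflow.Output;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.TensorFlow;

public class LegacyComparison {
  public static void main(String[] args) {
    // Version of the native runtime that was actually loaded.
    System.out.println("Native runtime version: " + TensorFlow.version());

    float[] data = new float[10_000_000];
    Arrays.fill(data, 1.5f);

    try (Graph graph = new Graph();
         Tensor<?> input = Tensor.create(data)) {
      // The legacy API has no Ops class; operations are built through opBuilder.
      Output<?> x = graph.opBuilder("Placeholder", "x")
          .setAttr("dtype", DataType.FLOAT)
          .build().output(0);
      Output<?> y = graph.opBuilder("Mul", "y")
          .addInput(x).addInput(x)
          .build().output(0);

      try (Session session = new Session(graph)) {
        long start = System.nanoTime();
        session.runner().feed("x", input).fetch(y).run().forEach(Tensor::close);
        System.out.printf("Run took %.3f s%n", (System.nanoTime() - start) / 1e9);
      }
    }
  }
}
```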

On another note, the example from @zaleslaw showed that the GPU is effectively supported by the new TF Java, and once the model has been built and is running in the C++ runtime, there is no interaction with the JVM that could impact performance. So for now my suspicions are that:

  1. Most of the time in your example is spent before or after running the graph (in the C API maybe?)
  2. The graph session is not initialized properly (e.g. Python could use an internal tweak to enable more GPU kernels); see the configuration sketch below.

Karl
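Regarding suspicion 2, one thing that is easy to try from Java is passing an explicit session configuration and checking whether it changes anything. A minimal sketch, assuming the tensorflow-core-api 0.2.0 Session(Graph, ConfigProto) constructor and the generated protobuf classes under org.tensorflow.proto.framework (the helper name is hypothetical):

```java
import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.proto.framework.ConfigProto;
import org.tensorflow.proto.framework.GPUOptions;

public class SessionConfigSketch {
  static Session createConfiguredSession(Graph graph) {
    ConfigProto config = ConfigProto.newBuilder()
        .setLogDevicePlacement(true)          // print the device chosen for every op
        .setAllowSoftPlacement(true)          // fall back to CPU for ops without a GPU kernel
        .setGpuOptions(GPUOptions.newBuilder()
            .setAllowGrowth(true)             // do not grab all GPU memory up front
            .build())
        .build();
    return new Session(graph, config);
  }
}
```

setLogDevicePlacement(true) in particular makes the runtime print the device chosen for each op, which is presumably how placement listings like the ones above are produced.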

unik00 commented 1 year ago

In my case, it was because libtensorflow.jar was the latest version while the libtensorflow_jni.so I used was outdated. Even though it still ran without errors, GPU utilization was very low. Not sure if this is your issue.
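A quick way to catch that kind of mismatch is to print the version reported by the native library that actually got loaded and compare it with the jar version. A sketch using the legacy entry point (TensorFlow.version() exists in the legacy API; the class name here is illustrative):

```java
import org.tensorflow.TensorFlow;

public class VersionCheck {
  public static void main(String[] args) {
    // Prints the version of the native libtensorflow_jni that was actually loaded;
    // if this does not match the libtensorflow.jar version, the setup is inconsistent.
    System.out.println("TensorFlow native version: " + TensorFlow.version());
  }
}
```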