tensorflow / java

Java bindings for TensorFlow
Apache License 2.0

The Java Tensorflow library does not seem to be using GPU #140

Open tmichniewski opened 3 years ago

tmichniewski commented 3 years ago

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)".

Here is the result of the capture script: tf_env.txt

Describe the current behavior We tested the new TensorFlow Java API (not the legacy one), the brand new version released in October 2020. We tested it on several machines, including Azure Databricks NC6_v3 and Azure Virtual Machines (the capture log is from the virtual machine). I noticed that when no GPU is available the library falls back to CPU, and this is fine. However, we also measured the time for some example processing (a few vector operations), and we see no significant difference between processing time on GPU and on CPU. It looks as if the GPU is not being used, even when it is present (we tried two graphics cards: Tesla K80 with compute capability 3.7 and Tesla V100 with compute capability 7.0). In both cases we do not see any difference in processing time.

Describe the expected behavior Expected behaviour is that execution times are significantly better when the program is executed on a machine with a GPU present.

Code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem.

We used the following Java program:

HelloTensorFlow_java.txt pom_xml.txt

The source was compiled to a class file and run via the following command: java -classpath protobuf-java-3.8.0.jar:ndarray-0.2.0.jar:javacpp-1.5.4.jar:javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.2.0.jar:tensorflow-core-api-0.2.0-linux-x86_64-gpu.jar:tensorflow-core-platform-gpu-0.2.0.jar:. HelloTensorFlow

The listed libraries were downloaded from https://oss.sonatype.org/.

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

The enclosed program issues the following log: log.txt

From the log you can see that the GPU was present and recognized. However, the execution time did not differ whether we ran with the GPU or without it.

karllessard commented 3 years ago

Hi @tmichniewski ,

Can you assign explicitly an operation to a GPU device and see if that works?

Now, there is an ongoing effort to make device selection easier in TF Java; right now you unfortunately need to use the lower-level API for building an operation.

For example, instead of simply calling tf.math.add(x, y), you need to do something like:

DeviceSpec gpuSpec = DeviceSpec.newBuilder().deviceType(DeviceType.GPU).deviceIndex(0).build();
...
graph.opBuilder(Add.OP_NAME, "Add1")
    .setDevice(gpuSpec.toString())
    .addInput(x.asOutput())
    .addInput(y.asOutput())
    .build();
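
For context, an untested end-to-end sketch of that approach could look like the following. The class name, the "Add1" op name and the hard-coded "/GPU:0" device string are illustrative only; the calls (graph.opBuilder, setDevice, Session.runner) are the tensorflow-core-api 0.2.0 ones used elsewhere in this thread:

import org.tensorflow.Graph;
import org.tensorflow.Operation;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.op.math.Add;
import org.tensorflow.types.TFloat64;

public class GpuPlacementSketch {
  public static void main(String[] args) {
    try (Graph graph = new Graph()) {
      Ops tf = Ops.create(graph);

      // Two operands to add; any inputs would do.
      Constant<TFloat64> x = tf.constant(3.14);
      Constant<TFloat64> y = tf.constant(2.72);

      // Build the Add op through the low-level builder so the device can be pinned explicitly.
      Operation add = graph.opBuilder(Add.OP_NAME, "Add1")
          .setDevice("/GPU:0") // or gpuSpec.toString()
          .addInput(x.asOutput())
          .addInput(y.asOutput())
          .build();

      // Run the graph and fetch the single output of the pinned op.
      try (Session s = new Session(graph);
           Tensor<?> result = s.runner().fetch(add.output(0)).run().get(0)) {
        System.out.println("Add executed");
      }
    }
  }
}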
karllessard commented 3 years ago

You can also log device placement by explicitly passing a ConfigProto message when instantiating your Session, using this constructor
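
A minimal sketch of that, assuming the ConfigProto class from org.tensorflow.proto.framework shipped with tensorflow-core-api:

ConfigProto config = ConfigProto.newBuilder()
    .setLogDevicePlacement(true)
    .build();
try (Graph g = new Graph(); Session s = new Session(g, config)) {
  // the placement of every op is written to the TF log when the session runs
}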

tmichniewski commented 3 years ago

Hello @karllessard, referring to the first recommendation: are you sure we can call deviceType on DeviceSpec.newBuilder()? My IntelliJ complains about this.

tmichniewski commented 3 years ago

Basically I am looking for the simplest solution, just for testing the performance.

tmichniewski commented 3 years ago

BTW - I am using version 0.2.0.

karllessard commented 3 years ago

@tmichniewski , to be honest I haven't tested the code snippet I posted, but something along those lines should work. You can also simply pass the device spec directly as a string for now, as described here. I'll be curious to see what device placement logging tells you as well, if you get a chance to activate it.

tmichniewski commented 3 years ago

I already used the hardcoded string "/GPU:0".

tmichniewski commented 3 years ago

Hello @karllessard, I ran a test on an Azure Databricks NC6s_v3 machine with a Tesla V100 graphics card (compute capability 7.0). With the device set to "/CPU:0" the execution time is 0.23s. With the device set to "/GPU:0" the execution time is 0.62s. So not only does it not reduce the execution time, it increases it. The current code is as follows: HelloTensorFlow_java.txt

tmichniewski commented 3 years ago

BTW - on this cluster in Python we got the following times: CPU 1.00s, GPU 0.25s. PythonHelloTensorFlow_py.txt So in Python the GPU is 4 times faster in this exercise, but in Java the GPU version is slower than the CPU version.

tmichniewski commented 3 years ago

Maybe I should also set the device on the session or the graph?

tmichniewski commented 3 years ago

But then it would make no sense to set it at the operation level. Do you have any working example of how to perform, let's say, vector addition on GPU?

Craigacp commented 3 years ago

Why not try running the Java version in eager mode, like the Python version?

tmichniewski commented 3 years ago

Firstly, because eager mode is only for development, not production. Secondly, for tests I might of course use it, but do you have a working example showing how to pass the device on which TF is executed? Originally I used eager mode, but Karl (in his first comment) suggested using the low-level API.

Craigacp commented 3 years ago

But the Python example is in eager mode, so let's rule out differences in the TF runtime engine first, before moving on to issues inside the Java API. I'm sure there are speed issues in the Java API, but it's best to start with equivalent metrics. Either run the Java one in eager mode, or the Python one as a tf.function.
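
For reference, a minimal eager-mode sketch in Java (untested here; it assumes the usual imports and the same constants x, y, a, b and the same expression as the attached test program, created through an Ops instance that is not bound to a Graph):

Ops tf = Ops.create(); // no Graph: ops run eagerly, as in the Python script
Operand<TFloat64> z = tf.reduceSum(
    tf.math.add(tf.math.div(tf.math.mul(tf.math.mul(x, y), a), x), b),
    tf.constant(0));
System.out.println(z.data().getDouble()); // copies the scalar result back into the JVM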

tmichniewski commented 3 years ago

@Craigacp Hello Adam,

I am not comparing Python times to Java times. I am trying to compare GPU time to CPU time. The Python code was only to show that the GPU version was roughly 4x faster than the CPU one. Or in other words, I am trying to run the TF Java API on GPU, so far without success.

Right - in Python it was in eager mode, and the same in the original Java version, I guess. But this issue is about how to make the Java API execute on GPU. So far I have been pointed to the low-level API, but this does not work. That is why I am asking for a working example.

tmichniewski commented 3 years ago

Moreover, Karl also pointed me in his third comment to the following description, which clearly states that if there is a choice, then GPU takes precedence:

"If a TensorFlow operation has both CPU and GPU implementations, by default the GPU devices will be given priority when the operation is assigned to a device. For example, tf.matmul has both CPU and GPU kernels. On a system with devices CPU:0 and GPU:0, the GPU:0 device will be selected to run tf.matmul unless you explicitly request running it on another device."

karllessard commented 3 years ago

I cannot think of any reason why running graph ops on a GPU in Java would be slower than in Python, since this is all handled by the C++ core library shared by both languages. So if it is really what we are observing, then it has to be a misconfiguration somewhere.

Another idea, can you check @tmichniewski if the performance goes back to normal if you remove the line inside the loop where you fetch the result from the tensor? Maybe you are fetching data from the GPU memory to the JVM and that is doing extra copies that are not required when using Python?

tmichniewski commented 3 years ago

I also do not believe that the Java version could be slower than the Python one. But that is not the issue here. The problem is that it seems the GPU is not used at all.

The previous version of this library, 2.3.0, the so-called legacy one (since last weekend), was saying that the kernel image was incorrect, and there was a suggestion to build TF from sources. But here the library silently falls back to CPU and you never know whether the GPU is used, except by measuring time.

Referring to the second paragraph: well, I thought that it is always necessary to get the data back from the tensor. BTW, we read only one double value, because the last operation in the graph is reduce_sum, so we take 8 bytes instead of a big vector. Therefore it cannot be the reason for the slow performance.

PS. Do you test it on GPU? I guess so, and therefore I think you should have some observations and know-how about how to make it work on GPU.

tmichniewski commented 3 years ago

Maybe there is some issue with the graphics card compute capability. In our case it is 7.0, and it was also 3.7. Does this library work with cards with such parameters? If not, this would explain why it silently falls back to CPU.

Or maybe the issue is that the choice of GPU device is not specified correctly. So I ask for some example of how to do this, for example how to perform vector addition on GPU. A very simple example. So far I have provided two versions, eager and graph, and neither is using the GPU.

karllessard commented 3 years ago

Sorry @tmichniewski but I'm not personally using GPUs nor do I have access to one right now, so I might not be the best person to provide you with a concrete working example here; maybe @zaleslaw or @saudet can chime in?

Again, I think logging device placement might reveal some explanations as well, see the Session constructor link I've provided earlier.

saudet commented 3 years ago

I haven't been testing it really closely, but I know at least @roywei and @lanking520 have been using the GPU builds without too much trouble.

tmichniewski commented 3 years ago

Hello @karllessard, @saudet, @roywei, @zaleslaw and @lanking520,

I can also confirm that it is possible to use the GPU builds without trouble. But the key question here is whether the GPU builds really use the GPU when it is present on a machine. My tests seem to show that no, they are not using the GPU. At least so far I don't know how to make them use it.

I provided two sample Java programs. The first is a simple vector computation where I expected that providing the GPU versions of the libraries would make TF use the GPU, but that was not the case. The second follows Karl's recommendation, where I additionally specified the device on the graph operations. Also without success.

Now, since this library silently falls back to CPU if a GPU is not present, or if for some reason the library is not willing to use it, how do you perform the unit tests?

One possibility is comparing the execution times, but this is really delicate. What are the other possibilities? Because I think all the tests may pass after the library falls back to CPU, while they should fail when a GPU is present on the machine and was requested.

In my opinion there should be a possibility in the API to request computation on GPU, and if this cannot be satisfied, the unit tests and the whole API should raise an exception saying that it cannot execute on the requested GPU for some reason.

Concluding, I repeat my original issue: how do I convince the TensorFlow Java API to use the GPU?

Please point me to any working example or provide any working hint. Could you? PS. By working I mean not only working, but also using the GPU if it is present.

lanking520 commented 3 years ago

@tmichniewski hi, we were able to use GPU to run inference. For the GPU use case, you need to make sure your operators and model are loaded on the GPU device. You can also try DJL, which wraps the TF Java package and provides some conveniences to automatically switch to GPU at runtime.

You can play around with our Jupyter notebook in Java: http://docs.djl.ai/jupyter/tensorflow/pneumonia_detection.html

If you have a GPU device and install CUDA 10.1, it will use the GPU out of the box without a single line of code changes.

tmichniewski commented 3 years ago

Hello @lanking520,

Referring to your statement "we were able to use GPU to run inference. For the GPU use case, you need to make sure your operators and model are loaded on the GPU device": it seems to contradict your next statement, "If you have a GPU device and install CUDA 10.1, it will use the GPU out of the box without a single line of code changes."

So far I have added setDevice("/GPU:0") to every graph operation, as in the code below, just to be 100% sure where the computation is executed: HelloTensorFlow_java_2.txt

But the results of my test computation on vectors of 100 million entries (so big enough), executed 1000 times, do not show any time difference between GPU and CPU. In fact the results are almost the same (6.0s. vs. 6.1s.).

On the contrary, when I executed the same test in Python in eager mode I got 4.5s. on GPU and 259s. on CPU, so the difference is huge and clearly visible. Therefore I wonder why, when using the Java API, I do not see the same difference. Could you explain it? python_test_2.txt

BTW - I am not comparing Python to Java or eager to graph execution. I am comparing GPU to CPU execution, where in Python I see a huge difference while in Java I see no difference.

PS. I do not want to run these DJL samples, which bring a lot of new elements and new changing parameters into the picture; in general this would only complicate my tests instead of simplifying them. Moreover, if DJL wraps the TF Java package, then it changes nothing as long as I cannot see that the TF Java package itself works well on GPU.

Maybe it is simply that the TF Java API is well optimized for deep learning computations but not for very simple vector operations.

tmichniewski commented 3 years ago

In general I think that there might be three possibilities:

  1. The first possibility is that I might be doing something wrong in my test program; however, I think you have had a chance to look at it. Anyway, if this is the case, then I simply ask for some support or just a reference to a simple example program, with vector operations rather than sophisticated neural network computations.
  2. The second option is that the TF Java API might not be using the GPU. In this case I would ask for some hints on how to force it to use the GPU.
  3. Finally, there might also be some bug in the API: it sees the GPU and tries to use it but silently falls back to CPU. In this case you will know best how to handle it.

So my question is: which situation are we dealing with? And how do we diagnose it?

tmichniewski commented 3 years ago

Hello @karllessard, referring to DeviceSpec.newBuilder().deviceType(...) - I cannot use this construction because the deviceType method is called on the Builder object, which is package-private and cannot be used outside of the org.tensorflow package. This is why I could not use it. PS. This is just a comment, because the code snippet you provided was perfectly valid, but only for you. :-)

zaleslaw commented 3 years ago

Mr. Tomasz, I have also faced a similar issue with the API (the method is missing at the operand level and only available on the builder). Do you have a chance to experiment with the previous version of the Java API (1.15, for example)? I found a very big difference for LeNet model training with and without GPU. It could be important to understand what was missed when we switched from 1.x to 2.x.

Agreed that DeviceSpec cannot be used directly in the API, but it could be used as a builder to build a correct device string (if that is required).

Alex

tmichniewski commented 3 years ago

Hello Alex @zaleslaw, the versions are: 1.15.0 - the TensorFlow 1 API, whose most recent Maven Repository release was in October 2019 (we skipped this from the very beginning as obsolete); 2.3.0 - the so-called Legacy API since last weekend (in Python and Java API+Scala, but this version did not work in Java/Scala on our machines with GPU compute capability 3.7 and 7.0 - it kept saying that the kernel image is incorrect and that we should build TF from sources); and 0.2.0 - the newest API, released in October 2020 (the Java API). I experimented on Azure Databricks clusters, but also on Azure Virtual Machines instantiated on request.

So far I have not had a single result with the Java API where I could say: OK, computing on GPU is X times faster. Not a single result.

We managed to see such results only in Python, never with the Java API. That is why I started wondering what is wrong. Because basically we should be able to observe that GPU processing is sometimes faster, just as we see in Python.

At the moment we focus only on this new API, as last week we spent a lot of time on this legacy version 2.3.0. We even started to analyse how to build it from sources.

Well - maybe there is the same issue as in 2.3.0 - I mean the Java API recognizes that the graphics card compute capability is too low and silently falls back to CPU.

In my opinion I managed to force the API to use the specified GPU. But apparently it is not being used. Maybe this fall-back-to-CPU logic should be guarded somehow, and if the requirement is to process on GPU and the GPU cannot be used (it is not present or has the wrong compute capability), then maybe an error should be thrown, not a warning or nothing. You know, the developer/user requests processing on GPU; if this is not possible, then "blue screen". Otherwise we end up in situations like this, where we discuss what might be going on.

lanking520 commented 3 years ago

On our test platform we benchmarked a ResNet50 image classification model; there is a small gap between Python GPU and Java GPU (you can find the issue I filed here). However, if you compare CPU (AWS EC2 C5.2xlarge) at around 40ms with GPU (AWS EC2 p3.2xlarge) at around 5.7ms, the GPU is in fact much faster than the CPU.

We use the TF 2.3.1 build with 0.2.0.

But it seems TF2 is doing something weird that causes everything to be slow. We also did some benchmarking of CPU performance and saw that TF1 CPU latency is about 2/3 of TF2's on the ResNet50 model with the Java API. But the same problem exists in Python too... so nothing to blame. :)

In fact, based on your comments:

"GPU or CPU. In fact the results are almost the same (6.0s. vs. 6.1s.)."

"when I executed the same test on Python in eager mode I got 4.5s. on GPU and 259s. on CPU"

It seems you ran everything on GPU with TF Java :).

tmichniewski commented 3 years ago

Referring to your last statement: well, I set up another cluster without GPU, so CPU only with a similar core class, and got 7.1s.

Then I ran this on a laptop with CPU only and also got similar results, around 6.6s. So these results in Java on CPU-only machines are much closer to Python on GPU.

But of course these are just comments on the results. Let us focus on facts. To get the facts I need some method to check where my processing is executed. Is it possible to turn on some logging where the API would tell with 100% certainty on which device it is executing?

You know what, I believe that in your test you got better results on GPU than on CPU. But you have a different machine than I do. Could you confirm that in my case it is impossible for the TF Java API to fall back to CPU for some reason, for example because of compute capability? The former Java API version 2.3.0 had problems with such compute capabilities. But the new version 0.2.0 might just fall back to CPU instead of issuing an error. This is how I interpret my test results, because the GPU and CPU results are almost the same.

Now, the question is how to diagnose it. So far it is far from intuitive.

tmichniewski commented 3 years ago

Is it possible that the TF Java API checks my Tesla V100 graphics card, decides that compute capability 7.0 is wrong or too low, and silently falls back to CPU?

Craigacp commented 3 years ago

TF Java uses the same placement engine and fallback logic as Python, because it all happens in the C layer. I'm trying to replicate your issue at the moment, but it triggered a full TF GPU recompile and it's taking a little while on the test VM I've got.

tmichniewski commented 3 years ago

Hello Adam @Craigacp, Have you got any results?

karllessard commented 3 years ago

Hi @tmichniewski , sorry if I keep asking, but can you please enable log device placement in your session and share the output logs with us? I think it is more reliable than comparing speed to know what has been placed on a CPU or GPU.

For example, I took your original code and changed try (Graph g = new Graph(); Session s = new Session(g)) {... to this:

ConfigProto config = ConfigProto.newBuilder(ConfigProto.getDefaultInstance()).setLogDevicePlacement(true).build();
try (Graph g = new Graph(); Session s = new Session(g, config)) { ...

and now it gives me the following output:

2020-11-05 09:08:44.176590: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-11-05 09:08:44.176822: I external/org_tensorflow/tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Hello TensorFlow 2.3.1
2020-11-05 09:08:44.406991: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-05 09:08:44.412857: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-05 09:08:44.418527: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-11-05 09:08:44.418553: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (xps): /proc/driver/nvidia/version does not exist
2020-11-05 09:08:44.419212: I external/org_tensorflow/tensorflow/core/common_runtime/direct_session.cc:360] Device mapping: no known devices.
Mul: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539650: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Mul: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
Mul_1: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539674: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Mul_1: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
Div: (Div): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539685: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Div: (Div): /job:localhost/replica:0/task:0/device:CPU:0
Add: (Add): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539694: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Add: (Add): /job:localhost/replica:0/task:0/device:CPU:0
ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539704: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:CPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539715: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539724: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539734: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_3: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539744: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_3: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_4: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539753: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_4: (Const): /job:localhost/replica:0/task:0/device:CPU:0
9900320.000000034
Time [s.] 0.16198432
Bye bye

Note that I'm running this on a CPU-only machine.

tmichniewski commented 3 years ago

Hello Karl @karllessard, Sure, I will test it tomorrow. PS. Thanks for this snippet, this is the kind of support I was hoping for. Tomek

tmichniewski commented 3 years ago

Hello Karl @karllessard,

I made two tests, both with vectors of 100 million elements and with the computation executed 1000 times. Here are the results: GPU.txt CPU.txt

Now - does it prove that in the first case the computation was executed on GPU? If so, then looking at the results (GPU 6.05s., CPU 5.74s.) I wonder why the CPU is even faster than the GPU, or at least why the GPU is not faster. What is wrong with this test? Maybe the computation is too simple? Maybe the vectors are too small? What makes the GPU version take as long as the CPU one?

OK, I know that TF was originally designed for bigger workloads, for example ML models. But ML computation consists of exactly such numerical computations, only at a much bigger scale. So is the scale the problem here? Maybe using the GPU generates additional overhead for passing the computation to the GPU? Maybe this consumes the advantages of using the GPU?

What is wrong with the TF Java API used as in this example?

BTW - we also noticed that TF is a few times faster than the popular NumPy in Python. But in Python we also observed that GPU computation was 4-5 times faster than CPU. So why, in the case of the Java API, are the times roughly the same (the time on GPU is more or less the same as on CPU)?

tmichniewski commented 3 years ago

Well, in fact in Python this test (the given expression executed 1000 times in eager mode on vectors of 1 million elements) took 0.23s. on GPU and 1.13s. on CPU. But when we increased the vector size to 100 million (as in the Java example), we got 4.50s. on GPU and 259.4s. on CPU. Isn't that proof that the GPU version is much faster? So why, on the same machine, do we get such strange results with the Java API?

Firstly, even if there is a bug in the reporting of GPU or CPU use, and under the cover it always uses the GPU (this is also possible), then why is the Java API slower than Python (6s. vs. 4.5s.)?

Moreover, when I ran these tests with the Java API 10000 times I got 8.0s. on GPU and 6.5s. on CPU.

How to interpret these results?

tmichniewski commented 3 years ago

Hello Adam @Craigacp, Did you manage to finish your test on this sample program?

saudet commented 3 years ago

Firstly, even if there is a bug in the reporting of GPU or CPU use, and under the cover it always uses the GPU (this is also possible), then why is the Java API slower than Python (6s. vs. 4.5s.)?

Make sure XLA is enabled, see issue https://github.com/tensorflow/java/issues/134#issuecomment-721538573

tmichniewski commented 3 years ago

How do I use XLA in the Java API? In Python this is simple: @tf.function(experimental_compile=True), according to: https://www.tensorflow.org/xla/tutorials/compile?hl=en
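
There is no @tf.function equivalent in the Java API, but one possible (untested) approach is to request XLA auto-clustering for a session through the generated proto options, assuming the GraphOptions and OptimizerOptions classes are available in org.tensorflow.proto.framework alongside ConfigProto; this is a sketch, not a confirmed recipe:

ConfigProto config = ConfigProto.newBuilder()
    .setGraphOptions(GraphOptions.newBuilder()
        .setOptimizerOptions(OptimizerOptions.newBuilder()
            // ON_1 asks the TF runtime to JIT-compile eligible op clusters with XLA
            .setGlobalJitLevel(OptimizerOptions.GlobalJitLevel.ON_1)))
    .build();
try (Graph g = new Graph(); Session s = new Session(g, config)) {
  // build and run the graph as before
}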

karllessard commented 3 years ago

Hi @tmichniewski , we can drill down a little deeper to continue our investigation, thanks for helping. What I would suggest now is to compare the performance when you stop fetching the outputs, to give us a better idea of the impact of memory transfers between CPU and GPU. Maybe Python has better optimizations for doing this; we'll see.

Can you please change your loop for something like this and try again on both GPU and CPU?

for (int i = 0; i < 1000; i++) {
  s.runner().addTarget(z).run();
}
karllessard commented 3 years ago

Also, like @Craigacp suggested previously, if your Python comparable is this script, then maybe you want to run in eager mode in Java as well. That would mean your code would look something like this:

import org.tensorflow.*;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.types.TFloat64;

import java.util.Arrays;

public class HelloTensorFlow {
  public static void main(String[] args) {
    System.out.println("Hello TensorFlow " + TensorFlow.version());

    double[] pi = new double[1_000_000];
    Arrays.fill(pi, 3.14); // Pi
    double[] e = new double[1_000_000];
    Arrays.fill(e, 2.72); // e
    Ops tf = Ops.create();

    // Tensor<TFloat64> aTensor = TFloat64.scalarOf(3.456);
    Constant<TFloat64> a = tf.constant(3.456);
    Constant<TFloat64> b = tf.constant(0.5);
    Constant<TFloat64> x = tf.constant(pi);
    Constant<TFloat64> y = tf.constant(e);

    long startTime = System.nanoTime();

    for (int i = 0; i < 1000; i++) {
      // x * y * a / x + b where a = 3.456 and b = 0.5
      Operand<TFloat64> z = tf.reduceSum(tf.math.add(tf.math.div(tf.math.mul(tf.math.mul(x, y), a), x), b), tf.constant(0));

      // Uncommenting this line will force additional I/O between GPU and CPU, testing both cases could be interesting
      // System.out.println(z.data().getDouble());
    }
    long stopTime = System.nanoTime();
    System.out.println("Time [s.] " + (double)(stopTime - startTime) / 1000_000_000);
  }
}
Craigacp commented 3 years ago

Running @karllessard's snippet in eager mode is definitely using the GPU, as it takes up all the GPU memory and errors out (because nothing frees the tensors all those operations create).

The Python version you posted uses fp32 and the Java one uses fp64, so that's not a fair comparison (CPUs don't slow down as much as GPUs when doing fp64 math).
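
For illustration, a small hypothetical change that would put the Java test on fp32 as well (variable names mirror the earlier snippets; it requires importing org.tensorflow.types.TFloat32):

float[] piF = new float[100_000_000];
Arrays.fill(piF, 3.14f);
float[] eF = new float[100_000_000];
Arrays.fill(eF, 2.72f);
Constant<TFloat32> x = tf.constant(piF); // TFloat32 instead of TFloat64
Constant<TFloat32> y = tf.constant(eF);
Constant<TFloat32> a = tf.constant(3.456f);
Constant<TFloat32> b = tf.constant(0.5f);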

Comparing the CPU and GPU in Java in graph mode I get roughly the same time for both of them, as with a simple matmul of two 10000x10000 matrices. The device placement fails if you give it a device that doesn't exist, and that's our view into the TF C library, so it might be doing something below the Java API level. The fact that the placement fails suggests it is building the nodes in the right place. Using CUDA_VISIBLE_DEVICES= on the CPU run doesn't make that slower, so it's not ignoring the CPU placement and putting things on the GPU anyway.

Turning off the fetching of the result out of the loop makes both CPU and GPU faster by roughly the same amount.

My recommendation would be to try running a real workload like ResNet inference with and without CUDA_VISIBLE_DEVICES= (note the space after the equals), as that turns on and off the GPU. I don't have time to do such an experiment at the moment, but I might get some time next week.

tmichniewski commented 3 years ago

Hello Karl @karllessard, referring to the first test with "s.runner().addTarget(z).run();": I ran it twice (in both cases I increased the size of the vectors to 100_000_000):

  1. First test, where I referenced the GraphOperation sum - the operations with a direct setDevice. I got the following results: FirstTestJava.txt FirstTestGPU.txt FirstTestCPU.txt As you can see, the GPU version (6.23s.) is even slightly slower than the CPU one (6.06s.). Moreover, when I ran this test on my laptop I got 7.02s. (CPU only).
  2. Second test, where I referenced the z Operand from eager mode. Then I got the following results: SecondTestJava.txt SecondTestGPU.txt SecondTestCPU.txt In this test the GPU time was 6.26s., the CPU time 6.26s., and the laptop time 6.80s. Generally the GPU is more or less the same as the CPU.
tmichniewski commented 3 years ago

Hello Karl @karllessard, referring to your next, longer test, where "Operand z" is instantiated inside the for loop: well, in this case on both my laptop and the Databricks machine with GPU I got an OutOfMemory error, even though the vectors were of size 1_000_000, while in my previous tests I used vectors of 100_000_000, so 100x bigger. OOM.txt I suspect it might be due to the fact that "z" is instantiated 1000 times inside the loop.

Referring to:

// Uncommenting this line will force additional I/O between GPU and CPU, testing both cases could be interesting
// System.out.println(z.data().getDouble());

I think this test would measure the time of copying one double value (8 bytes) plus the cost of println. In the context of copying data between GPU and CPU memory, copying 8 bytes is nothing really serious. That is why the test expression had a final reduce, to avoid copying vector data between processors.

tmichniewski commented 3 years ago

Hello Adam @Craigacp, I don't want to do bigger tests, because bigger and more complicated tests mean more changing parameters that influence the results. I want to verify whether it makes sense to use the TensorFlow Java API to perform vector operations, and what I expect is to see some improvement when running on GPU (as in the Python case). But in the case of the Java API, so far my expectations have not been confirmed. All these tests seem to prove only that my computation is indeed executed on GPU if one is present, but the execution time is more or less the same as on CPU. And I still think this is strange.

Craigacp commented 3 years ago

Larger tests will allow us to rule out misconfiguration of our compiled version of libtensorflow vs inefficiencies in the way the Java API passes in results to the graphs or the way the graphs/eager functions are called. Running a few batches of inference of a reasonably sized model should amortize all the TF Java calling overheads, and hopefully show a speedup, which would indicate that it's a problem with how we call the libtensorflow C library (or how we pass in arguments). If it doesn't show a speedup, then the way we build libtensorflow might be causing some problems.

At the moment it's hard to say what the root cause of the inefficiency is, so more data will help us narrow it down.

karllessard commented 3 years ago

Ok, I did another test; the results confirm what we observed previously, where running on a GPU takes as long as on the CPU (even longer in this case, in fact). I ran the previous graph but within a loop of 5M iterations. See the following code and the results below:

package mypackage;

import org.tensorflow.*;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.op.core.ReduceSum;
import org.tensorflow.proto.framework.ConfigProto;
import org.tensorflow.proto.framework.GPUOptions;
import org.tensorflow.types.TFloat64;

import java.util.Arrays;
import org.tensorflow.types.TInt32;

public class GPUTest {
  public static void main(String[] args) {
    System.out.println("Hello TensorFlow " + TensorFlow.version());

    double[] pi = new double[1_000_000];
    Arrays.fill(pi, 3.14); // Pi
    double[] e = new double[1_000_000];
    Arrays.fill(e, 2.72); // e

    ConfigProto config = ConfigProto.newBuilder(ConfigProto.getDefaultInstance())
        .setLogDevicePlacement(true)
        .build();

    try (Graph g = new Graph(); Session s = new Session(g, config)) {
      Ops tf = Ops.create(g);

      Output<?> loopInput = tf.placeholder(TInt32.DTYPE).output();

      Output<?> loopOutputs[] = g.whileLoop(
           new Output<?>[] { loopInput },
          (cgraph, inputs, outputs) -> {
            Ops ctf = Ops.create(cgraph);
            outputs[0] = ctf.math.less((Output<TInt32>)inputs[0], ctf.constant(1_000_000)).z();
          },
          (bgraph, inputs, outputs) -> {
            Ops btf = Ops.create(bgraph);

            Constant<TFloat64> a = btf.constant(3.456);
            Constant<TFloat64> b = btf.constant(0.5);
            Constant<TFloat64> x = btf.constant(pi);
            Constant<TFloat64> y = btf.constant(e);

            ReduceSum<TFloat64> r = btf.reduceSum(
                btf.math.add(
                    btf.math.div(
                        btf.math.mul(
                            btf.math.mul(x, y), a),
                        x),
                    b),
                btf.constant(0)
            );
            outputs[0] = btf.withControlDependencies(Arrays.asList(r)).math.add((Operand)inputs[0], btf.constant(1)).asOutput();
          },
          "test_gpu_loop"
      );
      long startTime = System.nanoTime();
      try (Tensor<TInt32> c = TInt32.scalarOf(0);
        Tensor<?> tensor = s.runner()
            .feed(loopInput, c)
            .fetch(loopOutputs[0])
            .run()
            .get(0)) {}
      long stopTime = System.nanoTime();
      System.out.println("Time [s.] " + (double)(stopTime - startTime) / 1000_000_000);
    }

    System.out.println("Bye bye");
  }
}
CPU: 11.65383011 secs
GPU: 18.522389901 secs

Now, we can confirm that most of that time is spent entirely in the native TF runtime library, which has no interaction with the JVM. The device placement logs also showed me that the ops were placed on the expected device in both cases. So it might be a misconfiguration of the native library build that prevents the GPU kernels from operating at full speed. @saudet already mentioned that XLA support is not enabled by default, but there must be something else missing.

tmichniewski commented 3 years ago

Hello Karl @karllessard,

Is it possible that even though the operations were placed on GPU, they are not executed there due to, for example, a wrong (too low) compute capability of the GPU?

My understanding of this whole situation is that the Java program is doing its best, and we confirmed that it should use the GPU, but at the moment the computation times show that something weird is happening: apparently the computation is done on CPU, or some other problem extends the computation time on GPU. This questions the point of using the TF Java API until the problem is fixed.

zaleslaw commented 3 years ago

I'll share some results from my side: I ran the following example on my local machine with an NVIDIA GeForce GTX 1650 card, with minor changes:

Also I set up the session with the following setting:

session = new Session(graph, ConfigProto.newBuilder().setGpuOptions(GPUOptions.newBuilder().setAllowGrowth(true).build()).build());

For 1 epoch:

GPU: 13367 ± 140 ms, CPU: 56156 ± 420 ms, speedup ~4.2

For 10 epochs:

GPU: 106852 ± 1000 ms, CPU: 560519 ± 6900 ms, speedup ~5.24

For 20 epochs:

GPU: 214016 ± 1800 ms, CPU: 1127356 ± 12500 ms, speedup ~5.26

P.S. The speedup for this GPU on 1.15 was closer to 7.2x on LeNet-like models.