tensorflow / java

Java bindings for TensorFlow

TF Java 0.3.1 shows a performance degradation on GPU compared to v0.2.0 when loading Hugging Face models #325

Open wolliq opened 3 years ago

wolliq commented 3 years ago

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests, and build/installation issues on GitHub.

System information

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
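For the Java bindings specifically, the version of the bundled native runtime can be printed with TensorFlow.version(). A minimal example:

```java
import org.tensorflow.TensorFlow;

public class PrintVersion {
    public static void main(String[] args) {
        // Prints the native TensorFlow runtime version bundled with TF Java,
        // e.g. 2.3.x for TF Java 0.2.0 and 2.4.x for TF Java 0.3.1.
        System.out.println(TensorFlow.version());
    }
}
```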

Describe the current behavior
Using TF Java bindings 0.3.1 degrades performance by a factor of 3x on GPU compared to version 0.2.0.

Describe the expected behavior
Equal, or ideally better, performance when migrating to newer versions.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem. Performance tests are currently ongoing to validate the issue; we'll update with more info ASAP. https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/test/scala/com/johnsnowlabs

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

saudet commented 3 years ago

I've heard stories about cuDNN 8.x slowing things down. Could you try with cuDNN 7.x?

Craigacp commented 3 years ago

Do you know if it happens during tensor creation/access, model inference (and is that model trained in Python or Java?), training, or somewhere else? Also, is this across multiple different models or just a single one? Finally, is this slowdown observed for a single run, or does it persist after the JVM has warmed up the codepath?
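For example, a benchmark along these lines would separate JVM warm-up from the measured runs (the model path and the "input_ids"/"logits" signature names below are placeholders, not taken from spark-nlp):

```java
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.ndarray.Shape;
import org.tensorflow.types.TInt32;

public class WarmupBenchmark {

    public static void main(String[] args) {
        // Placeholder model path; substitute your actual SavedModel directory.
        try (SavedModelBundle model = SavedModelBundle.load("/path/to/saved_model", "serve");
             TInt32 input = TInt32.tensorOf(Shape.of(1, 128))) {

            Session session = model.session();

            // Warm-up runs: keep JIT compilation and one-time native
            // initialization out of the measurement.
            for (int i = 0; i < 10; i++) {
                runOnce(session, input);
            }

            int runs = 100;
            long start = System.nanoTime();
            for (int i = 0; i < runs; i++) {
                runOnce(session, input);
            }
            double avgMs = (System.nanoTime() - start) / 1_000_000.0 / runs;
            System.out.println("avg inference: " + avgMs + " ms");
        }
    }

    private static void runOnce(Session session, TInt32 input) {
        // Placeholder tensor names; use your model's actual signature names.
        for (Tensor t : session.runner().feed("input_ids", input).fetch("logits").run()) {
            t.close();
        }
    }
}
```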

maziyarpanahi commented 3 years ago

Hi @Craigacp @saudet

I can answer some of the questions based on my personal tests:

For the cuDNN versions: I personally tested on the Databricks 8.x runtimes, which come with CUDA 11.x, and on Google Colab. (I'm not sure of the exact cuDNN version on these two platforms, but I'll see if I can find a local GPU server to test different cuDNN versions. TensorFlow 2.5 ships with an upgraded cuDNN, so I will test that once a snapshot is available here.)

karllessard commented 3 years ago

@maziyarpanahi , @wolliq , I suppose you work with String tensors, right? Something that has changed drastically between TF Java 0.2.0 and 0.3.1 is how string tensors are allocated by the TensorFlow runtime (since version 2.4.0). In your experiments, does the model inference also include the allocation of your input tensors?

If so, it would be great if you could isolate just the creation of the tensors (without actually running the model) and see whether that step alone is 3x slower than in 0.2.0.
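Something like this would time only the string-tensor allocation, with no session involved (batch size and contents here are arbitrary):

```java
import java.util.Arrays;

import org.tensorflow.ndarray.NdArrays;
import org.tensorflow.types.TString;

public class StringTensorAllocationBench {
    public static void main(String[] args) {
        // Arbitrary batch of strings; adjust count/length to match your workload.
        String[] sentences = new String[1024];
        Arrays.fill(sentences, "some example input text");

        // Time only the allocation and copy of the string tensor, without
        // running any model, to see if this step alone explains the 3x.
        int iterations = 100;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            try (TString t = TString.tensorOf(NdArrays.vectorOfObjects(sentences))) {
                // tensor is created, then immediately released
            }
        }
        System.out.println(iterations + " allocations: "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```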

maziyarpanahi commented 3 years ago

Thanks @karllessard for the reply.

We only use TInt32 tensors for those models (token ids, segment ids, and mask ids). I think the only place we use String tensors is inside the Universal Sentence Encoder. I haven't timed that between 0.2.x and 0.3.x yet.

Any suggestions for TInt32 tensors? If I can find a machine with a compatible GPU, I will do some profiling to see where it spends more time compared to 0.2.x. A sketch of what I'd measure is below.
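For the profiling, something along these lines would time just the TInt32 tensor creation (the batch shape and values here are arbitrary):

```java
import org.tensorflow.ndarray.StdArrays;
import org.tensorflow.types.TInt32;

public class IntTensorAllocationBench {
    public static void main(String[] args) {
        // A batch of token ids shaped [batch, sequenceLength], as typically fed
        // to a transformer model (segment ids and mask ids would look the same).
        int[][] tokenIds = new int[32][128];

        int iterations = 1000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            try (TInt32 t = TInt32.tensorOf(StdArrays.ndCopyOf(tokenIds))) {
                // allocate, copy, release
            }
        }
        System.out.println(iterations + " TInt32 allocations: "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```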

karllessard commented 3 years ago

Well, something else major that has changed is that the tensor memory (of all types) is now automatically mapped in the JVM, while in 0.2.0 that only happened when calling tensor.data().

This mapping should be pretty fast, though; I'd still be curious to know whether the latency you observed happens at tensor allocation or at session run.
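To illustrate the difference, here is a minimal sketch of the 0.3.x API, where the tensor itself is an NdArray whose native memory is already mapped, so reads and writes go through it directly:

```java
import org.tensorflow.ndarray.Shape;
import org.tensorflow.types.TInt32;

public class MappedTensorAccess {
    public static void main(String[] args) {
        // In 0.3.x the tensor is itself an NdArray: its native memory is
        // mapped into the JVM at creation, so it can be read/written directly.
        try (TInt32 tensor = TInt32.tensorOf(Shape.of(2, 3))) {
            tensor.setInt(42, 0, 1);              // write at coordinates [0, 1]
            System.out.println(tensor.getInt(0, 1));
        }
        // In 0.2.0 the equivalent access went through an explicit mapping step,
        // tensor.data(), which was only performed on demand.
    }
}
```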

maziyarpanahi commented 3 years ago

Hi @karllessard

I have some updates:

Many thanks again

karllessard commented 3 years ago

Ok, it looks like the latency is not happening in TF Java but in TensorFlow itself. A quick search turned up a few latency issues also observed by non-Java users between TF 2.4 (TF Java 0.3.1) and TF 2.3 (TF Java 0.2.0), like this one:

https://github.com/tensorflow/tensorflow/issues/46515

I'd suggest taking a look to see whether their investigations lead to the same issue you are facing.

Also, we'll soon migrate the current snapshots to TF 2.5; I don't know if that will fix it, but it would be worth trying once it's done.