Open rnett opened 3 years ago
An extension: TensorFlow has mechanisms to automatically JIT-compile sections of functions; see https://www.tensorflow.org/xla#auto-clustering. We should look into how to enable this from Java, and document it if the Python methods don't work. I'm not seeing anything Python-specific involved, so the documented methods most likely work here too. There is also a session-specific option (which is overridden by the environment variable): https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto#L223
There are option setters here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/c/c_api_experimental.h#L62, but we aren't currently mapping them.
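For reference, the session-specific option mentioned above would look roughly like the sketch below from Java, using the generated `ConfigProto` classes that ship with tensorflow-core-api. Whether the native runtime actually honors `global_jit_level` when set this way is exactly what needs verifying; the model path and tag are placeholders.

```java
import org.tensorflow.SavedModelBundle;
import org.tensorflow.proto.framework.ConfigProto;
import org.tensorflow.proto.framework.GraphOptions;
import org.tensorflow.proto.framework.OptimizerOptions;

public class XlaJitConfig {
  public static void main(String[] args) {
    // Session-level equivalent of OptimizerOptions.global_jit_level from
    // config.proto. Note: the TF_XLA_FLAGS environment variable, if set,
    // overrides this per-session setting.
    ConfigProto config = ConfigProto.newBuilder()
        .setGraphOptions(GraphOptions.newBuilder()
            .setOptimizerOptions(OptimizerOptions.newBuilder()
                .setGlobalJitLevel(OptimizerOptions.GlobalJitLevel.ON_1)))
        .build();

    // "/tmp/model" and "serve" are placeholders for a real export.
    try (SavedModelBundle bundle = SavedModelBundle.loader("/tmp/model")
        .withTags("serve")
        .withConfigProto(config)
        .load()) {
      // run inference with bundle.session() ...
    }
  }
}
```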
Hi @rnett
I am hitting this issue coming from here: https://github.com/tensorflow/java/blob/9cfea866973cc6c05ba8cb9cb11c023124a5c28d/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/ConcreteFunction.java#L464
However, I am not seeing this on CPU; it only happens when I use TensorFlow Java (0.4.0) on a GPU.
The warning:
2022-09-10 21:25:11.745187: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at xla_ops.cc:248 : NOT_FOUND: could not find registered platform with id: 0x7f0297ff8f14
The full error:
2022-09-10 21:24:53.906018: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /tmp/export_wav2vec2-base
2022-09-10 21:24:54.042842: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:107] Reading meta graph with tags { serve }
2022-09-10 21:24:54.042908: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:148] Reading SavedModel debug info (if present) from: /tmp/export_wav2vec2-base
2022-09-10 21:24:54.043018: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-10 21:24:59.210550: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9758 MB memory: -> device: 0, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:04:00.0, compute capability: 6.0
2022-09-10 21:24:59.213219: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9758 MB memory: -> device: 1, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:05:00.0, compute capability: 6.0
2022-09-10 21:24:59.215647: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 9758 MB memory: -> device: 2, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:06:00.0, compute capability: 6.0
2022-09-10 21:24:59.218087: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 9758 MB memory: -> device: 3, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-09-10 21:24:59.784080: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:228] Restoring SavedModel bundle.
2022-09-10 21:25:01.172674: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:212] Running initialization op on SavedModel bundle at path: /tmp/export_wav2vec2-base
2022-09-10 21:25:01.807065: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:301] SavedModel load for tags { serve }; Status: success: OK. Took 7901067 microseconds.
2022-09-10 21:25:02.839312: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9758 MB memory: -> device: 0, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:04:00.0, compute capability: 6.0
2022-09-10 21:25:02.840745: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9758 MB memory: -> device: 1, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:05:00.0, compute capability: 6.0
2022-09-10 21:25:02.842146: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 9758 MB memory: -> device: 2, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:06:00.0, compute capability: 6.0
2022-09-10 21:25:02.843535: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 9758 MB memory: -> device: 3, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-09-10 21:25:06.693928: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9758 MB memory: -> device: 0, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:04:00.0, compute capability: 6.0
2022-09-10 21:25:06.694915: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9758 MB memory: -> device: 1, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:05:00.0, compute capability: 6.0
2022-09-10 21:25:06.695728: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 9758 MB memory: -> device: 2, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:06:00.0, compute capability: 6.0
2022-09-10 21:25:06.696504: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 9758 MB memory: -> device: 3, name: NVIDIA Tesla P100-PCIE-12GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-09-10 21:25:10.978737: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8401
2022-09-10 21:25:11.745187: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at xla_ops.cc:248 : NOT_FOUND: could not find registered platform with id: 0x7f0297ff8f14
[error] org.tensorflow.exceptions.TensorFlowException: 2 root error(s) found.
[error] (0) NOT_FOUND: could not find registered platform with id: 0x7f0297ff8f14
[error] [[{{function_node __inference_serving1_9887}}{{node wav2vec2/encoder/pos_conv_embed/conv/PartitionedCall}}]]
[error] [[StatefulPartitionedCall/_847]]
[error] (1) NOT_FOUND: could not find registered platform with id: 0x7f0297ff8f14
[error] [[{{function_node __inference_serving1_9887}}{{node wav2vec2/encoder/pos_conv_embed/conv/PartitionedCall}}]]
[error] 0 successful operations.
[error] 0 derived errors ignored.
[error] at org.tensorflow.internal.c_api.AbstractTF_Status.throwExceptionIfNotOK(AbstractTF_Status.java:101)
[error] at org.tensorflow.Session.run(Session.java:850)
[error] at org.tensorflow.Session.access$300(Session.java:82)
[error] at org.tensorflow.Session$Runner.runHelper(Session.java:552)
[error] at org.tensorflow.Session$Runner.runNoInit(Session.java:499)
[error] at org.tensorflow.Session$Runner.run(Session.java:495)
[error] at Main$.delayedEndpoint$Main$1(Main.scala:85)
[error] at Main$delayedInit$body.apply(Main.scala:12)
[error] at scala.Function0.apply$mcV$sp(Function0.scala:39)
[error] at scala.Function0.apply$mcV$sp$(Function0.scala:39)
[error] at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
[error] at scala.App.$anonfun$main$1$adapted(App.scala:80)
[error] at scala.collection.immutable.List.foreach(List.scala:431)
[error] at scala.App.main(App.scala:80)
[error] at scala.App.main$(App.scala:78)
[error] at Main$.main(Main.scala:12)
[error] at Main.main(Main.scala)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] at java.lang.reflect.Method.invoke(Method.java:498)
Is there a workaround to disable (or enable) XLA via the TF Session in Java, or some other way around this? (It makes the model unusable on GPU.)
PS: thanks for documenting this in the code and here.
Hi @maziyarpanahi, how are you using XLA? That function is private because it does not work (because of this issue), and afaik there's intentionally no other way of using XLA with TF/java because of this.
For any workarounds or fixes to enable you to use XLA on GPU from Java, you'd want to look at https://github.com/tensorflow/tensorflow/issues/50458
Hi @rnett
Thanks for your quick response. I am not actually using XLA explicitly in either Python or Java; I think the model I am exporting as a SavedModel uses it (it's Wav2Vec2). The error is very similar to the comments in the source code and the issue description, so I assumed the model itself might be using XLA.
Simply loading and running inference with this model works fine on CPU, but the moment I use a GPU it fails with that error:
val model = SavedModelBundle.load(folder, "serve")
// it fails on GPU only during prediction
There is a sample code here (in build.sbt you can select the dependency for CPU or GPU) just in case: https://github.com/maziyarpanahi/wav2vec-tensorflow
Ah, that sounds right. Unfortunately there is nothing we can do on our end for this; it's due to the TensorFlow bug I linked above. You could try rewriting the proto, but I'm not sure exactly how you would go about that.
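A very rough sketch of the "rewrite the proto" idea: parse `saved_model.pb` with the generated protobuf classes, strip the XLA-related attributes from every node and function, and write the model back. The attribute names (`_XlaMustCompile`, `_noinline`) and the paths are assumptions taken from this thread; I have not verified that this actually avoids the NOT_FOUND error, so treat it as a starting point only.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.tensorflow.proto.framework.FunctionDef;
import org.tensorflow.proto.framework.GraphDef;
import org.tensorflow.proto.framework.MetaGraphDef;
import org.tensorflow.proto.framework.NodeDef;
import org.tensorflow.proto.framework.SavedModel;

public class StripXlaAttrs {
  // Attribute names are assumptions based on the discussion in this thread.
  private static final String[] XLA_ATTRS = {"_XlaMustCompile", "_noinline"};

  public static void main(String[] args) throws IOException {
    // Placeholder path to the exported model's saved_model.pb.
    SavedModel model;
    try (FileInputStream in = new FileInputStream("/tmp/export/saved_model.pb")) {
      model = SavedModel.parseFrom(in);
    }
    SavedModel.Builder modelB = model.toBuilder();
    for (MetaGraphDef.Builder mg : modelB.getMetaGraphsBuilderList()) {
      GraphDef.Builder g = mg.getGraphDefBuilder();
      // Top-level graph nodes.
      for (NodeDef.Builder n : g.getNodeBuilderList()) {
        for (String a : XLA_ATTRS) n.removeAttr(a);
      }
      // Functions in the library carry their own attrs and NodeDefs.
      for (FunctionDef.Builder f : g.getLibraryBuilder().getFunctionBuilderList()) {
        for (String a : XLA_ATTRS) f.removeAttr(a);
        for (NodeDef.Builder n : f.getNodeDefBuilderList()) {
          for (String a : XLA_ATTRS) n.removeAttr(a);
        }
      }
    }
    try (FileOutputStream out = new FileOutputStream("/tmp/export/saved_model.pb")) {
      modelB.build().writeTo(out);
    }
  }
}
```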
@rnett Would it be possible to enable it by setting environment variables in the global scope?
No idea, I haven't been working on this project in quite some time.
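For what it's worth, the documented process-wide switch is the TF_XLA_FLAGS environment variable, which the native runtime reads at startup, so in principle it can be set globally before the JVM launches. Whether it helps with (or works around) this particular error is untested; the jar name below is a placeholder.

```shell
# Auto-clustering is controlled process-wide by TF_XLA_FLAGS, read by the
# native TensorFlow runtime at startup (it overrides the per-session setting).
#   --tf_xla_auto_jit=2    enable auto-clustering
#   --tf_xla_auto_jit=-1   should correspond to OFF, matching
#                          GlobalJitLevel in config.proto (assumption)
export TF_XLA_FLAGS="--tf_xla_auto_jit=2"
echo "TF_XLA_FLAGS=$TF_XLA_FLAGS"
# java -jar your-tf-app.jar   # placeholder jar name
```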
This is mostly a reminder for myself, but also documentation if someone else ends up doing it.
After reading through the Python code (see here), I found that to force function JITing you need to set the `_XlaMustCompile` and `_noinline` attributes to `true`, as is done here. However, this results in an error. I didn't run it down at the time, but it was reported in https://github.com/tensorflow/tensorflow/issues/50458. Once that is fixed, we should try this again.