microsoft / tensorflow-directml-plugin

DirectML PluggableDevice plugin for TensorFlow 2
Apache License 2.0

Multiple OpKernel registrations error #221

Closed: MachaLvl99 closed this 2 years ago

MachaLvl99 commented 2 years ago

Hello all! I've run into a problem, and the only cause I can think of is a potential issue with this plugin. Keep in mind I know very little about how to identify what's causing this or where it's coming from. I've actually been trying to learn how to use tensorflow/keras, so debugging a system error while barely knowing TensorFlow isn't easy.

ISSUE: When I ran a program I was working on, it gave me an error about multiple OpKernel registrations. When I pulled basic code from tensorflow's site (https://www.tensorflow.org/guide/gpu), I received the same error. Code & full output below:

```python
# %%
tf.debugging.set_log_device_placement(True)

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)
# (tensorflow was imported as tf in an earlier notebook cell)
```

```
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2022-07-05 21:07:23.825481: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-05 21:07:23.828630: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (NVIDIA GeForce RTX 3070)
2022-07-05 21:07:24.453938: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-07-05 21:07:24.453984: W tensorflow/core/common_runtime/pluggable_device/pluggable_device_bfc_allocator.cc:28] Overriding allow_growth setting because force_memory_growth was requested by the device.
2022-07-05 21:07:24.454008: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6838 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: )
```

```
InvalidArgumentError                      Traceback (most recent call last)
Input In [2], in <cell line: 6>()
      4 a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
      5 b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
----> 6 c = tf.matmul(a, b)
      8 print(c)

File ~/anaconda3/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb

File ~/anaconda3/lib/python3.9/site-packages/tensorflow/python/framework/ops.py:7164, in raise_from_not_ok_status(e, name)
   7162 def raise_from_not_ok_status(e, name):
   7163   e.message += (" name: " + name if name is not None else "")
-> 7164   raise core._status_to_exception(e) from None

InvalidArgumentError: Multiple OpKernel registrations match NodeDef at the same priority '{{node MatMul}}': 'op: "MatMul" device_type: "GPU" constraint { name: "T" allowed_values { list { type: DT_FLOAT } } }' and 'op: "MatMul" device_type: "GPU" constraint { name: "T" allowed_values { list { type: DT_FLOAT } } }' [Op:MatMul]
```

NOTES: I'm running this in Ubuntu on WSL via VS Code (the window is attached as a remote container), using its embedded jupyter notebook. If I use a Docker image or a remote container built through VS Code, everything works as expected; this error only appears within a WSL container. What's strange is that if I DON'T use VS Code, the terminal won't detect my GPU at all when I just run basic ipython. So far the GPU has only been discoverable from a python 3.9 environment in VS Code's jupyter notebook; no other python environment seems to detect it, no matter which terminal I access it from.

I'm not sure what to do or how to approach this. I'm so unfamiliar with this stack that I can't tell whether the issue is with the directml plugin or with something on my end, so I apologize if this isn't an error with the directml-plugin. I figured I'd post the issue here since this project is pretty new, at least compared to the tools I'm used to dealing with; then again, knowing how little I actually know, I could have easily missed something somewhere. In the meantime, should I just stick to running regular containers? If so, how am I supposed to have the container detect my GPU, or rather, how do I set up an environment that reflects my PC's capabilities? The check I've been using to see whether a given environment detects the GPU is below.
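This is just the standard TensorFlow device query, nothing plugin-specific:

```python
import tensorflow as tf

# With tensorflow-directml-plugin loaded, the DirectML adapter should appear
# here as a GPU-typed PluggableDevice; an empty list means no GPU was found.
print(tf.config.list_physical_devices("GPU"))
```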

PatriceVignola commented 2 years ago

The `Multiple OpKernel registrations match NodeDef at the same priority` error is a known issue that we are currently investigating: the stock `tensorflow` package ships its own GPU kernels, which collide with the kernels the plugin registers for the same ops at the same priority. In the meantime, to avoid this issue, we suggest using the `tensorflow-cpu` pip package instead of `tensorflow`, since tensorflow-directml-plugin doesn't need the latter to work.
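Concretely, the swap looks something like this in a pip-based environment (version pins, if any, depend on your setup):

```
pip uninstall -y tensorflow
pip install tensorflow-cpu tensorflow-directml-plugin
```

With `tensorflow-cpu`, only the plugin's DirectML kernels register against the GPU device, so the duplicate-registration conflict should no longer occur.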

PatriceVignola commented 2 years ago

Closing this as a duplicate of https://github.com/microsoft/tensorflow-directml-plugin/issues/216. Please follow the other thread for any updates on this issue.