microsoft / tensorflow-directml-plugin

DirectML PluggableDevice plugin for TensorFlow 2
Apache License 2.0

RNNs only work with layer-wrapped cells on AMD GPU #220

Open Dbenjamy opened 2 years ago

Dbenjamy commented 2 years ago

When running the example RNN notebook from TensorFlow, I got the following error:

NotFoundError: Exception encountered when calling layer "lstm_1" (type LSTM).

Could not find device for node: {{node CudnnRNN}} = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", is_training=true, rnn_mode="lstm", seed=0, seed2=0]
All kernels registered for op CudnnRNN:
  <no registered kernels>
 [Op:CudnnRNN]

Call arguments received by layer "lstm_1" (type LSTM):
  • inputs=tf.Tensor(shape=(20, 10, 50), dtype=float32)
  • mask=None
  • training=None
  • initial_state=None

After looking around, I found that Microsoft's installation tutorial mentions this in a note in step 5:

! Note If your training scripts hardcode the device string to something other than "GPU", that might throw errors.
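As a sanity check that the plugin's device really is exposed under the generic "GPU" string (a minimal sketch, not from the original report; the arithmetic is just a placeholder):

import tensorflow as tf

# DirectML adapters are registered under the generic "GPU" device type,
# not a vendor-specific one, so this should list the DML device
print(tf.config.list_physical_devices('GPU'))

with tf.device('/GPU:0'):  # generic device string works with the plugin
    x = tf.constant([1.0, 2.0]) * 2.0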

The notebook example says that built-in layers like keras.layers.LSTM(*args) use CuDNN kernels by default, so I used wrapped-cell layers instead, keras.layers.RNN(keras.layers.LSTMCell(*args), *args), which worked just fine and uses my GPU.

It looks like the built-in layers look for a compatible CuDNN GPU, and since I have an AMD GPU, they fail. The wrapped-cell layers don't make that assumption, so the tensorflow-directml-plugin is able to work as intended.
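A minimal sketch of that wrapped-cell setup (layer sizes and input shape are illustrative, not from the notebook):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(10, 50)),                  # (timesteps, features)
    # RNN(LSTMCell) uses the generic kernel, so no CudnnRNN op is emitted
    keras.layers.RNN(keras.layers.LSTMCell(64)),
    keras.layers.Dense(10),
])
model.compile(optimizer='adam', loss='mse')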

I was also able to get the built-in layer to use the generic GPU kernel by making it fail the requirements for CuDNN (e.g. setting activation='sigmoid' instead of tanh), and it then used my GPU through the directml-plugin. That seems like a strange workaround, though, since the point of built-in layers is to be simple and I might want the default configuration. I wasn't able to find an option that makes the built-in layers use a generic GPU kernel, aside from making them fail the criteria; if there is a better way, feel free to let me know.
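For reference, changing any of the arguments the Keras RNN guide lists for the CuDNN fast path (activation='tanh', recurrent_activation='sigmoid', recurrent_dropout=0, unroll=False, use_bias=True) drops the layer back to the generic kernel; a sketch:

from tensorflow import keras

# A non-default activation fails the CuDNN criteria, forcing the generic kernel
lstm = keras.layers.LSTM(64, activation='sigmoid')

# The default configuration would select the CuDNN kernel and raise the
# CudnnRNN error above on a DirectML device:
# lstm = keras.layers.LSTM(64)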

Do you know if there are plans to update the built-in layers or add a workaround, or is the plan to use wrapped-cell layers as the main workaround?

maggie1059 commented 2 years ago

Thanks for bringing this to our attention! We'll look into a way to avoid this default behavior for pluggable devices and follow up here with any updates.

enzomich commented 2 years ago

@Dbenjamy I managed to disable the use of CuDNN with the sample code on the same page, under "Using CuDNN kernels when available", by changing the line model = build_model(allow_cudnn_kernel=True) to model = build_model(allow_cudnn_kernel=False). Unfortunately, at the subsequent call to model.fit(...), the program dies after the first iteration with another error:

PS Z:\Projects\Tensorflow-test\mnist-rnn> .\main.py
2022-09-16 17:42:24.808496: I tensorflow/c/logging.cc:34] Successfully opened dynamic library C:\Users\-\AppData\Roaming\Python\Python310\site-packages\tensorflow-plugins/directml/directml.0de2b4431c6572ee74152a7ee0cd3fb1534e4a95.dll
2022-09-16 17:42:24.809742: I tensorflow/c/logging.cc:34] Successfully opened dynamic library dxgi.dll
2022-09-16 17:42:24.813393: I tensorflow/c/logging.cc:34] Successfully opened dynamic library d3d12.dll
2022-09-16 17:42:24.967021: I tensorflow/c/logging.cc:34] DirectML device enumeration: found 1 compatible adapters.
2022-09-16 17:42:26.720466: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-16 17:42:26.721149: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (Intel(R) HD Graphics 520)
2022-09-16 17:42:26.807836: I tensorflow/c/logging.cc:34] Successfully opened dynamic library Kernel32.dll
2022-09-16 17:42:26.809218: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-09-16 17:42:26.809510: W tensorflow/core/common_runtime/pluggable_device/pluggable_device_bfc_allocator.cc:28] Overriding allow_growth setting because force_memory_growth was requested by the device.
2022-09-16 17:42:26.810456: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6997 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2022-09-16 17:42:26.933651: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-09-16 17:42:26.933969: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6997 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2022-09-16 17:42:26.936880: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-09-16 17:42:28.883714: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
  1/938 [..............................] - ETA: 42:12 - loss: 2.7705 - accuracy: 0.0469Traceback (most recent call last):
  File "Z:\Projects\Tensorflow-test\mnist-rnn\main.py", line 51, in <module>
    model.fit(
  File "C:\Users\-\AppData\Roaming\Python\Python310\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Program Files\Python310\lib\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node 'gradient_tape/sequential/rnn/while/gradients/sequential/rnn/while/lstm_cell/mul_2_grad/BroadcastGradientArgs' defined at (most recent call last):
    File "Z:\Projects\Tensorflow-test\mnist-rnn\main.py", line 51, in <module>
      model.fit(
    File "C:\Users\-\AppData\Roaming\Python\Python310\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\-\AppData\Roaming\Python\Python310\site-packages\keras\engine\training.py", line 1409, in fit
      tmp_logs = self.train_function(iterator)
    File "C:\Users\-\AppData\Roaming\Python\Python310\site-packages\keras\engine\training.py", line 1051, in train_function
      return step_function(self, iterator)
    File "C:\Users\-\AppData\Roaming\Python\Python310\site-packages\keras\engine\training.py", line 1040, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\-\AppData\Roaming\Python\Python310\site-packages\keras\engine\training.py", line 1030, in run_step
      outputs = model.train_step(data)
    File "C:\Users\-\AppData\Roaming\Python\Python310\site-packages\keras\engine\training.py", line 893, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "C:\Users\-\AppData\Roaming\Python\Python310\site-packages\keras\optimizers\optimizer_v2\optimizer_v2.py", line 537, in minimize
      grads_and_vars = self._compute_gradients(
    File "C:\Users\-\AppData\Roaming\Python\Python310\site-packages\keras\optimizers\optimizer_v2\optimizer_v2.py", line 590, in _compute_gradients
      grads_and_vars = self._get_gradients(tape, loss, var_list, grad_loss)
    File "C:\Users\-\AppData\Roaming\Python\Python310\site-packages\keras\optimizers\optimizer_v2\optimizer_v2.py", line 471, in _get_gradients
      grads = tape.gradient(loss, var_list, grad_loss)
Node: 'gradient_tape/sequential/rnn/while/gradients/sequential/rnn/while/lstm_cell/mul_2_grad/BroadcastGradientArgs'
Incompatible shapes: [0,0] vs. [64,64]
         [[{{node gradient_tape/sequential/rnn/while/gradients/sequential/rnn/while/lstm_cell/mul_2_grad/BroadcastGradientArgs}}]] [Op:__inference_train_function_1986]
PS Z:\Projects\Tensorflow-test\mnist-rnn>

On the other hand, after uninstalling tensorflow-directml-plugin everything works fine, with or without allow_cudnn_kernel=False (obviously, since I'm then running TensorFlow on the CPU).

Note: my GPU is an Intel HD Graphics 520.
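For context, the build_model toggle from that guide switches between the fused layer and the wrapped cell, roughly like this (a sketch following the guide's structure; units and shapes are illustrative):

import tensorflow as tf
from tensorflow import keras

def build_model(allow_cudnn_kernel=True):
    # The fused LSTM layer selects the CuDNN kernel when its criteria are met;
    # the RNN(LSTMCell) form always uses the generic kernel.
    if allow_cudnn_kernel:
        lstm_layer = keras.layers.LSTM(64, input_shape=(None, 28))
    else:
        lstm_layer = keras.layers.RNN(
            keras.layers.LSTMCell(64), input_shape=(None, 28))
    return keras.models.Sequential([
        lstm_layer,
        keras.layers.BatchNormalization(),
        keras.layers.Dense(10),
    ])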

zznww commented 1 year ago

Same problem here. But if you uninstall tensorflow-directml-plugin, doesn't that defeat the purpose of using DirectML, since you're only using the CPU? I'm not sure this relates to @Dbenjamy's AMD GPU specifically, as we've been using Intel HD Graphics GPUs.

kii1u20 commented 1 year ago

I experienced the same problem. I am trying to run keras-ocr through the directml plugin, but it fails with the same error. Any updates on the issue?

maggie1059 commented 1 year ago

Hi, this issue is on our radar, but we don't have an ETA for a fix yet. Please use the workarounds suggested by @Dbenjamy for the time being; we will post here when we have updates. Thanks for your patience!

benzene37 commented 1 year ago

@maggie1059 Any update on this issue? Thanks for the work you are doing.

Hoernchen commented 1 year ago

It is rather unfortunate that there is still no fix for this, because finding the workaround mentioned in this issue takes quite some time...

leo-smi commented 1 year ago

We need this fixed; I'm not gonna buy an NVIDIA.

Carter2565 commented 11 months ago

Temporary workaround:

Just came across this comment on the TensorFlow repo. In the comments, a user suggests using tf.compat.v1.keras.layers rather than tf.keras.layers. In my case:

I was having the same issue. I needed to use tf.compat.v1.keras.layers.GRU(128, return_sequences=True) rather than keras.layers.GRU(128, return_sequences=True) when defining my model:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=32, output_dim=64, input_length=32),
    # v1-compat GRU avoids the CudnnRNN kernel selection of the v2 layer
    tf.keras.layers.Bidirectional(tf.compat.v1.keras.layers.GRU(128, return_sequences=True)),
    tf.keras.layers.Dense(32, activation='softmax')
])

From my understanding, this uses the layer implementation from TensorFlow v1, hence why I say temporary. Please correct me as needed. I plan to try this fix on my AMD GPU in a week or so and will update this comment then.

mkchong0710 commented 4 months ago

(Quoting @Carter2565's tf.compat.v1.keras.layers workaround above.)

Still working as of this date. I tried the method from your post with both LSTM and GRU; both still work, on TensorFlow 2.10 with an AMD GPU.