microsoft / tensorflow-directml-plugin

DirectML PluggableDevice plugin for TensorFlow 2
Apache License 2.0

InvalidArgumentError: Graph execution error #364

Open Alphachain-Capital opened 1 year ago

Alphachain-Capital commented 1 year ago

Hello,

I am trying to run a model and getting an error where TensorFlow tries to use the CUDA-based CudnnRNN operation, which is not available because I'm running TensorFlow with DirectML, not CUDA. I have an NVIDIA GeForce RTX 3070 that I am trying to use as the GPU. Has anyone come across this issue before who can assist?

Here is the error:

```
WARNING:tensorflow:AutoGraph could not transform <function Model.make_train_function..train_function at 0x00000296030349D8> and will run it as-is. Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: 'arguments' object has no attribute 'posonlyargs'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function Model.make_train_function..train_function at 0x00000296030349D8> and will run it as-is. Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: 'arguments' object has no attribute 'posonlyargs'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert

InvalidArgumentError: Graph execution error:

No OpKernel was registered to support Op 'CudnnRNN' used by {{node CudnnRNN}} with these attrs: [seed=0, dropout=0, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=0, is_training=true]
Registered devices: [CPU, GPU]
Registered kernels:

	 [[CudnnRNN]]
	 [[sequential_3/lstm_4/PartitionedCall]] [Op:__inference_train_function_15221]
```
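
A minimal sketch of the kind of model that hits this (the layer sizes and data shapes are arbitrary placeholders, not my actual model):

```python
import numpy as np
import tensorflow as tf

# With the default LSTM arguments, Keras routes the layer through the fused
# 'CudnnRNN' kernel whenever a GPU device is registered. The DirectML plugin
# registers a GPU but does not register 'CudnnRNN', so model.fit fails with
# the InvalidArgumentError shown above.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(10, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(32, 10, 8).astype("float32")
y = np.random.rand(32, 1).astype("float32")
model.fit(x, y, epochs=1)  # fails: no kernel registered for 'CudnnRNN'
```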
TeaTimeChimp commented 1 year ago

I ran into a similar issue a few weeks ago whilst trying to get TF > 2.10 to use a GPU on Windows. I resorted to compiling TF and TFDML from source and debugging it. As far as I could see, the DirectML plugin does not register the 'CudnnRNN' OpKernel, hence the error.

'CudnnRNN' is an optimized kernel for vanilla LSTMs whose parameters are constrained as per the TF docs. If you deviate from those parameters, TF doesn't use the 'CudnnRNN' kernel and falls back to non-optimal kernels that do work, for example if you add a non-zero 'recurrent_dropout' to the LSTM layer (see the sketch below).

I noticed there is a branch which looks like it will include the 'CudnnRNN' OpKernel, but having downloaded and compiled it, it looks very much like a work in progress: you need to fix a few bugs to get it out of the starting blocks, and it clearly needs more work. I didn't find much help on this issue myself, so I hope my comments are of some use!
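
To make that concrete, a minimal sketch of the workaround (layer sizes are arbitrary):

```python
import tensorflow as tf

# Workaround sketch: any setting that breaks the cuDNN eligibility rules
# (here, a non-zero recurrent_dropout) makes Keras skip the fused 'CudnnRNN'
# path and fall back to the generic LSTM kernels, which DirectML can run.
# The fallback is slower than the fused kernel, but it works.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, recurrent_dropout=0.001, input_shape=(10, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```

Per the TF docs, other deviations from the cuDNN constraints (e.g. unroll=True or a non-default activation) should have the same effect, at the same cost of losing the fused kernel's speed.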

KiTant commented 1 year ago

If anyone wants to wait for this to be fixed, how long would the wait be?

TeaTimeChimp commented 1 year ago

Who's to say? This is the branch I was referring to: https://github.com/microsoft/tensorflow-directml-plugin/tree/user/wumaggie/cudnn-kernels. At the time of writing it's 9 months old, so, sad to say, this project appears to be low priority or at worst dead.

PatriceVignola commented 1 year ago

I apologize for the delay. We had to pause the development of this plugin until further notice. For the time being, all of the latest DirectML features and performance improvements are going into onnxruntime for inference scenarios. We'll update this issue if/when things change.
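
For anyone looking at that route, a minimal sketch of DirectML inference through onnxruntime, assuming the onnxruntime-directml package is installed and the model has already been exported to ONNX (e.g. with tf2onnx); the file name "model.onnx" and input name "input" are placeholders:

```python
import numpy as np
import onnxruntime as ort

# Run an exported ONNX model on the DirectML execution provider.
session = ort.InferenceSession("model.onnx", providers=["DmlExecutionProvider"])
x = np.random.rand(1, 10, 8).astype("float32")
outputs = session.run(None, {"input": x})
```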