microsoft / tensorflow-directml-plugin

DirectML PluggableDevice plugin for TensorFlow 2

HRESULT failed with 0x887a0001: dml_device_->GetDeviceRemovedReason() #359

Open · HeloWong opened this issue 1 year ago

HeloWong commented 1 year ago

Environment: TensorFlow 2.12, tensorflow_directml_plugin-0.5.0-cp39-cp39-win_amd64.whl, Python 3.9

Error: `F tensorflow/c/logging.cc:43] HRESULT failed with 0x887a0001: dml_device_->GetDeviceRemovedReason()`, after which Python restarts.

I built the newest tensorflow_directml_plugin (0.5.0), but when I run the MNIST example on TF, an error occurs: `F tensorflow/c/logging.cc:43] HRESULT failed with 0x887a0001: dml_device_->GetDeviceRemovedReason()`, and GPU memory and shared memory grow substantially.
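
For reference, the MNIST example I mean is roughly the standard Keras tutorial model; a sketch is below (my exact script may differ in minor details):

```python
# Sketch of the MNIST run that hits the error; mirrors the standard Keras
# tutorial model rather than the exact script used.
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# The device-removed HRESULT shows up during fit() on the DML device,
# alongside the GPU/shared memory growth mentioned above.
model.fit(x_train, y_train, epochs=10)
```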

FabricatiDiem commented 1 year ago

I have the exact same issue and message running a simple Keras model. Fresh install, etc.

maggie1059 commented 1 year ago

Hi @HeloWong, @FabricatiDiem, would you mind including the models that you saw this issue with? I'm not seeing this repro on the Keras tutorial model for MNIST, so it would be helpful for me to test using the scripts you're seeing this with. Please also double-check that your environment is using keras==2.12, as this latest version of the plugin is not compatible with previous versions of keras.
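
As a quick sanity check, something like the following should confirm which versions and devices your environment is actually picking up (not an official diagnostic, just a convenient check; the DML GPU should appear in the device list if the plugin loaded correctly):

```python
# Quick version/device sanity check: confirms which TensorFlow/Keras
# versions are installed and which physical devices TF sees.
import tensorflow as tf
import keras

print("TensorFlow:", tf.__version__)  # plugin 0.5.0 expects 2.12.x
print("Keras:", keras.__version__)    # should also be 2.12.x
print("Devices:", tf.config.list_physical_devices())  # DML GPU should appear here
```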

FabricatiDiem commented 1 year ago

This is a minimal example that somewhat more closely aligns with my actual use case: https://gist.github.com/FabricatiDiem/07b8645faabb1ea0a887550a0544ea9d

Note: the example works without error under WSL2 + Docker. It also tends to work if I tweak it, such as by removing the sparse representation (not feasible in my real use case), making the feature space smaller, or reducing the width of the network. It could be a memory issue, but I'm not seeing any memory-related errors, and if it were, I'd expect it to affect the Docker version too.
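
To illustrate what I mean by removing the sparse representation, here is a rough, self-contained sketch; the feature sizes and layer widths are made up, and my real pipeline is the gist above:

```python
# Rough illustration only: swapping a tf.sparse.SparseTensor input for a
# dense tensor. Sizes and widths here are hypothetical.
import tensorflow as tf

num_features = 50_000  # large, mostly-empty feature space (assumed)

indices = [[0, 10], [1, 42], [2, 7]]
values = [1.0, 1.0, 1.0]
x_sparse = tf.sparse.SparseTensor(indices, values, dense_shape=[3, num_features])

# Feeding the sparse tensor is what hits the device-removed error for me;
# densifying it first (when memory allows) avoids it.
x_dense = tf.sparse.to_dense(tf.sparse.reorder(x_sparse))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(num_features,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

y = tf.constant([[1.0], [0.0], [1.0]])
model.fit(x_dense, y, epochs=1)
```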

Also, just upgrading Keras to 2.12 breaks TF entirely for me. I'm using a fresh install of the latest tensorflow-directml-plugin package, which installs TF 2.10 and a number of other dependencies. I'm not able to try the bleeding-edge GitHub version on my local setup, so if this is already fixed but unreleased, I'm fine with my WSL2 + Docker setup until there's a new release.

Thanks for looking at the issue.

Edit: For completeness, my NVIDIA system information can be found here: https://gist.github.com/FabricatiDiem/fe0667aff7dc529a9b439112194f34b6

#341 looks similar, but I'm not sure.

radudiaconu0 commented 1 year ago

I have the same issue on my AMD GPU with the latest driver (23.4.3): on the SqueezeNet example at epoch 38 and on the MNIST example at epoch 11. I built the plugin from source with tensorflow-cpu 2.12.

NateAGeek commented 1 year ago

Having this issue too, on an AMD GPU, with this example: https://github.com/tensorflow/examples/blob/fb13f7e76d50b446b4b395abcdf09bd4aeddb29a/community/en/transformer_chatbot.ipynb

radudiaconu0 commented 1 year ago

any update here?

PatriceVignola commented 8 months ago

I apologize for the delay. We had to pause development of this plugin until further notice. For the time being, all of the latest DirectML features and performance improvements are going into onnxruntime for inference scenarios. We'll update this issue if/when things change.
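
For anyone moving GPU inference over, a minimal sketch of ONNX Runtime with the DirectML execution provider is below (requires the onnxruntime-directml package; the model path, input name lookup, and input shape are placeholders):

```python
# Minimal ONNX Runtime + DirectML inference sketch. "model.onnx" and the
# 1x3x224x224 input shape are placeholders; substitute your exported model.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],  # DML first, CPU fallback
)

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)

outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```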