microsoft / tensorflow-directml-plugin

DirectML PluggableDevice plugin for TensorFlow 2
Apache License 2.0

The DirectML device has encountered an unrecoverable error (DXGI_ERROR_DEVICE_REMOVED) #341

Closed by Neizvestnyj 1 year ago

Neizvestnyj commented 1 year ago

System:

Log:

2022-12-19 02:47:33.700540: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-19 02:47:33.701303: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (Radeon RX 580 Series)
2022-12-19 02:47:33.791806: I tensorflow/c/logging.cc:34] Successfully opened dynamic library Kernel32.dll
2022-12-19 02:47:33.793608: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-12-19 02:47:33.793865: W tensorflow/core/common_runtime/pluggable_device/pluggable_device_bfc_allocator.cc:28] Overriding allow_growth setting because force_memory_growth was requested by the device.
2022-12-19 02:47:33.794200: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14004 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2022-12-19 02:47:45.714334: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2022-12-19 02:47:46.246620: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-12-19 02:47:46.247027: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14004 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2022-12-19 02:47:46.252054: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-12-19 02:47:46.252296: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14004 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2022-12-19 02:47:46.255226: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-12-19 02:47:46.255469: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14004 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2022-12-19 02:47:46.260202: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-12-19 02:47:46.260444: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14004 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2022-12-19 02:47:46.263423: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-12-19 02:47:46.263817: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14004 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2022-12-19 02:47:46.405264: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-12-19 02:47:46.405505: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14004 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2022-12-19 02:48:32.256535: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
 6/25 [======>.......................] - ETA: 56s
2022-12-19 02:49:06.291523: E tensorflow/c/logging.cc:40] The DirectML device has encountered an unrecoverable error (DXGI_ERROR_DEVICE_REMOVED). This is most often caused by a timeout occurring on the GPU. Please visit https://aka.ms/tfdmltimeout for more information and troubleshooting steps.
2022-12-19 02:49:06.291925: F tensorflow/c/logging.cc:43] HRESULT failed with 0x887a0005: readback_heap->Map(0, nullptr, &readback_heap_data)
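
Note on the last two log lines: the HRESULT 0x887a0005 in the fatal message is the numeric value of DXGI_ERROR_DEVICE_REMOVED, so both lines report the same failure rather than two separate ones. A minimal sketch of a lookup for this family of codes follows. The table is a small, non-exhaustive subset of the DXGI error values, and `describe_hresult` is a hypothetical helper name, not part of TensorFlow or the plugin.

```python
# Subset of DXGI error codes (not exhaustive); values as defined in the Windows SDK.
DXGI_ERRORS = {
    0x887A0001: "DXGI_ERROR_INVALID_CALL",
    0x887A0005: "DXGI_ERROR_DEVICE_REMOVED",
    0x887A0006: "DXGI_ERROR_DEVICE_HUNG",
    0x887A0007: "DXGI_ERROR_DEVICE_RESET",
    0x887A0020: "DXGI_ERROR_DRIVER_INTERNAL_ERROR",
}


def describe_hresult(text: str) -> str:
    """Hypothetical helper: map a hex HRESULT string from a log line to its DXGI name."""
    code = int(text, 16)  # base 16 accepts an optional "0x" prefix
    return DXGI_ERRORS.get(code, f"unknown HRESULT 0x{code:08x}")


print(describe_hresult("0x887a0005"))
```

A code of DEVICE_HUNG or DRIVER_INTERNAL_ERROR instead would point at a different failure than a removed device.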

I use this model

This error occurs every other time. Setting TF_DIRECTML_MAX_ALLOC_SIZE does not help:

import os
os.environ["TF_DIRECTML_MAX_ALLOC_SIZE"] = "536870912"  # 512 * 1024 * 1024
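
For anyone trying the same workaround, here is a minimal sketch of setting the variable a bit more defensively. It assumes, without confirmation from this thread, that the plugin reads TF_DIRECTML_MAX_ALLOC_SIZE when TensorFlow is first loaded, so the value has to be in the environment before `import tensorflow`. `set_dml_max_alloc` is a hypothetical helper name, not an API of the plugin.

```python
import os
import sys
import warnings

ENV_VAR = "TF_DIRECTML_MAX_ALLOC_SIZE"  # variable name taken from the issue text


def set_dml_max_alloc(size: int) -> bool:
    """Hypothetical helper: export the limit and report whether it was set early.

    Returns True if TensorFlow had not been imported yet when the variable was set.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    os.environ[ENV_VAR] = str(size)
    set_early = "tensorflow" not in sys.modules
    if not set_early:
        # Assumption: a value set after import may be ignored by the plugin.
        warnings.warn(f"{ENV_VAR} was set after tensorflow was imported")
    return set_early


set_dml_max_alloc(512 * 1024 * 1024)  # same value as in the issue: 536870912
```

This only makes sure the variable is set at the point the sketch assumes it matters. It does not address the device-removed error itself.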

Neizvestnyj commented 1 year ago

This is an AMD-related error. AMD Adrenalin driver version 22.11.2 does not work correctly; you need to use AMD Adrenalin 22.2.3 instead.