microsoft / DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.
MIT License

DXGI_ERROR_DEVICE_REMOVED Error #95

Open douglastehling opened 3 years ago

douglastehling commented 3 years ago

Hello, I have a problem, and I don't know if anyone else has run into it. I have a Vega 8 GPU and the drivers are all installed correctly, but I get the error DXGI_ERROR_DEVICE_REMOVED when I try to run the following script.

import tensorflow.compat.v1 as tf
tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
print(tf.add([1.0, 2.0], [3.0, 4.0]))

I've already followed the instructions at https://aka.ms/tfdmltimeout, but they didn't help.

2021-03-31 11:29:36.810513: I tensorflow/stream_executor/platform/default/dso_loader.cc:98] Successfully opened dynamic library C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python/directml.bdb07c797e1e1af1b4a42d21c67ce5494d73991459.dll
2021-03-31 11:29:36.917148: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:126] DirectML device enumeration: found 1 compatible adapters.
[PhysicalDevice(name='/physical_device:DML:0', device_type='DML')]
2021-03-31 11:29:36.920996: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2021-03-31 11:29:36.925428: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:109] DirectML: creating device on adapter 0 (AMD Radeon(TM) Vega 8 Graphics)
2021-03-31 11:29:37.129830: E tensorflow/core/common_runtime/dml/dml_heap_allocator.cc:53] The DirectML device has encountered an unrecoverable error (DXGI_ERROR_DEVICE_REMOVED). This is most often caused by a timeout occurring on the GPU. Please visit https://aka.ms/tfdmltimeout for more information and troubleshooting steps.
2021-03-31 11:29:37.136448: F tensorflow/core/common_runtime/dml/dml_heap_allocator.cc:53] HRESULT failed with 0x887a0005: hr

I think this is the same problem I hit when I try to run:

python detect_video.py --video data/grca-trainmix_1280x720.mp4 --trace --max_frames 10 --headless

WARNING:tensorflow:From detect_video.py:39: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.

W0331 13:51:28.546197  3820 module_wrapper.py:139] From detect_video.py:39: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.

2021-03-31 13:51:28.806023: I tensorflow/stream_executor/platform/default/dso_loader.cc:98] Successfully opened dynamic library C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python/directml.bdb07c797e1af1b4a42d21c67ce5494d73991459.dll
2021-03-31 13:51:28.933164: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:126] DirectML device enumeration: found 1 compatible adapters.
2021-03-31 13:51:28.936741: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2021-03-31 13:51:28.940855: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:109] DirectML: creating device on adapter 0 (AMD Radeon(TM) Vega 8 Graphics)
WARNING:tensorflow:From detect_video.py:46: The name tf.RunOptions is deprecated. Please use tf.compat.v1.RunOptions instead.

W0331 13:51:29.155223  3820 module_wrapper.py:139] From detect_video.py:46: The name tf.RunOptions is deprecated. Please use tf.compat.v1.RunOptions instead.

WARNING:tensorflow:From C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0331 13:51:29.190702  3820 deprecation.py:506] From C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Traceback (most recent call last):
  File "detect_video.py", line 148, in <module>
    app.run(main)
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\absl\app.py", line 303, in run
    _run_main(main, args)
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "detect_video.py", line 65, in main
    yolo.load_weights(FLAGS.weights)
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 182, in load_weights
    return super(Model, self).load_weights(filepath, by_name)
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 1339, in load_weights
    pywrap_tensorflow.NewCheckpointReader(filepath)
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 877, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 889, in __init__
    this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./checkpoints/yolov3.tf: Not found: FindFirstFile failed for: ./checkpoints : The system cannot find the path specified.
; No such process

jstoecker commented 3 years ago

Looks like Radeon Vega 8 is an integrated GPU, and from the logs you shared it looks like it's having trouble allocating memory.

How much system memory (RAM) do you have? If you can provide a dxdiag.txt it would be helpful in understanding the capabilities of your system.

One thing you can try is lowering the default DML heap allocator's allocation size from 4GB to something smaller. For example, you can add these lines to the top of your first script (or set the environment variable elsewhere before the python process launches):

import os
os.environ["TF_DIRECTML_MAX_ALLOC_SIZE"] = "536870912" # 512MB
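
As a minimal sketch of the same idea, the byte value can be computed from a size in MiB rather than typed by hand. The helper name `set_dml_max_alloc_mib` is hypothetical and not part of TensorFlow-DirectML; only the `TF_DIRECTML_MAX_ALLOC_SIZE` variable comes from the lines above, and it still has to be set before `tensorflow` is imported.

```python
import os


def set_dml_max_alloc_mib(mib: int) -> str:
    """Set TF_DIRECTML_MAX_ALLOC_SIZE to `mib` mebibytes, expressed in bytes.

    Hypothetical helper for illustration; returns the string that was stored.
    """
    if mib <= 0:
        raise ValueError("mib must be a positive integer")
    value = str(mib * 1024 * 1024)  # environment variables are always strings
    os.environ["TF_DIRECTML_MAX_ALLOC_SIZE"] = value
    return value


# 512 MiB, the same value as the hand-written example above.
set_dml_max_alloc_mib(512)

# Import tensorflow only after this point so the setting is already in place.
```

If 512 MiB still fails on a given machine, trying a smaller value is a reasonable next step, though the right size for any particular GPU is not something this sketch can determine.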

You can also enable verbose logging, which will print even more details that might help here. Example:

import os
os.environ["TF_CPP_MIN_VLOG_LEVEL"] = "3"
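
Because both variables need to be in place before TensorFlow loads, a small guard can catch the case where the import has already happened. This is only a sketch: the function name `enable_verbose_tf_logging` is hypothetical, and the warning text reflects the assumption that a value set after import may be ignored.

```python
import os
import sys
import warnings


def enable_verbose_tf_logging(vlog_level: int = 3) -> bool:
    """Set TF_CPP_MIN_VLOG_LEVEL and report whether it was set early enough.

    Returns False if the tensorflow module was already imported in this
    process (assumption: the setting may then have no effect).
    """
    already_imported = "tensorflow" in sys.modules
    if already_imported:
        warnings.warn(
            "tensorflow is already imported; TF_CPP_MIN_VLOG_LEVEL may be ignored"
        )
    os.environ["TF_CPP_MIN_VLOG_LEVEL"] = str(vlog_level)
    return not already_imported


enable_verbose_tf_logging(3)
```

Calling this at the very top of the first script, before any `import tensorflow`, keeps the ordering requirement explicit.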

douglastehling commented 3 years ago

@jstoecker, thank you for your help. Here is the DxDiag.txt file. My PC has 6GB of RAM and 2GB of GPU memory. I tested the memory allocation parameter and it removed the error. Thanks a lot for the help!

jstoecker commented 3 years ago

Good to hear, and thanks for the dxdiag! I'll open a bug internally to see if we can improve this experience so it's not necessary to set an environment variable.

adtsai commented 3 years ago

One more thing to add - if you're still seeing the error with the yolov3 sample, don't forget to run setup.py first before trying detect_video.py, because it looks like it's having trouble finding the checkpoint file. :)
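
A quick way to confirm the checkpoint is present before calling `load_weights` is sketched below. It assumes the weights are stored in TensorFlow checkpoint format, where a prefix such as `./checkpoints/yolov3.tf` corresponds to a `.index` file plus one or more `.data-*` shard files; the helper name `tf_checkpoint_exists` is hypothetical.

```python
import glob
import os


def tf_checkpoint_exists(prefix: str) -> bool:
    """Return True if a TensorFlow-format checkpoint exists for `prefix`.

    Checks for `<prefix>.index` and at least one `<prefix>.data-*` shard.
    """
    has_index = os.path.isfile(prefix + ".index")
    # glob.escape protects any special characters in the path itself.
    has_data = bool(glob.glob(glob.escape(prefix) + ".data-*"))
    return has_index and has_data


weights = "./checkpoints/yolov3.tf"  # path taken from the traceback above
if not tf_checkpoint_exists(weights):
    print(f"Checkpoint not found at {weights}; run setup.py first.")
```

Running the check from the same working directory as detect_video.py matters, since the path in the traceback is relative.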

douglastehling commented 3 years ago

@jstoecker and @adtsai, the memory allocation setting really did work. One thing I noticed, though, is that detect_video.py is using shared memory and not dedicated memory. Do you know whether DirectML supports access to dedicated memory? I ask because object detection is very slow.

jstoecker commented 3 years ago

In short: yes, DirectML supports access to dedicated memory!

DirectML itself doesn't allocate memory for GPU resources: that's up to the application/framework using it, such as TensorFlow-DirectML (TFDML) in this case. TFDML has a number of allocators for different purposes, but the bulk of the memory (to store the tensors used in GPU calculations) will be backed by subregions of a so-called default heap. Default heaps reflect different memory pools based on the GPU architecture (UMA or NUMA/discrete).

Your Radeon Vega 8 is an integrated GPU, so the 2GB of dedicated memory you see isn't physical VRAM but rather reserved system memory. In other words, your system actually has 8GB of RAM, but the integrated GPU is claiming 2GB of it for exclusive access. This blog explains some of the differences between dedicated and shared memory, how they are reported in task manager, and some differences between discrete and integrated GPUs in this respect.

Integrated GPUs are, unfortunately, not going to be particularly fast in machine learning. It's worth pointing out that we haven't really optimized TFDML for integrated GPUs (e.g. we could avoid some memory copies since default-heap resources will always live in the "L0" memory pool); however, it's unlikely that you'll see huge performance gains over the CPU without using a more powerful discrete GPU.