microsoft / DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.

Intel HD 4600 graphics cannot run tensorflow-directml programs, but the HD 530 works fine. #33

Closed wangyuddd000 closed 4 years ago

wangyuddd000 commented 4 years ago

On the HD4600, the console shows:

WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:50: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\contrib\layers\python\layers\layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use layer.__call__ method instead.

WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\contrib\layers\python\layers\layers.py:1066: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.BatchNormalization instead. In particular, tf.control_dependencies(tf.GraphKeys.UPDATE_OPS) should not be used (consult the tf.keras.layers.batch_normalization documentation).

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:218: The name tf.losses.softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.softmax_cross_entropy instead.

WARNING:tensorflow:From D:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\ops\losses\losses_impl.py:121: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:219: The name tf.losses.get_total_loss is deprecated. Please use tf.compat.v1.losses.get_total_loss instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:220: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:223: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:223: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:234: The name tf.truncated_normal is deprecated. Please use tf.random.truncated_normal instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:238: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:248: The name tf.nn.xw_plus_b is deprecated. Please use tf.compat.v1.nn.xw_plus_b instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:254: The name tf.losses.mean_squared_error is deprecated. Please use tf.compat.v1.losses.mean_squared_error instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:258: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:267: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-07-15 21:43:32.488885: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 1 compatible adapters.
2020-07-15 21:43:32.501242: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-15 21:43:32.503347: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 0 (Intel(R) HD Graphics 4600)
2020-07-15 21:43:32.621172: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:269: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:271: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

2020-07-15 21:44:13.976556: F tensorflow/core/common_runtime/dml/dml_command_recorder.cc:372] Check failed: (((HRESULT)((dmldevice->GetDeviceRemovedReason()))) >= 0) == true (0 vs. 1)

On the HD530, the console shows:

WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

[The same deprecation warnings shown above for the HD4600 were printed here as well.]

2020-07-15 21:48:28.272086: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 2 compatible adapters.
2020-07-15 21:48:28.274903: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-15 21:48:28.275889: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 0 (Intel(R) HD Graphics 530)
2020-07-15 21:48:28.346440: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll
2020-07-15 21:48:28.366793: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 1 (Intel(R) HD Graphics 530)

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:269: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From D:\work\EkNiCuMine\train\tc7.py:271: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

2020-07-15 21:49:10.806959 number iterations: 0 cost is: 2.8335354 accuracy is: 0.0

adtsai commented 4 years ago

Hi, it appears that the HD4600 is running into a device removal error. Device removal can sometimes be triggered by driver errors, but it is more often due to the GPU timing out. The GPU can time out if an extremely large workload is submitted all at once, causing the device to appear hung. Does the same error occur if you try a smaller or less complex model?

For example if you run the following simple example, does it still return the same error on the HD4600?

 import tensorflow as tf

 # Log which device each op is placed on, and run ops eagerly.
 tf.debugging.set_log_device_placement(True)
 tf.enable_eager_execution()

 # A tiny matrix multiply, just enough to exercise the DirectML device.
 a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
 b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
 c = tf.matmul(a, b)
 print(c)

wangyuddd000 commented 4 years ago


The simple example works fine now:

WARNING:tensorflow:From D:\work\EkNiCuMine\train\t1.py:3: The name tf.enable_eager_execution is deprecated. Please use tf.compat.v1.enable_eager_execution instead.

2020-07-17 12:00:37.609541: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 1 compatible adapters.
2020-07-17 12:00:37.610001: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-17 12:00:37.611202: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 0 (Intel(R) HD Graphics 4600)
2020-07-17 12:00:37.619251: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll
2020-07-17 12:00:37.623759: I tensorflow/core/common_runtime/eager/execute.cc:571] Executing op MatMul in device /job:localhost/replica:0/task:0/device:DML:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

But why? The HD530 works perfectly.

adtsai commented 4 years ago

It does look like the workload might just be a little too much for the HD4600 to handle. Because it's an older GPU, it may not have enough compute power to train a large model without timing out. I believe the HD530 is a newer chip with more compute power, which would explain why it works while the HD4600 doesn't. You can try reducing the complexity of your model to get it running on the HD4600 (see the sketch below), but it may be simpler to stick with the HD530 if you have it available.
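As a minimal sketch of the kind of reduction that can help: the batch size, layer width, and input/output shapes below are illustrative assumptions, not values taken from the original tc7.py script.

 import tensorflow as tf

 # Illustrative knobs: smaller values mean smaller GPU submissions,
 # which makes a timeout (and hence device removal) less likely on
 # an older GPU such as the HD4600.
 BATCH_SIZE = 32     # assumed; used when slicing feed batches
 HIDDEN_UNITS = 128  # assumed; a narrower layer than a large model would use

 # Assumed placeholder shapes for a small classification task.
 x = tf.compat.v1.placeholder(tf.float32, [None, 784], name="inputs")
 y = tf.compat.v1.placeholder(tf.float32, [None, 10], name="labels")

 # A deliberately small fully connected network instead of a deep stack.
 hidden = tf.compat.v1.layers.dense(x, HIDDEN_UNITS, activation=tf.nn.relu)
 logits = tf.compat.v1.layers.dense(hidden, 10)

 loss = tf.compat.v1.losses.softmax_cross_entropy(y, logits)
 train_op = tf.compat.v1.train.AdamOptimizer(1e-4).minimize(loss)
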

adtsai commented 4 years ago

@wangyuddd000 We've released a new version, tensorflow-directml 1.15.3.dev200911, which should address this. The timeout behavior has been adjusted to allow larger and more complex models to run on older GPUs without timing out.
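For reference, upgrading in place would look something like the following; this assumes the package is managed with pip inside the Anaconda environment, as in the logs above:

 pip install --upgrade tensorflow-directml==1.15.3.dev200911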