microsoft / tensorflow-directml-plugin

DirectML PluggableDevice plugin for TensorFlow 2
Apache License 2.0

ValueError: Received incompatible tensor with shape (1, 1, 256, 128) #371

Open leo-smi opened 9 months ago

leo-smi commented 9 months ago

I'm getting an error when running the original VOC2012 example for training an object detection model.

I partially re-trained the example to generate a checkpoint for a quick test, and then ran train_voc.py to generate the checkpoint model yolov3_train_1.tf.

If I use the original yolov3.tf (produced by running setup.py) in detect.py, it works fine, but I don't know what is going wrong with the new checkpoint that I trained (yolov3_train_1.tf).

With the original checkpoint (yolov3.tf) I get this log:

I0923 21:50:23.414862 16292 server.py:122] listener closed
I0923 21:50:23.414862 16292 server.py:270] server has terminated
2023-09-23 21:50:26.293868: I tensorflow/c/logging.cc:34] Successfully opened dynamic library C:\Users\leand\.conda\envs\yolo_env\lib\site-packages\tensorflow-plugins/directml/directml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.dll
2023-09-23 21:50:26.294425: I tensorflow/c/logging.cc:34] Successfully opened dynamic library dxgi.dll
2023-09-23 21:50:26.296333: I tensorflow/c/logging.cc:34] Successfully opened dynamic library d3d12.dll
2023-09-23 21:50:26.382625: I tensorflow/c/logging.cc:34] DirectML device enumeration: found 1 compatible adapters.
2023-09-23 21:50:27.042213: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-23 21:50:27.042797: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (Intel(R) Iris(R) Xe Graphics)
2023-09-23 21:50:27.089313: I tensorflow/c/logging.cc:34] Successfully opened dynamic library Kernel32.dll
2023-09-23 21:50:27.090007: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-23 21:50:27.090312: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8805 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
I0923 21:50:30.521184 15116 detect.py:42] weights loaded
I0923 21:50:30.521184 15116 detect.py:45] classes loaded
2023-09-23 21:50:30.813031: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-23 21:50:30.813582: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8805 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2023-09-23 21:50:30.819849: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-23 21:50:30.820059: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8805 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2023-09-23 21:50:30.844042: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-23 21:50:30.844378: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8805 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2023-09-23 21:50:30.855851: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-23 21:50:30.856020: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8805 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2023-09-23 21:50:30.861914: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-23 21:50:30.862240: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8805 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2023-09-23 21:50:30.886910: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-23 21:50:30.887077: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8805 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2023-09-23 21:50:30.890870: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-23 21:50:30.891261: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8805 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
...
2023-09-23 21:50:30.932467: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-23 21:50:30.932625: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8805 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2023-09-23 21:50:30.953731: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-23 21:50:30.953925: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8805 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
I0923 21:50:30.953699 15116 detect.py:62] time: 0.348203182220459
I0923 21:50:30.953699 15116 detect.py:64] detections:
I0923 21:50:30.975726 15116 detect.py:66]   car, 0.9748033285140991, [0.24280313 0.2940823  0.37936452 0.39064947]
I0923 21:50:30.975726 15116 detect.py:66]   bird, 0.9257763028144836, [0.42066154 0.05204896 0.5695095  0.14767769]
I0923 21:50:30.975726 15116 detect.py:66]   bus, 0.8974265456199646, [0.02114373 0.2955287  0.18957907 0.4136036 ]
I0923 21:50:30.975726 15116 detect.py:66]   pottedplant, 0.8646465539932251, [0.01896213 0.7604627  0.18410942 0.9198695 ]
I0923 21:50:30.975726 15116 detect.py:66]   motorbike, 0.7518333196640015, [0.60587585 0.5450276  0.73023164 0.6556344 ]
I0923 21:50:30.991388 15116 detect.py:66]   dog, 0.7247338891029358, [0.20027779 0.52045083 0.3819927  0.8443599 ]
I0923 21:50:30.991388 15116 detect.py:66]   person, 0.6809608340263367, [0.8091381  0.51088405 0.9747042  0.6712277 ]
I0923 21:50:30.991388 15116 detect.py:66]   motorbike, 0.6792984008789062, [0.708943   0.53600967 0.7793795  0.6508231 ]
I0923 21:50:30.991388 15116 detect.py:66]   tvmonitor, 0.6079525351524353, [0.8371077  0.77706623 0.9707338  0.90270257]
I0923 21:50:31.006989 15116 detect.py:66]   bicycle, 0.5827189683914185, [0.22785279 0.03375612 0.3761612  0.17757493]
I0923 21:50:31.006989 15116 detect.py:66]   dog, 0.5704740881919861, [0.38914445 0.5364528  0.51281893 0.6639898 ]
I0923 21:50:31.107016 15116 detect.py:73] output saved to: ./output.jpg

With the trained checkpoint (yolov3_train_1.tf) I get:

2023-09-23 21:54:25.902047: I tensorflow/c/logging.cc:34] Successfully opened dynamic library C:\Users\leand\.conda\envs\yolo_env\lib\site-packages\tensorflow-plugins/directml/directml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.dll
2023-09-23 21:54:25.902577: I tensorflow/c/logging.cc:34] Successfully opened dynamic library dxgi.dll
2023-09-23 21:54:25.904550: I tensorflow/c/logging.cc:34] Successfully opened dynamic library d3d12.dll
2023-09-23 21:54:26.003354: I tensorflow/c/logging.cc:34] DirectML device enumeration: found 1 compatible adapters.
2023-09-23 21:54:26.661228: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-23 21:54:26.662240: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (Intel(R) Iris(R) Xe Graphics)
2023-09-23 21:54:26.703601: I tensorflow/c/logging.cc:34] Successfully opened dynamic library Kernel32.dll
2023-09-23 21:54:26.704314: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-23 21:54:26.704649: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8805 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027AF6F1A710> and <keras.layers.activation.leaky_relu.LeakyReLU object at 0x0000027AF6F1AEF0>).
W0923 21:54:29.208242  2608 restore.py:84] Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027AF6F1A710> and <keras.layers.activation.leaky_relu.LeakyReLU object at 0x0000027AF6F1AEF0>).
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.convolutional.conv2d.Conv2D object at 0x0000027AF708B430> and <keras.layers.reshaping.zero_padding2d.ZeroPadding2D object at 0x0000027AF701D990>).
W0923 21:54:29.208242  2608 restore.py:84] Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.convolutional.conv2d.Conv2D object at 0x0000027AF708B430> and <keras.layers.reshaping.zero_padding2d.ZeroPadding2D object at 0x0000027AF701D990>).
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027AF708BA60> and <keras.layers.convolutional.conv2d.Conv2D object at 0x0000027AF708B430>).
W0923 21:54:29.208242  2608 restore.py:84] Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027AF708BA60> and <keras.layers.convolutional.conv2d.Conv2D object at 0x0000027AF708B430>).
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.convolutional.conv2d.Conv2D object at 0x0000027AF70CEF20> and <keras.layers.activation.leaky_relu.LeakyReLU object at 0x0000027AF70AC580>).
W0923 21:54:29.223865  2608 restore.py:84] Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.convolutional.conv2d.Conv2D object at 0x0000027AF70CEF20> and <keras.layers.activation.leaky_relu.LeakyReLU object at 0x0000027AF70AC580>).
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.convolutional.conv2d.Conv2D object at 0x0000027AF70E9FF0> and <keras.layers.activation.leaky_relu.LeakyReLU object at 0x0000027AF70EB340>).
...
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027A9A9730D0> and <keras.engine.input_layer.InputLayer object at 0x0000027A9A94D810>).
W0923 21:54:29.386797  2608 restore.py:84] Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027A9A9730D0> and <keras.engine.input_layer.InputLayer object at 0x0000027A9A94D810>).
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.convolutional.conv2d.Conv2D object at 0x0000027A9A94CD30> and <keras.layers.merging.concatenate.Concatenate object at 0x0000027A9A94E110>).
W0923 21:54:29.386797  2608 restore.py:84] Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.convolutional.conv2d.Conv2D object at 0x0000027A9A94CD30> and <keras.layers.merging.concatenate.Concatenate object at 0x0000027A9A94E110>).
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027A9A94CD60> and <keras.layers.convolutional.conv2d.Conv2D object at 0x0000027A9A94CD30>).
W0923 21:54:29.386797  2608 restore.py:84] Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027A9A94CD60> and <keras.layers.convolutional.conv2d.Conv2D object at 0x0000027A9A94CD30>).
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.convolutional.conv2d.Conv2D object at 0x0000027A9A92F130> and <keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027A9A94CD60>).
W0923 21:54:29.386797  2608 restore.py:84] Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.convolutional.conv2d.Conv2D object at 0x0000027A9A92F130> and <keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027A9A94CD60>).
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027AFD7C9510> and <keras.layers.activation.leaky_relu.LeakyReLU object at 0x0000027A9A94C670>).
W0923 21:54:29.386797  2608 restore.py:84] Inconsistent references when loading the checkpoint into this object graph. For example, in the saved checkpoint object, `model.layer.weight` and `model.layer_copy.weight` reference the same variable, while in the current object these are two different variables. The referenced variables are:(<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x0000027AFD7C9510> and <keras.layers.activation.leaky_relu.LeakyReLU object at 0x0000027A9A94C670>).

Traceback (most recent call last):
  File "E:\sandbox\DirectML-master\TensorFlow\TF2\yolov3-tf2\detect.py", line 78, in <module>
    app.run(main)
  File "C:\Users\leand\AppData\Roaming\Python\Python310\site-packages\absl\app.py", line 308, in run
    _run_main(main, args)
  File "C:\Users\leand\AppData\Roaming\Python\Python310\site-packages\absl\app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "E:\sandbox\DirectML-master\TensorFlow\TF2\yolov3-tf2\detect.py", line 41, in main
    yolo.load_weights(FLAGS.weights).expect_partial()
  File "C:\Users\leand\AppData\Roaming\Python\Python310\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\leand\AppData\Roaming\Python\Python310\site-packages\tensorflow\python\training\saving\saveable_object_util.py", line 135, in restore
    raise ValueError(
ValueError: Received incompatible tensor with shape (1, 1, 256, 128) when attempting to restore variable with shape (1, 1, 128, 64) and name layer_with_weights-0/layer_with_weights-10/kernel/.ATTRIBUTES/VARIABLE_VALUE.
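
One way to narrow down a restore error like the one above is to list the variable shapes stored in each checkpoint and see where they differ. The following is only a minimal sketch, not a confirmed diagnosis of this issue: `checkpoint_shapes` and `shape_mismatches` are hypothetical helper names, the toy shape maps are made up for illustration (they reuse the key and the two shapes quoted in the error), and the only TensorFlow call used is `tf.train.list_variables`, which returns `(key, shape)` pairs for a checkpoint.

```python
def checkpoint_shapes(prefix):
    """Map each variable key stored in a checkpoint to its shape (needs TensorFlow)."""
    import tensorflow as tf  # lazy import: the comparison helper below works without TF

    return {name: tuple(shape) for name, shape in tf.train.list_variables(prefix)}


def shape_mismatches(shapes_a, shapes_b):
    """Return {key: (shape_a, shape_b)} for keys in both maps whose shapes differ."""
    return {
        key: (tuple(shapes_a[key]), tuple(shapes_b[key]))
        for key in sorted(shapes_a.keys() & shapes_b.keys())
        if tuple(shapes_a[key]) != tuple(shapes_b[key])
    }


# Toy shape maps standing in for real checkpoints (illustrative values only;
# the first key and its two shapes are copied from the error message above).
key = "layer_with_weights-0/layer_with_weights-10/kernel/.ATTRIBUTES/VARIABLE_VALUE"
expected = {key: (1, 1, 128, 64), "other/kernel": (3, 3, 3, 32)}
saved = {key: (1, 1, 256, 128), "other/kernel": (3, 3, 3, 32)}

diff = shape_mismatches(expected, saved)
for name, (a, b) in diff.items():
    print(f"{name}: {a} vs {b}")
```

With TensorFlow installed, the same comparison could be run on the two real checkpoints, for example `shape_mismatches(checkpoint_shapes("yolov3.tf"), checkpoint_shapes("yolov3_train_1.tf"))`, adjusting the paths to wherever the files live. Keys are only compared when they appear in both checkpoints, so a difference in how the two files were saved may hide some variables from the report.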

PatriceVignola commented 8 months ago

Development on this plugin has been paused for now. For the time being, all latest DirectML features and performance improvements are going into onnxruntime for inference scenarios. We'll update this issue if/when things change.