oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

Generic OpenCL kernels are broken #1960

Open nwnk opened 2 weeks ago

nwnk commented 2 weeks ago

The build documentation claims that generic OpenCL kernels are always available. I wanted to verify that they worked, and the straightforward way to do that seemed to be this:

commit 221691fab2a936267c0cf352e9b9b64ebf813973 (HEAD)
Author: Adam Jackson <ajax@redhat.com>
Date:   Mon Jun 10 22:03:30 2024 -0400

    cmake: Allow building no gen-specific OpenCL kernels

diff --git a/cmake/configuring_primitive_list.cmake b/cmake/configuring_primitive_list.cmake
index 3524f17107..75333fd1e6 100644
--- a/cmake/configuring_primitive_list.cmake
+++ b/cmake/configuring_primitive_list.cmake
@@ -55,6 +55,8 @@ message(STATUS "Enabled primitive CPU ISA: ${DNNL_ENABLE_PRIMITIVE_CPU_ISA}")

 if (DNNL_ENABLE_PRIMITIVE_GPU_ISA STREQUAL "ALL")
     set(BUILD_PRIMITIVE_GPU_ISA_ALL TRUE)
+elseif (DNNL_ENABLE_PRIMITIVE_GPU_ISA STREQUAL "NONE")
+    #
 else()
     foreach(isa ${DNNL_ENABLE_PRIMITIVE_GPU_ISA})
         string(TOUPPER ${isa} uisa)

And that builds! And it works more than it doesn't! With the Intel oneAPI 2024.1 DPC++ compiler, I built 3c0e1f1635c81ae9074f2deeff9977a2a8ef149d with the above patch and the SYCL CPU and GPU backends. (I am not using the OpenCL driver from the oneAPI release. I am using Fedora 40's build of the Intel Compute Runtime, intel-compute-runtime-24.09.28717.17-1.fc40.x86_64. I don't expect that matters much here, but I can try with a different version if it helps.)

With the normal build, ctest says:

99% tests passed, 6 tests failed out of 453

Total Test time (real) = 6392.06 sec

The following tests FAILED:
        406 - test_benchdnn_modeC_concat_ci_gpu (Failed)
        408 - test_benchdnn_modeC_conv_gpu_ci_gpu (Failed)
        410 - test_benchdnn_modeC_deconv_ci_gpu (Failed)
        416 - test_benchdnn_modeC_graph_ci_gpu (Failed)
        432 - test_benchdnn_modeC_reorder_ci_gpu (Failed)
        450 - test_benchdnn_modeC_sum_ci_gpu (Failed)

Then, I rebuilt with DNNL_ENABLE_PRIMITIVE_GPU_ISA set to NONE, and ctest said:

78% tests passed, 99 tests failed out of 453

Total Test time (real) = 4957.57 sec

The following tests FAILED:
      4 - gpu-cnn-inference-f32-cpp (Failed)
      6 - gpu-cnn-inference-int8-cpp (Failed)
      8 - gpu-cnn-training-bf16-cpp (Failed)
     10 - gpu-cnn-training-f32-cpp (Failed)
     15 - gpu-graph-sycl-getting-started-cpp (Failed)
     16 - cpu-graph-sycl-single-op-partition-cpp (Failed)
     17 - gpu-graph-sycl-single-op-partition-cpp (Failed)
     19 - gpu-matmul-perf-cpp (Failed)
     21 - gpu-memory-format-propagation-cpp (Failed)
     23 - gpu-performance-profiling-cpp (Failed)
     33 - gpu-primitives-convolution-cpp (Failed)
     39 - gpu-primitives-inner-product-cpp (Failed)
     43 - gpu-primitives-lbr-gru-cpp (Failed)
     47 - gpu-primitives-lstm-cpp (SEGFAULT)
     49 - gpu-primitives-matmul-cpp (Failed)
     61 - gpu-primitives-shuffle-cpp (Failed)
     65 - gpu-primitives-sum-cpp (Failed)
     67 - gpu-primitives-vanilla-rnn-cpp (Failed)
     69 - gpu-rnn-training-f32-cpp (Failed)
     75 - gpu-tutorials-matmul-inference-int8-matmul-cpp (Failed)
     84 - test_binary_gpu (Failed)
     86 - test_binary_buffer_gpu (Failed)
     88 - test_concat_gpu (Failed)
     90 - test_concat_buffer_gpu (Failed)
     92 - test_concurrency_gpu (Failed)
     94 - test_concurrency_buffer_gpu (Failed)
     96 - test_convolution_backward_data_f32_gpu (Failed)
     98 - test_convolution_backward_data_f32_buffer_gpu (Failed)
    100 - test_convolution_backward_weights_f32_gpu (Failed)
    102 - test_convolution_backward_weights_f32_buffer_gpu (Failed)
    104 - test_convolution_eltwise_forward_f32_gpu (Failed)
    106 - test_convolution_eltwise_forward_f32_buffer_gpu (Failed)
    108 - test_convolution_eltwise_forward_x8s8f32s32_gpu (Failed)
    110 - test_convolution_eltwise_forward_x8s8f32s32_buffer_gpu (Failed)
    112 - test_convolution_forward_f32_gpu (Failed)
    114 - test_convolution_forward_f32_buffer_gpu (Failed)
    123 - test_cross_engine_reorder_buffer (Failed)
    125 - test_deconvolution_gpu (Failed)
    127 - test_deconvolution_buffer_gpu (Failed)
    177 - test_inner_product_backward_data_gpu (Failed)
    179 - test_inner_product_backward_data_buffer_gpu (Failed)
    181 - test_inner_product_backward_weights_gpu (Failed)
    183 - test_inner_product_backward_weights_buffer_gpu (Failed)
    185 - test_inner_product_forward_gpu (Failed)
    187 - test_inner_product_forward_buffer_gpu (Failed)
    197 - test_matmul_gpu (Failed)
    199 - test_matmul_buffer_gpu (Failed)
    201 - test_persistent_cache_api_gpu (Failed)
    203 - test_persistent_cache_api_buffer_gpu (Failed)
    209 - test_pooling_forward_gpu (Failed)
    211 - test_pooling_forward_buffer_gpu (Failed)
    217 - test_primitive_cache_mt_gpu (Failed)
    219 - test_primitive_cache_mt_buffer_gpu (Failed)
    225 - test_reorder_gpu (Failed)
    227 - test_reorder_buffer_gpu (Failed)
    237 - test_shuffle_gpu (Failed)
    239 - test_shuffle_buffer_gpu (Failed)
    245 - test_sum_gpu (Failed)
    247 - test_sum_buffer_gpu (Failed)
    298 - test_api (Failed)
    299 - test_api_buffer (Failed)
    304 - test_api_sycl (Failed)
    317 - test_graph_c_api_compile_usm_gpu (Failed)
    319 - test_graph_c_api_compile_parametrized_usm_gpu (Failed)
    321 - test_graph_cpp_api_compile_usm_gpu (Failed)
    323 - test_graph_cpp_api_partition_usm_gpu (Failed)
    325 - test_graph_cpp_api_compiled_partition_sycl_usm_gpu (Failed)
    353 - test_graph_unit_dnnl_batch_norm_usm_gpu (Failed)
    355 - test_graph_unit_dnnl_binary_op_usm_gpu (Failed)
    357 - test_graph_unit_dnnl_bmm_usm_gpu (Failed)
    359 - test_graph_unit_dnnl_compiled_partition_usm_gpu (Failed)
    361 - test_graph_unit_dnnl_concat_usm_gpu (Failed)
    363 - test_graph_unit_dnnl_conv_usm_gpu (Failed)
    365 - test_graph_unit_dnnl_convtranspose_usm_gpu (Failed)
    367 - test_graph_unit_dnnl_dequantize_usm_gpu (Failed)
    369 - test_graph_unit_dnnl_eltwise_usm_gpu (Failed)
    373 - test_graph_unit_dnnl_large_partition_usm_gpu (Failed)
    377 - test_graph_unit_dnnl_matmul_usm_gpu (Failed)
    381 - test_graph_unit_dnnl_pool_usm_gpu (Failed)
    385 - test_graph_unit_dnnl_quantize_usm_gpu (Failed)
    387 - test_graph_unit_dnnl_reduce_usm_gpu (Failed)
    389 - test_graph_unit_dnnl_reorder_usm_gpu (Failed)
    393 - test_graph_unit_dnnl_softmax_usm_gpu (Failed)
    406 - test_benchdnn_modeC_concat_ci_gpu (Failed)
    408 - test_benchdnn_modeC_conv_gpu_ci_gpu (Failed)
    410 - test_benchdnn_modeC_deconv_ci_gpu (Failed)
    412 - test_benchdnn_modeC_eltwise_ci_gpu (Failed)
    416 - test_benchdnn_modeC_graph_ci_gpu (Subprocess aborted)
    418 - test_benchdnn_modeC_ip_ci_gpu (Failed)
    424 - test_benchdnn_modeC_matmul_ci_gpu (Failed)
    426 - test_benchdnn_modeC_pool_ci_gpu (Failed)
    432 - test_benchdnn_modeC_reorder_ci_gpu (Failed)
    437 - test_benchdnn_modeC_gru_ci_gpu (SEGFAULT)
    438 - test_benchdnn_modeC_lstm_ci_gpu (SEGFAULT)
    439 - test_benchdnn_modeC_rnn_ci_gpu (SEGFAULT)
    444 - test_benchdnn_modeC_self_ci_gpu (Failed)
    446 - test_benchdnn_modeC_shuffle_ci_gpu (Failed)
    448 - test_benchdnn_modeC_softmax_ci_gpu (Failed)
    450 - test_benchdnn_modeC_sum_ci_gpu (Failed)

So 93 new failures. 107 GPU tests did pass, though, so it seems like this should work. This is on a gen9 GPU, specifically:

% lspci -vnn -s 0:2
00:02.0 Display controller [0380]: Intel Corporation CometLake-S GT2 [UHD Graphics 630] [8086:9bc5] (rev 05)

Since GEN9 is the lowest ISA specifically supported, this suggests that some of the generic OpenCL kernels are broken.
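
For anyone who wants to reproduce the "new failures" count from their own runs, here is a minimal sketch that diffs two ctest summaries by test name. The helper names (`failed_tests`, `new_failures`) are made up for illustration, and the two inputs are abbreviated excerpts of the lists above, not the full logs. Matching on the name rather than the test number means a test whose status only changes (for example from Failed to Subprocess aborted) is not counted as new, and the comparison would still work if the numbering differed between builds.

```python
import re

# Matches ctest summary lines such as
# "    406 - test_benchdnn_modeC_concat_ci_gpu (Failed)".
FAILED_LINE = re.compile(r"^\s*\d+\s+-\s+(\S+)\s+\((.+)\)\s*$")


def failed_tests(ctest_summary):
    """Return the set of test names listed as failing in a ctest summary."""
    names = set()
    for line in ctest_summary.splitlines():
        match = FAILED_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


def new_failures(baseline, candidate):
    """Return, sorted, the tests that fail in candidate but not in baseline."""
    return sorted(failed_tests(candidate) - failed_tests(baseline))


# Abbreviated excerpts of the two summaries (hypothetical subset, not the full logs).
baseline = """
The following tests FAILED:
        406 - test_benchdnn_modeC_concat_ci_gpu (Failed)
        416 - test_benchdnn_modeC_graph_ci_gpu (Failed)
        450 - test_benchdnn_modeC_sum_ci_gpu (Failed)
"""

candidate = """
The following tests FAILED:
      4 - gpu-cnn-inference-f32-cpp (Failed)
     47 - gpu-primitives-lstm-cpp (SEGFAULT)
    406 - test_benchdnn_modeC_concat_ci_gpu (Failed)
    416 - test_benchdnn_modeC_graph_ci_gpu (Subprocess aborted)
    450 - test_benchdnn_modeC_sum_ci_gpu (Failed)
"""

for name in new_failures(baseline, candidate):
    print(name)
```

Feeding it the complete "The following tests FAILED:" sections from both builds should list only the failures introduced by the NONE setting.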

nwnk commented 2 weeks ago

For additional data, with the OMP and OCL backends, the same baseline tests fail without the NONE setting; with it set, the OCL backend seems to be in better shape than SYCL:

76% tests passed, 78 tests failed out of 322

Total Test time (real) = 5526.59 sec

The following tests FAILED:
      7 - test_binary_gpu (Failed)
      8 - test_binary_buffer_gpu (Failed)
     10 - test_concat_gpu (Failed)
     11 - test_concat_buffer_gpu (Failed)
     13 - test_concurrency_gpu (Failed)
     14 - test_concurrency_buffer_gpu (Failed)
     16 - test_convolution_backward_data_f32_gpu (Failed)
     17 - test_convolution_backward_data_f32_buffer_gpu (Failed)
     19 - test_convolution_backward_weights_f32_gpu (Failed)
     20 - test_convolution_backward_weights_f32_buffer_gpu (Failed)
     22 - test_convolution_eltwise_forward_f32_gpu (Failed)
     23 - test_convolution_eltwise_forward_f32_buffer_gpu (Failed)
     25 - test_convolution_eltwise_forward_x8s8f32s32_gpu (Failed)
     26 - test_convolution_eltwise_forward_x8s8f32s32_buffer_gpu (Failed)
     28 - test_convolution_forward_f32_gpu (Failed)
     29 - test_convolution_forward_f32_buffer_gpu (Failed)
     36 - test_cross_engine_reorder (Failed)
     37 - test_cross_engine_reorder_buffer (Failed)
     39 - test_deconvolution_gpu (Failed)
     40 - test_deconvolution_buffer_gpu (Failed)
     78 - test_inner_product_backward_data_gpu (Failed)
     79 - test_inner_product_backward_data_buffer_gpu (Failed)
     81 - test_inner_product_backward_weights_gpu (Failed)
     82 - test_inner_product_backward_weights_buffer_gpu (Failed)
     84 - test_inner_product_forward_gpu (Failed)
     85 - test_inner_product_forward_buffer_gpu (Failed)
     93 - test_matmul_gpu (Failed)
     94 - test_matmul_buffer_gpu (Failed)
     96 - test_persistent_cache_api_gpu (Failed)
     97 - test_persistent_cache_api_buffer_gpu (Failed)
    102 - test_pooling_forward_gpu (Failed)
    103 - test_pooling_forward_buffer_gpu (Failed)
    108 - test_primitive_cache_mt_gpu (Subprocess aborted)
    109 - test_primitive_cache_mt_buffer_gpu (Subprocess aborted)
    114 - test_reorder_gpu (Failed)
    115 - test_reorder_buffer_gpu (Failed)
    123 - test_shuffle_gpu (Failed)
    124 - test_shuffle_buffer_gpu (Failed)
    129 - test_sum_gpu (Failed)
    130 - test_sum_buffer_gpu (Failed)
    170 - test_api (Failed)
    188 - test_graph_c_api_compile_usm_gpu (Failed)
    190 - test_graph_c_api_compile_parametrized_usm_gpu (Failed)
    192 - test_graph_cpp_api_compile_usm_gpu (Failed)
    194 - test_graph_cpp_api_partition_usm_gpu (Failed)
    196 - test_graph_cpp_api_compiled_partition_ocl_gpu (Failed)
    221 - test_graph_unit_dnnl_batch_norm_usm_gpu (Failed)
    223 - test_graph_unit_dnnl_binary_op_usm_gpu (Failed)
    225 - test_graph_unit_dnnl_bmm_usm_gpu (Failed)
    227 - test_graph_unit_dnnl_compiled_partition_usm_gpu (Failed)
    229 - test_graph_unit_dnnl_concat_usm_gpu (Failed)
    231 - test_graph_unit_dnnl_conv_usm_gpu (Failed)
    233 - test_graph_unit_dnnl_convtranspose_usm_gpu (Failed)
    235 - test_graph_unit_dnnl_dequantize_usm_gpu (Failed)
    237 - test_graph_unit_dnnl_eltwise_usm_gpu (Failed)
    241 - test_graph_unit_dnnl_large_partition_usm_gpu (Failed)
    245 - test_graph_unit_dnnl_matmul_usm_gpu (Failed)
    249 - test_graph_unit_dnnl_pool_usm_gpu (Failed)
    253 - test_graph_unit_dnnl_quantize_usm_gpu (Failed)
    255 - test_graph_unit_dnnl_reduce_usm_gpu (Failed)
    257 - test_graph_unit_dnnl_reorder_usm_gpu (Failed)
    261 - test_graph_unit_dnnl_softmax_usm_gpu (Failed)
    274 - test_benchdnn_modeC_concat_ci_gpu (Failed)
    276 - test_benchdnn_modeC_conv_gpu_ci_gpu (Failed)
    278 - test_benchdnn_modeC_deconv_ci_gpu (Failed)
    280 - test_benchdnn_modeC_eltwise_ci_gpu (Failed)
    284 - test_benchdnn_modeC_graph_ci_gpu (Subprocess aborted)
    286 - test_benchdnn_modeC_ip_ci_gpu (Failed)
    292 - test_benchdnn_modeC_matmul_ci_gpu (Failed)
    294 - test_benchdnn_modeC_pool_ci_gpu (Failed)
    300 - test_benchdnn_modeC_reorder_ci_gpu (Failed)
    305 - test_benchdnn_modeC_gru_ci_gpu (SEGFAULT)
    306 - test_benchdnn_modeC_lstm_ci_gpu (SEGFAULT)
    307 - test_benchdnn_modeC_rnn_ci_gpu (SEGFAULT)
    312 - test_benchdnn_modeC_self_ci_gpu (Failed)
    314 - test_benchdnn_modeC_shuffle_ci_gpu (Failed)
    316 - test_benchdnn_modeC_softmax_ci_gpu (Failed)
    318 - test_benchdnn_modeC_sum_ci_gpu (Failed)

88 GPU tests passed, so again, more working than not, but still not really working.

vpirogov commented 1 week ago

Intel(R) UHD Graphics 630 support was discontinued, and the last driver update was published at the end of 2022. oneDNN dropped support for GEN9 in the v3.4 release. Looks like we neglected to drop GEN9 from the ISA list though.

Trying your patch on a newer architecture (Xe-HPC), I see 'could not create a primitive' errors for some tests. It looks like an empty ISA list results in issues with platform detection and/or kernel dispatching. If you want to make DNNL_ENABLE_PRIMITIVE_GPU_ISA=NONE work, additional implementation changes would likely be needed.

densamoilov commented 12 hours ago

@nwnk,

> The build documentation claims that generic OpenCL kernels are always available.

The documentation doesn't claim that. It says that the ONEDNN_ENABLE_PRIMITIVE_GPU_ISA knob controls the implementations based on just-in-time kernel generation, and that the OpenCL-based kernels and implementations are always available. That doesn't imply that the OpenCL kernels are generic, even though some of them may be.

If there is a need to introduce generic OpenCL kernels, then I believe the best way to do that would be to introduce a generic GPU vendor (ONEDNN_GPU_VENDOR=GENERIC). We have a plan to do that for the SYCL GPU runtime.

The ONEDNN_ENABLE_PRIMITIVE_GPU_ISA knob should be used to control implementations within a particular vendor if there is such a need.