oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

GPU tests pass when they probably shouldn't #1961

Open nwnk opened 1 month ago

nwnk commented 1 month ago

Using the oneAPI 2024.1 release, build the SYCL CPU and GPU backends. Ensure that no SYCL devices are available on the system. Then run ctest:

% export OCL_ICD_VENDORS=/dev/null
% sudo dnf -y remove oneapi-level-zero >& /dev/null
% sycl-ls | wc -l
0
% ctest >& test-broken-gpu.log
% grep 'gpu.*Passed' test-broken-gpu.log
  2/453 Test   #2: gpu-bnorm-u8-via-binary-postops-cpp ......................   Passed    0.02 sec
  4/453 Test   #4: gpu-cnn-inference-f32-cpp ................................   Passed    0.02 sec
  6/453 Test   #6: gpu-cnn-inference-int8-cpp ...............................   Passed    0.02 sec
  8/453 Test   #8: gpu-cnn-training-bf16-cpp ................................   Passed    0.02 sec
 10/453 Test  #10: gpu-cnn-training-f32-cpp .................................   Passed    0.02 sec
 13/453 Test  #13: gpu-getting-started-cpp ..................................   Passed    0.02 sec
 15/453 Test  #15: gpu-graph-sycl-getting-started-cpp .......................   Passed    0.02 sec
 17/453 Test  #17: gpu-graph-sycl-single-op-partition-cpp ...................   Passed    0.02 sec
 19/453 Test  #19: gpu-matmul-perf-cpp ......................................   Passed    0.02 sec
 21/453 Test  #21: gpu-memory-format-propagation-cpp ........................   Passed    0.02 sec
 23/453 Test  #23: gpu-performance-profiling-cpp ............................   Passed    0.02 sec
 25/453 Test  #25: gpu-primitives-augru-cpp .................................   Passed    0.02 sec
 27/453 Test  #27: gpu-primitives-batch-normalization-cpp ...................   Passed    0.02 sec
 29/453 Test  #29: gpu-primitives-binary-cpp ................................   Passed    0.02 sec
 31/453 Test  #31: gpu-primitives-concat-cpp ................................   Passed    0.02 sec
 33/453 Test  #33: gpu-primitives-convolution-cpp ...........................   Passed    0.02 sec
 35/453 Test  #35: gpu-primitives-eltwise-cpp ...............................   Passed    0.02 sec
 37/453 Test  #37: gpu-primitives-group-normalization-cpp ...................   Passed    0.02 sec
 39/453 Test  #39: gpu-primitives-inner-product-cpp .........................   Passed    0.02 sec
 41/453 Test  #41: gpu-primitives-layer-normalization-cpp ...................   Passed    0.02 sec
 43/453 Test  #43: gpu-primitives-lbr-gru-cpp ...............................   Passed    0.02 sec
 45/453 Test  #45: gpu-primitives-lrn-cpp ...................................   Passed    0.02 sec
 47/453 Test  #47: gpu-primitives-lstm-cpp ..................................   Passed    0.02 sec
 49/453 Test  #49: gpu-primitives-matmul-cpp ................................   Passed    0.02 sec
 51/453 Test  #51: gpu-primitives-pooling-cpp ...............................   Passed    0.02 sec
 53/453 Test  #53: gpu-primitives-prelu-cpp .................................   Passed    0.02 sec
 55/453 Test  #55: gpu-primitives-reduction-cpp .............................   Passed    0.02 sec
 57/453 Test  #57: gpu-primitives-reorder-cpp ...............................   Passed    0.02 sec
 59/453 Test  #59: gpu-primitives-resampling-cpp ............................   Passed    0.02 sec
 61/453 Test  #61: gpu-primitives-shuffle-cpp ...............................   Passed    0.02 sec
 63/453 Test  #63: gpu-primitives-softmax-cpp ...............................   Passed    0.02 sec
 65/453 Test  #65: gpu-primitives-sum-cpp ...................................   Passed    0.02 sec
 67/453 Test  #67: gpu-primitives-vanilla-rnn-cpp ...........................   Passed    0.02 sec
 69/453 Test  #69: gpu-rnn-training-f32-cpp .................................   Passed    0.02 sec
 71/453 Test  #71: gpu-sycl-interop-buffer-cpp ..............................   Passed    0.02 sec
 73/453 Test  #73: gpu-sycl-interop-usm-cpp .................................   Passed    0.02 sec
 75/453 Test  #75: gpu-tutorials-matmul-inference-int8-matmul-cpp ...........   Passed    0.02 sec
 77/453 Test  #77: gpu-tutorials-matmul-weights-decompression-matmul-cpp ....   Passed    0.02 sec
233/453 Test #233: test_rnn_forward_gpu .....................................   Passed    0.02 sec
235/453 Test #235: test_rnn_forward_buffer_gpu ..............................   Passed    0.02 sec
249/453 Test #249: test_convolution_format_any_gpu ..........................   Passed    0.02 sec
251/453 Test #251: test_convolution_format_any_buffer_gpu ...................   Passed    0.02 sec
379/453 Test #379: test_graph_unit_dnnl_mqa_decomp_usm_gpu ..................   Passed    0.02 sec
391/453 Test #391: test_graph_unit_dnnl_sdp_decomp_usm_gpu ..................   Passed    0.02 sec
395/453 Test #395: test_graph_unit_dnnl_typecast_usm_gpu ....................   Passed    0.02 sec

If you try the same trick with the OCL backend, more tests fail as they should, but a few still pass unexpectedly (xpass):

% grep 'gpu.*Passed' test-broken-gpu.log
120/322 Test #120: test_rnn_forward_gpu .....................................   Passed    0.01 sec
121/322 Test #121: test_rnn_forward_buffer_gpu ..............................   Passed    0.01 sec
132/322 Test #132: test_convolution_format_any_gpu ..........................   Passed    0.01 sec
133/322 Test #133: test_convolution_format_any_buffer_gpu ...................   Passed    0.01 sec
247/322 Test #247: test_graph_unit_dnnl_mqa_decomp_usm_gpu ..................   Passed    0.01 sec
259/322 Test #259: test_graph_unit_dnnl_sdp_decomp_usm_gpu ..................   Passed    0.01 sec
263/322 Test #263: test_graph_unit_dnnl_typecast_usm_gpu ....................   Passed    0.01 sec

vpirogov commented 1 month ago

@nwnk, this behavior is expected. Since there's no guarantee that a GPU is present, GPU tests report a pass if no devices are available.

nwnk commented 4 weeks ago

> @nwnk, this behavior is expected. Since there's no guarantee that a GPU is present, GPU tests report a pass if no devices are available.

If that were really expected, I would expect it to be consistent. In that OCL build, 7 gpu tests passed, but 156 failed. I don't understand why those seven ought to be different.
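The pass/fail split quoted above can be tallied straight from the ctest log. This is a minimal sketch, assuming result lines in the format shown in the logs above and ctest's `***Failed` marker on failing lines; the helper name `count_gpu_results` is hypothetical and not part of oneDNN or ctest.

```shell
# Hypothetical helper: tally GPU-named tests in a ctest log by outcome.
# Assumes lines like "  2/453 Test   #2: gpu-foo ....   Passed    0.02 sec"
# and that failing tests are marked "***Failed" by ctest.
count_gpu_results() {
    log="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    passed=$(grep -c 'Test *#[0-9]*: .*gpu.*Passed' "$log" || true)
    failed=$(grep -c 'Test *#[0-9]*: .*gpu.*\*\*\*Failed' "$log" || true)
    echo "passed=$passed failed=$failed"
}
```

Usage would be `count_gpu_results test-broken-gpu.log`. Other ctest outcomes (timeouts, exceptions, tests not run) are not counted by this sketch.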

vpirogov commented 4 weeks ago

Good point. I missed the fact that some tests still fail. Let me try to reproduce it.

Do you see anything useful in the output of the failed tests?

densamoilov commented 2 weeks ago

@nwnk,

> If that were really expected

It is expected for the examples, but not for the tests (gtests and benchdnn). We only include the examples in the binary releases (read: oneAPI releases), so we don't want them to fail on systems that don't have GPUs.

There is one example that fails (gpu_opencl_interop), but that failure is a bug in the error handling mechanism.
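The "pass when no device is available" behavior described for the examples can be illustrated with a small shell wrapper. This is only a sketch of the pattern, not oneDNN's actual implementation; `gpu_device_count` and `run_gpu_example` are hypothetical names, and counting `sycl-ls` output lines follows the reproduction steps at the top of this issue.

```shell
# Hypothetical: number of visible SYCL devices, counted as lines of sycl-ls
# output (0 if sycl-ls is missing or lists nothing).
gpu_device_count() {
    sycl-ls 2>/dev/null | wc -l
}

# Hypothetical wrapper: run the given command only when a device is visible.
# With no device it returns 0, which ctest records as "Passed" by default.
run_gpu_example() {
    if [ "$(gpu_device_count)" -eq 0 ]; then
        echo "no GPU device found; skipping: $*"
        return 0
    fi
    "$@"
}
```

A wrapper like this exits successfully almost immediately on a machine without devices, which would be consistent with the 0.02 sec "Passed" entries in the logs above. It also shows why the result says nothing about whether the GPU code path works.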