oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

ctest test_concurrency intermittent error on AArch64 with gcc #1690

Open AmyWignall-arm opened 1 year ago

AmyWignall-arm commented 1 year ago

Summary

The ctest test_concurrency fails intermittently with the following error:

43: Test command: /home/amywig01/oneDNN/naclBuild/tests/gtests/test_concurrency
43: Test timeout computed to be: 10000000
43: Note: Google Test filter = *:-*_GPU*
43: [==========] Running 1 test from 1 test suite.
43: [----------] Global test environment set-up.
43: [----------] 1 test from test_concurrency_t
43: [ RUN      ] test_concurrency_t.Basic
43: test_concurrency: pthread_mutex_lock.c:117: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
    Test #43: test_concurrency .................Child aborted***Exception:   0.32 sec

This happens for both reference and ACL builds. It usually fails within 10 to 20 runs; in my example runs it always failed within 50 runs.

Version

onednn_verbose,info,oneDNN v3.2.0 (commit 1f428df708d943b2fb1bcb4c7f7e209cafaa7d22)

Environment

cpu: m6g.16xlarge:

$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        243.75
L1d cache:                       4 MiB
L1i cache:                       4 MiB
L2 cache:                        64 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0-63
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; CSV2, BHB
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

It also fails with the same error on cpu c6g.16xlarge:

$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Stepping:                        r1p1
BogoMIPS:                        2100.00
L1d cache:                       4 MiB
L1i cache:                       4 MiB
L2 cache:                        64 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0-63
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; CSV2, BHB
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng

Steps to reproduce

ctest -VV -R test_concurrency --repeat-until-fail 100
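
To put a number on "how many runs until it fails", the test binary can also be run directly in a loop instead of through ctest. The following is only a rough sketch: the helper name first_failure is made up for illustration and is not part of oneDNN or ctest. It treats any non-zero exit status as a failure, which includes the abort raised by the glibc assertion.

```python
import subprocess


def first_failure(cmd, max_runs=100):
    """Run cmd up to max_runs times.

    Returns the 1-based index of the first run that exits with a
    non-zero status (a process killed by a signal such as SIGABRT
    reports a negative return code), or None if every run passes.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,  # discard gtest output
            stderr=subprocess.DEVNULL,  # discard the assertion message
        )
        if result.returncode != 0:
            return run
    return None
```

For example, first_failure(["/home/amywig01/oneDNN/naclBuild/tests/gtests/test_concurrency"], max_runs=100) would return the run number of the first abort, or None if all 100 runs pass.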

Observed behavior

$ ctest -VV -R test_concurrency --repeat-until-fail 100
...
43: Test command: /home/amywig01/oneDNN/naclBuild/tests/gtests/test_concurrency
43: Test timeout computed to be: 10000000
43: Note: Google Test filter = *:-*_GPU*
43: [==========] Running 1 test from 1 test suite.
43: [----------] Global test environment set-up.
43: [----------] 1 test from test_concurrency_t
43: [ RUN      ] test_concurrency_t.Basic
43: test_concurrency: pthread_mutex_lock.c:117: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
    Test #43: test_concurrency .................Child aborted***Exception:   0.34 sec

Expected behavior

Test passes

michalowski-arm commented 1 day ago

As far as I can tell, this is no longer an issue. It seems to have been fixed by commit 557f3f0; I was not able to reproduce the error after it.