halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

intermittent failures of mul_div_mod with host-opencl #5634

Open steven-johnson opened 3 years ago

steven-johnson commented 3 years ago

There have been sporadic failures of correctness_mul_div_mod on the linux buildbots in the past week or two, but only when running with target=host-opencl. So far, no one has been able to reproduce it locally; more strangely, it doesn't seem to repeat even when logging into a buildbot that showed the failure and re-running the test manually. Neither the injection point nor the cause is clear. It seems to happen on both the new linuxbots (with good GPUs) and the old ones (with crappy ones). Opening this issue for now; I will add instances of it occurring in comments.

steven-johnson commented 3 years ago

(One of the linuxbots, not sure which)

FAILED TEST: mul_div_mod
Testing mul vector_width: 1
Testing mul vector_width: 2
Testing mul vector_width: 4
Testing mul vector_width: 8
Testing mul vector_width: 16
Testing div_mod vector_width: 1
Testing div_mod vector_width: 2
Testing div_mod vector_width: 4
Testing div_mod vector_width: 8
Testing div_mod vector_width: 16
Failure!
0*-2147483648 -> -2147483648 != 0
Compiled a*b != simplified a*b: 0*-2147483648 = -2147483648 != 0
0*2147483647 -> 2147464527 != 0
Compiled a*b != simplified a*b: 0*2147483647 = 2147464527 != 0
65535*156429877 -> 946612400 != -454946357
23964*-1339975209 -> -1705444715 != -1990403580
47774*672691988 -> 1426553552 != -2053241256
52075*-387718748 -> 419095856 != 187456396
53663*975805134 -> -2138888350 != 389633010
19650*418751201 -> 1851754037 != -696239486
0*-2147483648 -> -2147483648 != 0
Compiled a*b != simplified a*b: 0*-2147483648 = -2147483648 != 0
0*2147483647 -> 2147464527 != 0
Compiled a*b != simplified a*b: 0*2147483647 = 2147464527 != 0
65535*156429877 -> 946612400 != -454946357
23964*-1339975209 -> -1705444715 != -1990403580
47774*672691988 -> 1426553552 != -2053241256
52075*-387718748 -> 419095856 != 187456396
53663*975805134 -> -2138888350 != 389633010
19650*418751201 -> 1851754037 != -696239486
make: *** [/home/halidenightly/build_bot/worker/x86-64-linux-testbranch-11-make/halide-source/Makefile:1820: quiet_correctness_mul_div_mod] Error 1
make: *** Waiting for unfinished jobs....

steven-johnson commented 3 years ago

On linux-worker-1:

32/382 Test #230: correctness_mul_div_mod ..................................***Failed  Required regular expression not found. Regex=[Success!
] 12.59 sec
Testing mul vector_width: 1
Testing mul vector_width: 2
Testing mul vector_width: 4
Testing mul vector_width: 8
Testing mul vector_width: 16
Testing div_mod vector_width: 1
Testing div_mod vector_width: 2
Testing div_mod vector_width: 4
Testing div_mod vector_width: 8
Testing div_mod vector_width: 16
-2147483648*-2147483648 -> -2147483648 != 0
-2147483648*2147483647 -> -307452912 != -2147483648
-246045192*156429877 -> -1334935948 != 2021125208
1366973852*-1339975209 -> 692734145 != -523052540
-2026456418*672691988 -> 1137444616 != 802555480
1864878955*-387718748 -> 545177640 != -2027398260
549769631*975805134 -> 224607668 != 1559319538
-1196340030*418751201 -> -633749956 != 739326594
1936335261*1523969341 -> 978876867 != -351708311
2056150205*-798718768 -> 116646400 != -984560240
Failure!

steven-johnson commented 3 years ago

Another failure, on linux-bot-1:

 28/382 Test #230: correctness_mul_div_mod ..................................***Failed  Required regular expression not found. Regex=[Success!
] 23.31 sec
Testing mul vector_width: 1
Testing mul vector_width: 2
Testing mul vector_width: 4
Testing mul vector_width: 8
Testing mul vector_width: 16
Testing div_mod vector_width: 1
Testing div_mod vector_width: 2
Testing div_mod vector_width: 4
Testing div_mod vector_width: 8
Testing div_mod vector_width: 16
0*-2147483648 -> -2147483648 != 0
Compiled a*b != simplified a*b: 0*-2147483648 = -2147483648 != 0
0*2147483647 -> 2147464527 != 0
Compiled a*b != simplified a*b: 0*2147483647 = 2147464527 != 0
65535*156429877 -> 946612400 != -454946357
23964*-1339975209 -> -1705444715 != -1990403580
47774*672691988 -> 1426553552 != -2053241256
52075*-387718748 -> 419095856 != 187456396
53663*975805134 -> -2138888350 != 389633010
19650*418751201 -> 1851754037 != -696239486
Failure!

steven-johnson commented 3 years ago

Latest failure, from https://buildbot.halide-lang.org/master/#/builders/74/builds/8/steps/20/logs/stdio

....................................................................................................................................................................................................................................................................
FAILED TEST: mul_div_mod
Testing mul vector_width: 1
Testing mul vector_width: 2
Testing mul vector_width: 4
Testing mul vector_width: 8
Testing mul vector_width: 16
Testing div_mod vector_width: 1
Testing div_mod vector_width: 2
Testing div_mod vector_width: 4
Testing div_mod vector_width: 8
Testing div_mod vector_width: 16
Failure!
0*65535 -> 1 != 0
Compiled a*b != simplified a*b: 0*65535 = 1 != (uint16)0
65535*60981 -> 38649 != 4555
Compiled a*b != simplified a*b: 65535*60981 = 38649 != (uint16)4555
23964*39383 -> 45713 != 55812
Compiled a*b != simplified a*b: 23964*39383 = 45713 != (uint16)55812
47774*30484 -> 39312 != 1624
Compiled a*b != simplified a*b: 47774*30484 = 39312 != (uint16)1624
52075*57764 -> 45328 != 23436
Compiled a*b != simplified a*b: 52075*57764 = 45328 != (uint16)23436
div_mod failure for t=target(x86-64-linux-avx-avx2-f16c-fma-jit-opencl-sse41) w=2 scheduling=1:
(a/b)*b + a%b != a; a, b = 0, 1; q, r = 65535, 65535
div_mod failure for t=target(x86-64-linux-avx-avx2-f16c-fma-jit-opencl-sse41) w=2 scheduling=1:
Compiled a/b != simplified a/b: 0/1 = 65535 != (uint16)0
div_mod failure for t=target(x86-64-linux-avx-avx2-f16c-fma-jit-opencl-sse41) w=2 scheduling=1:
(a/b)*b + a%b != a; a, b = 5761, 1; q, r = 65535, 65535
div_mod failure for t=target(x86-64-linux-avx-avx2-f16c-fma-jit-opencl-sse41) w=2 scheduling=1:
Compiled a/b != simplified a/b: 5761/1 = 65535 != (uint16)5761
div_mod failure for t=target(x86-64-linux-avx-avx2-f16c-fma-jit-opencl-sse41) w=2 scheduling=1:
(a/b)*b + a%b != a; a, b = 3981, 1; q, r = 65535, 65535
div_mod failure for t=target(x86-64-linux-avx-avx2-f16c-fma-jit-opencl-sse41) w=2 scheduling=1:
Compiled a/b != simplified a/b: 3981/1 = 65535 != (uint16)3981
div_mod failure for t=target(x86-64-linux-avx-avx2-f16c-fma-jit-opencl-sse41) w=2 scheduling=1:
(a/b)*b + a%b != a; a, b = 505, 1; q, r = 65535, 65535
div_mod failure for t=target(x86-64-linux-avx-avx2-f16c-fma-jit-opencl-sse41) w=2 scheduling=1:
Compiled a/b != simplified a/b: 505/1 = 65535 != (uint16)505
div_mod failure for t=target(x86-64-linux-avx-avx2-f16c-fma-jit-opencl-sse41) w=2 scheduling=1:
(a/b)*b + a%b != a; a, b = 372, 1; q, r = 65535, 65535
div_mod failure for t=target(x86-64-linux-avx-avx2-f16c-fma-jit-opencl-sse41) w=2 scheduling=1:
Compiled a/b != simplified a/b: 372/1 = 65535 != (uint16)372

abadams commented 3 years ago

I was wondering about precision and overflow, but how the hell do you get 0*65535 == 1 from either of those issues?
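
To make that point concrete, here is a minimal sketch (not part of the Halide test; `wrap_mul`, `check`, and `CASES` are names made up for illustration) that recomputes a few products from the logs above with ordinary fixed-width wrap-around. It assumes the "expected" column in the logs is the wrapped product in the operand type (uint16 or int32).

```python
def wrap_mul(a, b, bits, signed):
    """Multiply a*b and wrap the result to a fixed-width integer."""
    mask = (1 << bits) - 1
    r = (a * b) & mask  # keep only the low `bits` bits (two's complement)
    if signed and r >= 1 << (bits - 1):
        r -= 1 << bits  # reinterpret the top bit as the sign bit
    return r


# (a, b, observed, expected, bits, signed), copied from the logs above.
CASES = [
    (0, 65535, 1, 0, 16, False),
    (65535, 60981, 38649, 4555, 16, False),
    (23964, 39383, 45713, 55812, 16, False),
    (0, -2147483648, -2147483648, 0, 32, True),
    (65535, 156429877, 946612400, -454946357, 32, True),
]


def check(cases):
    """Return (matches_expected, matches_observed) for each case."""
    results = []
    for a, b, observed, expected, bits, signed in cases:
        wrapped = wrap_mul(a, b, bits, signed)
        results.append((wrapped == expected, wrapped == observed))
    return results


if __name__ == "__main__":
    for case, (ok_expected, ok_observed) in zip(CASES, check(CASES)):
        print(case[:2], "expected ok:", ok_expected, "observed ok:", ok_observed)
```

For every case listed, the wrapped product equals the expected value and never the observed one, and anything multiplied by zero wraps to zero at any width. So plain overflow or truncation of the product does not account for these outputs.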

dsharletg commented 3 years ago

I think the problem might be with the address arithmetic/buffer pointers, rather than the arithmetic itself. Maybe either the inputs or outputs are getting read/written to/from the wrong place, or not getting read or written correctly at all.
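
One way to probe this hypothesis, sketched here under assumptions: if a full expected output buffer and the observed buffer were both dumped on a failing run, each wrong element could be labelled by whether its value occurs elsewhere in the expected buffer (consistent with data landing at the wrong address) or nowhere at all (consistent with a read or write that never happened). The helper `classify_mismatches` and the sample arrays are hypothetical and are not taken from the Halide test or the logs.

```python
def classify_mismatches(expected, observed):
    """Label each wrong output element as 'misplaced' or 'unexplained'.

    'misplaced' means the observed value is a correct result for some
    other index; 'unexplained' means it appears nowhere in `expected`.
    """
    # Map each expected value to the indices where it should appear.
    positions = {}
    for i, v in enumerate(expected):
        positions.setdefault(v, []).append(i)

    report = []
    for i, (want, got) in enumerate(zip(expected, observed)):
        if want == got:
            continue
        sources = positions.get(got, [])
        kind = "misplaced" if sources else "unexplained"
        report.append((i, want, got, kind, sources))
    return report


# Hypothetical data, for illustration only.
expected = [10, 20, 30, 40]
shifted = [20, 30, 40, 40]        # every element read one slot too far
garbage = [10, 65535, 30, 65535]  # some elements never written

if __name__ == "__main__":
    for row in classify_mismatches(expected, shifted):
        print(row)
    for row in classify_mismatches(expected, garbage):
        print(row)
```

A value can match another index by coincidence, so a single 'misplaced' label is weak evidence; a consistent offset across many elements would be more telling.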