Open ScottTodd opened 1 year ago
Seeing more of these after https://github.com/iree-org/iree/pull/11522, @bjacob
https://github.com/iree-org/iree/actions/runs/3915922435/jobs/6694551641
Looking now.
question - how can I trigger Windows CI on a PR ?
question - how can I trigger Windows CI on a PR ?
You have to remove && needs.setup.outputs.ci-stage == 'postsubmit'
from here: https://github.com/iree-org/iree/blob/fc610481f467d1996ad58f0105a61535a99ad593/.github/workflows/ci.yml#L161-L163
(I'm still working on making that job run on presubmit)
Filed #11843 to investigate the cryptic assertion failure whenever a ukernel fails (non-Windows-specific). Once that's fixed, we will hopefully be able to better understand what is going wrong with ukernels on Windows.
Fixed by #11842
Fixed by #11842
The tests are still disabled though? https://github.com/openxla/iree/blob/3f1c154e45fd0f9426bbed9d0b8c9fea3db7a04c/build_tools/cmake/ctest_all.sh#L84-L88
oh ok, i thought "get them to stop complaining" was deemed enough.
No, this is tracking making the code work on Windows (or at least investigating enough to say that something in the tests themselves is different/broken on Windows).
We're very close to no tests excluded on only Windows - looks like only a few microkernel tests are excluded now.
@ScottTodd, thanks for the explanation. Now would be a great time to retry this. Ukernel test errors are now fatal, aborting near the root cause, so a stack trace would likely allow to resolve this for good.
Does this help?
Start 48: iree/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task
48: Test command: D:\dev\projects\iree-build\tools\iree-e2e-matmul-test.exe "D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task.yaml" "--device=local-task"
48: Working Directory: D:/dev/projects/iree-build/tests/e2e/matmul
48: Environment variables:
48: TEST_TMPDIR=D:/dev/projects/iree-build/test_tmpdir/iree/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task_test_tmpdir
48: IREE_BINARY_DIR=D:/dev/projects/iree-build
48: IREE_DYLIB_EXT=.dll
48: Test timeout computed to be: 60
48: --- CALL[module.matmul_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32] ---
48: Matmul shape (MxKxN): 1x1x1
48: --- CALL[module.matmul_1x1xi8_times_1x1xi8_into_1x1xi32] ---
48: Matmul shape (MxKxN): 1x1x1
48: Summary of reruns, pinpointing how general matrices need to be to reproduce this failure:
48: 🐞 LHS:ZERO * RHS:ZERO + ACCUMULATOR:ZERO
48:
48: ABORTED; while invoking native function hal.fence.await; while calling import;
48: [ 1] native hal.fence.await:0 -
48: [ 0] bytecode module.matmul_1x1xi8_times_1x1xi8_into_1x1xi32:624 D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task.mlir:7:13
48: at D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task.mlir:6:1; replaying trace file 'D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task.yaml'
7/21 Test #48: iree/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task ..................................***Failed 0.71 sec
Start 50: iree/tests/e2e/matmul/e2e_matmul_mmt4d_i8_small_ukernel_vmvx_local-task
50: Test command: D:\dev\projects\iree-build\tools\iree-e2e-matmul-test.exe "D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_mmt4d_i8_small_ukernel_vmvx_local-task.yaml" "--device=local-task"
50: Working Directory: D:/dev/projects/iree-build/tests/e2e/matmul
50: Environment variables:
50: TEST_TMPDIR=D:/dev/projects/iree-build/test_tmpdir/iree/tests/e2e/matmul/e2e_matmul_mmt4d_i8_small_ukernel_vmvx_local-task_test_tmpdir
50: IREE_BINARY_DIR=D:/dev/projects/iree-build
50: IREE_DYLIB_EXT=.dll
50: Test timeout computed to be: 60
50: --- CALL[module.matmul_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32] ---
50: Matmul shape (MxKxN): 1x1x1
50: Summary of reruns, pinpointing how general matrices need to be to reproduce this failure:
50: 🐞 LHS:ZERO * RHS:ZERO + ACCUMULATOR:ZERO
50:
50: D:\dev\projects\iree\tools\iree-e2e-matmul-test.c:762: ABORTED; replaying trace file 'D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_mmt4d_i8_small_ukernel_vmvx_local-task.yaml'
Thanks, better already -- please can you (1) ensure that this is a builds with assertions enabled and (2) share full test output? (e.g. ctest ... --output_on_failure
or rerun the Test command
manually) ?
That's all the output I see with -DCMAKE_BUILD_TYPE=RelWithDebInfo -DIREE_ENABLE_ASSERTIONS=ON
, output from individual tests.
Ha! Some unexpected error path is happening here, then. I'll get back to you with some printfs.
Can you retry with https://github.com/openxla/iree/pull/13139 ? It will abort closer to the root cause for a range of possible things that could go wrong, so if you are able to get into a debugger you could get extra useful info.
ucrtbase.dll!00007ffa205a286e() (Unknown Source:0)
iree-e2e-matmul-test.exe!iree_abort() Line 26 (d:\dev\projects\iree\runtime\src\iree\base\assert.h:26)
iree-e2e-matmul-test.exe!iree_status_abort(iree_status_handle_t * status) Line 588 (d:\dev\projects\iree\runtime\src\iree\base\status.c:588)
iree-e2e-matmul-test.exe!do_matmul_and_check_results(_iobuf * file, iree_trace_replay_t * replay, iree_vm_function_t function, matrix_mask_t * mask, iree_vm_list_t * original_device_inputs) Line 1042 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1042)
iree-e2e-matmul-test.exe!replay_event_call_matmul(iree_trace_replay_t * replay, yaml_document_s * document, yaml_node_s * event_node) Line 1170 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1170)
iree-e2e-matmul-test.exe!iree_e2e_matmul_test_trace_replay_event(iree_trace_replay_t * replay, yaml_document_s * document, yaml_node_s * event_node) Line 1260 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1260)
iree-e2e-matmul-test.exe!run_trace_file(iree_string_view_t root_path, _iobuf * file, iree_vm_instance_t * instance) Line 1299 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1299)
iree-e2e-matmul-test.exe!run_trace_files(int file_count, char * * file_paths, iree_vm_instance_t * instance) Line 1326 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1326)
iree-e2e-matmul-test.exe!main(int argc, char * * argv) Line 1345 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1345)
[Inline Frame] iree-e2e-matmul-test.exe!invoke_main() Line 78 (d:\a01\_work\43\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:78)
iree-e2e-matmul-test.exe!__scrt_common_main_seh() Line 288 (d:\a01\_work\43\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288)
kernel32.dll!00007ffa21e67604() (Unknown Source:0)
ntdll.dll!00007ffa22c026a1() (Unknown Source:0)
OK, so the error is at iree-e2e-matmul-test.c:1042
, that's the iree_vm_invoke
returning a failure status.
I thought that IREE_CHECK_OK
would helpfully print the failure status message -- can you see that just before the abort?
All I saw before that stack trace was
--- CALL[module.matmul_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32] ---
Matmul shape (MxKxN): 1x1x1
ABORTED; while invoking native function hal.fence.await; while calling import;
[ 1] native hal.fence.await:0 -
[ 0] bytecode module.matmul_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32:976 D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_f32_small_ukernel_vmvx_local-task.mlir:2:13
at D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_f32_small_ukernel_vmvx_local-task.mlir:1:1
OK, so iree_vm_invoke
is producing a failure status that doesn't contain very helpful information by itself. But at least now you know that the problem is iree_vm_invoke
failing at some point. Who's willing to jump into a debugger ? :-P
I wonder if it's the async nature of the local-task
driver that's resulting in the obscure error here, blaming the call to hal.fence.await
that seems like a red herring. Would that be a good use for local-sync
?
I remember in the past debugging issues like this where my best tool was finding the piece of code in IREE runtime that's meant to (and failed to actually) propagate errors from workers to main thread, and debugging on the worker thread. IIRC it was around here: https://github.com/openxla/iree/blob/c20535560e71c85ea325f3582692b020edb8b236/runtime/src/iree/task/task.c#L811-L815
Yes! That gives more info
The callstack there is
iree-e2e-matmul-test.exe!iree_task_dispatch_shard_execute(iree_task_dispatch_shard_t * task, unsigned int processor_id, unsigned int worker_id, iree_byte_span_t worker_local_memory, iree_task_submission_t * pending_submission) Line 813 (d:\dev\projects\iree\runtime\src\iree\task\task.c:813)
iree-e2e-matmul-test.exe!iree_task_worker_execute(iree_task_worker_t * worker, iree_task_t * task, iree_task_submission_t * pending_submission) Line 200 (d:\dev\projects\iree\runtime\src\iree\task\worker.c:200)
iree-e2e-matmul-test.exe!iree_task_worker_pump_once(iree_task_worker_t * worker, iree_task_submission_t * pending_submission) Line 251 (d:\dev\projects\iree\runtime\src\iree\task\worker.c:251)
iree-e2e-matmul-test.exe!iree_task_worker_pump_until_exit(iree_task_worker_t * worker) Line 298 (d:\dev\projects\iree\runtime\src\iree\task\worker.c:298)
iree-e2e-matmul-test.exe!iree_task_worker_main(iree_task_worker_t * worker) Line 379 (d:\dev\projects\iree\runtime\src\iree\task\worker.c:379)
kernel32.dll!00007ffa21e67604() (Unknown Source:0)
ntdll.dll!00007ffa22c026a1() (Unknown Source:0)
With this failure status:
0x0000024fa9c06ae3 INVALID_ARGUMENT: argument/result signature mismatch
"while invoking native function vmvx.matmul.f32f32f32"
"while calling import"
Very similar to https://github.com/openxla/iree/issues/11070 then (which I also got stuck debugging)
interesting! argument/result signature mismatch
sounds debuggable. It could be caused by e.g. having baked some assumption about size of POD types that's not true on windows.
Numerical results are still wrong for these two tests on Windows, and now also on Linux when using iree-compile
provided by python packages instead of a source build...?
385 - iree/tests/e2e/matmul/e2e_matmul_vmvx_dt_uk_i8_small_vmvx_local-task (Failed)
386 - iree/tests/e2e/matmul/e2e_matmul_vmvx_dt_uk_f32_small_vmvx_local-task (Failed)
Linux logs (from some new CI builds I'm setting up): https://github.com/ScottTodd/iree/actions/runs/10083878167/job/27881805875?pr=17#step:10:678
314/710 Test #385: iree/tests/e2e/matmul/e2e_matmul_vmvx_dt_uk_i8_small_vmvx_local-task .......................................***Failed 0.03 sec
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_1_1_1_acc_0] ---
Matmul shape (MxKxN): 1x1x1
--- TEST[matmul_accumulate_1x1xi8_times_1x1xi8_into_1x1xi32_1_1_1_acc_1] ---
Matmul shape (MxKxN): 1x1x1
--- TEST[matmul_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_1_1_1_2] ---
Matmul shape (MxKxN): 1x1x1
--- TEST[matmul_1x1xi8_times_1x1xi8_into_1x1xi32_1_1_1_3] ---
Matmul shape (MxKxN): 1x1x1
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_2_2_2_acc_4] ---
Matmul shape (MxKxN): 2x2x2
--- TEST[matmul_accumulate_2x2xi8_times_2x2xi8_into_2x2xi32_2_2_2_acc_5] ---
Matmul shape (MxKxN): 2x2x2
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_4_4_4_acc_6] ---
Matmul shape (MxKxN): 4x4x4
--- TEST[matmul_accumulate_4x4xi8_times_4x4xi8_into_4x4xi32_4_4_4_acc_7] ---
Matmul shape (MxKxN): 4x4x4
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_8_8_8_acc_8] ---
Matmul shape (MxKxN): 8x8x8
--- TEST[matmul_accumulate_8x8xi8_times_8x8xi8_into_8x8xi32_8_8_8_acc_9] ---
Matmul shape (MxKxN): 8x8x8
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_9_9_9_acc_10] ---
Matmul shape (MxKxN): 9x9x9
--- TEST[matmul_accumulate_9x9xi8_times_9x9xi8_into_9x9xi32_9_9_9_acc_11] ---
Matmul shape (MxKxN): 9x9x9
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_6_13_3_acc_12] ---
Matmul shape (MxKxN): 6x13x3
--- TEST[matmul_accumulate_6x13xi8_times_13x3xi8_into_6x3xi32_6_13_3_acc_13] ---
Matmul shape (MxKxN): 6x13x3
error: the actual and expected result matrices disagree at row 0, column 1.
actual value: -12
expected value: -2
left-hand side (rows 0..5 out of 0..5, columns 0..12 out of 0..12)
2 1 -2 -1 2 0 -1 -1 0 -2 2 -2 2
1 -2 2 0 2 1 -2 2 1 -1 -1 -2 0
-1 2 -1 2 0 0 0 0 0 -1 0 -1 -1
0 2 0 -1 -1 1 -2 1 0 1 -1 -2 0
1 -1 0 1 0 0 2 2 -1 -1 -1 1 -1
2 0 -2 1 -2 -2 2 -1 1 0 -2 -2 2
right-hand side (rows 0..12 out of 0..12, columns 0..2 out of 0..2)
-2 0 -2
-1 -1 -2
0 -2 2
1 -2 0
0 -1 -2
2 2 0
1 0 -1
-1 -2 -1
0 0 0
2 -1 1
1 -2 2
-1 0 2
1 1 -2
input accumulator (rows 0..5 out of 0..5, columns 0..2 out of 0..2)
-3 -7 2
-6 1 6
-4 7 -7
-5 -7 4
6 -6 2
-4 -7 -6
expected result (rows 0..5 out of 0..5, columns 0..2 out of 0..2)
-7 -2🦄 -16🦄
-9 -2🦄 1🦄
-4 3🦄 -12🦄
-6 -5🦄 -2🦄
1🦄 -9🦄 -1🦄
-6🦄 1🦄 -23🦄
actual result (rows 0..5 out of 0..5, columns 0..2 out of 0..2)
-7 -12🐞 7🐞
-9 -1🐞 3🐞
-4 7🐞 -11🐞
-6 -7🐞 6🐞
6🐞 -6🐞 2🐞
-4🐞 -7🐞 -6🐞
iree/tools/testing/e2e/iree-e2e-matmul-test.cc:532: ABORTED; while calling import; while invoking C++ function matmul_test.check_matmul_results;
[ 1] native matmul_test.check_matmul_results:0 -
[ 0] bytecode calls.matmul_accumulate_6x13xi8_times_13x3xi8_into_6x3xi32_6_13_3_acc_13:220 /home/runner/work/iree/iree/build-tests/tests/e2e/matmul/e2e_matmul_vmvx_dt_uk_i8_small_vmvx_local-task_calls.mlir:449:1
Spotted when adding a Windows CI build: https://github.com/iree-org/iree/actions/runs/3414442364/jobs/5682427788, and then reproduced locally.
Logs: