iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0
2.57k stars 574 forks source link

Assertion running small_ukernel_vmvx_local-task tests on Windows #11077

Open ScottTodd opened 1 year ago

ScottTodd commented 1 year ago

Spotted when adding a Windows CI build: https://github.com/iree-org/iree/actions/runs/3414442364/jobs/5682427788, and then reproduced locally.

The following tests FAILED:
     42 - iree/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task (Exit code 0xc0000409)
     43 - iree/tests/e2e/matmul/e2e_matmul_direct_f32_small_ukernel_vmvx_local-task (Exit code 0xc0000409)

Logs:

40/969 Test   #43: iree/tests/e2e/matmul/e2e_matmul_direct_f32_small_ukernel_vmvx_local-task ................................Exit code 0xc0000409
***Exception:   0.70 sec
--- CALL[module.matmul_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32] ---
Matmul shape (MxKxN): 1x1x1
Assertion failed: ((0)==(_InterlockedExchangeAdd((volatile LONG*)&task->pending_dependency_count, 0))), file C:\a\iree\iree\runtime\src\iree\task\task.c, line 107

        Start  159: iree/tests/e2e/tosa_ops/check_llvm-cpu_local-task_exp.mlir
 41/969 Test   #42: iree/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task .................................Exit code 0xc0000409
***Exception:   0.72 sec
--- CALL[module.matmul_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32] ---
Matmul shape (MxKxN): 1x1x1
Assertion failed: ((0)==(_InterlockedExchangeAdd((volatile LONG*)&task->pending_dependency_count, 0))), file C:\a\iree\iree\runtime\src\iree\task\task.c, line 107
ScottTodd commented 1 year ago

Seeing more of these after https://github.com/iree-org/iree/pull/11522, @bjacob

https://github.com/iree-org/iree/actions/runs/3915922435/jobs/6694551641

bjacob commented 1 year ago

Looking now.

bjacob commented 1 year ago

11842 should silence it like the existing entries. Separately, I'll file another issue about the cryptic error. ukernels can generate helpful error statuses, but when they do, half of the times, one gets instead this weird assertion failure. That is not specific to Windows; what seems Windows to specific is that ukernels actually generate an error, triggering that. It's time to finally fix that error-obscuring issue, so we can understand what is going wrong on Windows.

bjacob commented 1 year ago

question - how can I trigger Windows CI on a PR ?

ScottTodd commented 1 year ago

question - how can I trigger Windows CI on a PR ?

You have to remove && needs.setup.outputs.ci-stage == 'postsubmit' from here: https://github.com/iree-org/iree/blob/fc610481f467d1996ad58f0105a61535a99ad593/.github/workflows/ci.yml#L161-L163

(I'm still working on making that job run on presubmit)

bjacob commented 1 year ago

Filed #11843 to investigate the cryptic assertion failure whenever a ukernel fails (non-Windows-specific). Once that's fixed, we will hopefully be able to better understand what is going wrong with ukernels on Windows.

bjacob commented 1 year ago

Fixed by #11842

ScottTodd commented 1 year ago

Fixed by #11842

The tests are still disabled though? https://github.com/openxla/iree/blob/3f1c154e45fd0f9426bbed9d0b8c9fea3db7a04c/build_tools/cmake/ctest_all.sh#L84-L88

bjacob commented 1 year ago

oh ok, i thought "get them to stop complaining" was deemed enough.

ScottTodd commented 1 year ago

No, this is tracking making the code work on Windows (or at least investigating enough to say that something in the tests themselves is different/broken on Windows).

We're very close to no tests excluded on only Windows - looks like only a few microkernel tests are excluded now.

bjacob commented 1 year ago

@ScottTodd, thanks for the explanation. Now would be a great time to retry this. Ukernel test errors are now fatal, aborting near the root cause, so a stack trace would likely allow to resolve this for good.

ScottTodd commented 1 year ago

Does this help?

      Start 48: iree/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task
48: Test command: D:\dev\projects\iree-build\tools\iree-e2e-matmul-test.exe "D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task.yaml" "--device=local-task"
48: Working Directory: D:/dev/projects/iree-build/tests/e2e/matmul
48: Environment variables: 
48:  TEST_TMPDIR=D:/dev/projects/iree-build/test_tmpdir/iree/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task_test_tmpdir
48:  IREE_BINARY_DIR=D:/dev/projects/iree-build
48:  IREE_DYLIB_EXT=.dll
48: Test timeout computed to be: 60
48: --- CALL[module.matmul_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32] ---
48: Matmul shape (MxKxN): 1x1x1
48: --- CALL[module.matmul_1x1xi8_times_1x1xi8_into_1x1xi32] ---
48: Matmul shape (MxKxN): 1x1x1
48: Summary of reruns, pinpointing how general matrices need to be to reproduce this failure:
48: 🐞 LHS:ZERO       * RHS:ZERO       + ACCUMULATOR:ZERO      
48: 
48: ABORTED; while invoking native function hal.fence.await; while calling import; 
48: [ 1]   native hal.fence.await:0 -
48: [ 0] bytecode module.matmul_1x1xi8_times_1x1xi8_into_1x1xi32:624 D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task.mlir:7:13
48:       at D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task.mlir:6:1; replaying trace file 'D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task.yaml'
 7/21 Test #48: iree/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task ..................................***Failed    0.71 sec

      Start 50: iree/tests/e2e/matmul/e2e_matmul_mmt4d_i8_small_ukernel_vmvx_local-task
50: Test command: D:\dev\projects\iree-build\tools\iree-e2e-matmul-test.exe "D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_mmt4d_i8_small_ukernel_vmvx_local-task.yaml" "--device=local-task"
50: Working Directory: D:/dev/projects/iree-build/tests/e2e/matmul
50: Environment variables: 
50:  TEST_TMPDIR=D:/dev/projects/iree-build/test_tmpdir/iree/tests/e2e/matmul/e2e_matmul_mmt4d_i8_small_ukernel_vmvx_local-task_test_tmpdir
50:  IREE_BINARY_DIR=D:/dev/projects/iree-build
50:  IREE_DYLIB_EXT=.dll
50: Test timeout computed to be: 60
50: --- CALL[module.matmul_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32] ---
50: Matmul shape (MxKxN): 1x1x1
50: Summary of reruns, pinpointing how general matrices need to be to reproduce this failure:
50: 🐞 LHS:ZERO       * RHS:ZERO       + ACCUMULATOR:ZERO      
50: 
50: D:\dev\projects\iree\tools\iree-e2e-matmul-test.c:762: ABORTED; replaying trace file 'D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_mmt4d_i8_small_ukernel_vmvx_local-task.yaml'
bjacob commented 1 year ago

Thanks, better already -- please can you (1) ensure that this is a builds with assertions enabled and (2) share full test output? (e.g. ctest ... --output_on_failure or rerun the Test command manually) ?

ScottTodd commented 1 year ago

That's all the output I see with -DCMAKE_BUILD_TYPE=RelWithDebInfo -DIREE_ENABLE_ASSERTIONS=ON, output from individual tests.

bjacob commented 1 year ago

Ha! Some unexpected error path is happening here, then. I'll get back to you with some printfs.

bjacob commented 1 year ago

Can you retry with https://github.com/openxla/iree/pull/13139 ? It will abort closer to the root cause for a range of possible things that could go wrong, so if you are able to get into a debugger you could get extra useful info.

ScottTodd commented 1 year ago
ucrtbase.dll!00007ffa205a286e() (Unknown Source:0)
iree-e2e-matmul-test.exe!iree_abort() Line 26 (d:\dev\projects\iree\runtime\src\iree\base\assert.h:26)
iree-e2e-matmul-test.exe!iree_status_abort(iree_status_handle_t * status) Line 588 (d:\dev\projects\iree\runtime\src\iree\base\status.c:588)
iree-e2e-matmul-test.exe!do_matmul_and_check_results(_iobuf * file, iree_trace_replay_t * replay, iree_vm_function_t function, matrix_mask_t * mask, iree_vm_list_t * original_device_inputs) Line 1042 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1042)
iree-e2e-matmul-test.exe!replay_event_call_matmul(iree_trace_replay_t * replay, yaml_document_s * document, yaml_node_s * event_node) Line 1170 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1170)
iree-e2e-matmul-test.exe!iree_e2e_matmul_test_trace_replay_event(iree_trace_replay_t * replay, yaml_document_s * document, yaml_node_s * event_node) Line 1260 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1260)
iree-e2e-matmul-test.exe!run_trace_file(iree_string_view_t root_path, _iobuf * file, iree_vm_instance_t * instance) Line 1299 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1299)
iree-e2e-matmul-test.exe!run_trace_files(int file_count, char * * file_paths, iree_vm_instance_t * instance) Line 1326 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1326)
iree-e2e-matmul-test.exe!main(int argc, char * * argv) Line 1345 (d:\dev\projects\iree\tools\iree-e2e-matmul-test.c:1345)
[Inline Frame] iree-e2e-matmul-test.exe!invoke_main() Line 78 (d:\a01\_work\43\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:78)
iree-e2e-matmul-test.exe!__scrt_common_main_seh() Line 288 (d:\a01\_work\43\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288)
kernel32.dll!00007ffa21e67604() (Unknown Source:0)
ntdll.dll!00007ffa22c026a1() (Unknown Source:0)
bjacob commented 1 year ago

OK, so the error is at iree-e2e-matmul-test.c:1042, that's the iree_vm_invoke returning a failure status.

I thought that IREE_CHECK_OK would helpfully print the failure status message -- can you see that just before the abort?

ScottTodd commented 1 year ago

All I saw before that stack trace was

--- CALL[module.matmul_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32] ---
Matmul shape (MxKxN): 1x1x1
ABORTED; while invoking native function hal.fence.await; while calling import; 
[ 1]   native hal.fence.await:0 -
[ 0] bytecode module.matmul_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32:976 D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_f32_small_ukernel_vmvx_local-task.mlir:2:13
      at D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_direct_f32_small_ukernel_vmvx_local-task.mlir:1:1
bjacob commented 1 year ago

OK, so iree_vm_invoke is producing a failure status that doesn't contain very helpful information by itself. But at least now you know that the problem is iree_vm_invoke failing at some point. Who's willing to jump into a debugger ? :-P

bjacob commented 1 year ago

I wonder if it's the async nature of the local-task driver that's resulting in the obscure error here, blaming the call to hal.fence.await that seems like a red herring. Would that be a good use for local-sync ?

bjacob commented 1 year ago

I remember in the past debugging issues like this where my best tool was finding the piece of code in IREE runtime that's meant to (and failed to actually) propagate errors from workers to main thread, and debugging on the worker thread. IIRC it was around here: https://github.com/openxla/iree/blob/c20535560e71c85ea325f3582692b020edb8b236/runtime/src/iree/task/task.c#L811-L815

ScottTodd commented 1 year ago

Yes! That gives more info

The callstack there is

iree-e2e-matmul-test.exe!iree_task_dispatch_shard_execute(iree_task_dispatch_shard_t * task, unsigned int processor_id, unsigned int worker_id, iree_byte_span_t worker_local_memory, iree_task_submission_t * pending_submission) Line 813 (d:\dev\projects\iree\runtime\src\iree\task\task.c:813)
iree-e2e-matmul-test.exe!iree_task_worker_execute(iree_task_worker_t * worker, iree_task_t * task, iree_task_submission_t * pending_submission) Line 200 (d:\dev\projects\iree\runtime\src\iree\task\worker.c:200)
iree-e2e-matmul-test.exe!iree_task_worker_pump_once(iree_task_worker_t * worker, iree_task_submission_t * pending_submission) Line 251 (d:\dev\projects\iree\runtime\src\iree\task\worker.c:251)
iree-e2e-matmul-test.exe!iree_task_worker_pump_until_exit(iree_task_worker_t * worker) Line 298 (d:\dev\projects\iree\runtime\src\iree\task\worker.c:298)
iree-e2e-matmul-test.exe!iree_task_worker_main(iree_task_worker_t * worker) Line 379 (d:\dev\projects\iree\runtime\src\iree\task\worker.c:379)
kernel32.dll!00007ffa21e67604() (Unknown Source:0)
ntdll.dll!00007ffa22c026a1() (Unknown Source:0)

With this failure status:

0x0000024fa9c06ae3 INVALID_ARGUMENT: argument/result signature mismatch
"while invoking native function vmvx.matmul.f32f32f32"
"while calling import"

Very similar to https://github.com/openxla/iree/issues/11070 then (which I also got stuck debugging)

bjacob commented 1 year ago

interesting! argument/result signature mismatch sounds debuggable. It could be caused by e.g. having baked some assumption about size of POD types that's not true on windows.

ScottTodd commented 1 month ago

Numerical results are still wrong for these two tests on Windows, and now also on Linux when using iree-compile provided by python packages instead of a source build...?

    385 - iree/tests/e2e/matmul/e2e_matmul_vmvx_dt_uk_i8_small_vmvx_local-task (Failed)
    386 - iree/tests/e2e/matmul/e2e_matmul_vmvx_dt_uk_f32_small_vmvx_local-task (Failed)

Linux logs (from some new CI builds I'm setting up): https://github.com/ScottTodd/iree/actions/runs/10083878167/job/27881805875?pr=17#step:10:678

314/710 Test  #385: iree/tests/e2e/matmul/e2e_matmul_vmvx_dt_uk_i8_small_vmvx_local-task .......................................***Failed    0.03 sec
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_1_1_1_acc_0] ---
Matmul shape (MxKxN): 1x1x1
--- TEST[matmul_accumulate_1x1xi8_times_1x1xi8_into_1x1xi32_1_1_1_acc_1] ---
Matmul shape (MxKxN): 1x1x1
--- TEST[matmul_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_1_1_1_2] ---
Matmul shape (MxKxN): 1x1x1
--- TEST[matmul_1x1xi8_times_1x1xi8_into_1x1xi32_1_1_1_3] ---
Matmul shape (MxKxN): 1x1x1
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_2_2_2_acc_4] ---
Matmul shape (MxKxN): 2x2x2
--- TEST[matmul_accumulate_2x2xi8_times_2x2xi8_into_2x2xi32_2_2_2_acc_5] ---
Matmul shape (MxKxN): 2x2x2
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_4_4_4_acc_6] ---
Matmul shape (MxKxN): 4x4x4
--- TEST[matmul_accumulate_4x4xi8_times_4x4xi8_into_4x4xi32_4_4_4_acc_7] ---
Matmul shape (MxKxN): 4x4x4
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_8_8_8_acc_8] ---
Matmul shape (MxKxN): 8x8x8
--- TEST[matmul_accumulate_8x8xi8_times_8x8xi8_into_8x8xi32_8_8_8_acc_9] ---
Matmul shape (MxKxN): 8x8x8
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_9_9_9_acc_10] ---
Matmul shape (MxKxN): 9x9x9
--- TEST[matmul_accumulate_9x9xi8_times_9x9xi8_into_9x9xi32_9_9_9_acc_11] ---
Matmul shape (MxKxN): 9x9x9
--- TEST[matmul_accumulate_DYNxDYNxi8_times_DYNxDYNxi8_into_DYNxDYNxi32_6_13_3_acc_12] ---
Matmul shape (MxKxN): 6x13x3
--- TEST[matmul_accumulate_6x13xi8_times_13x3xi8_into_6x3xi32_6_13_3_acc_13] ---
Matmul shape (MxKxN): 6x13x3

error: the actual and expected result matrices disagree at row 0, column 1.

actual value: -12
expected value: -2

left-hand side (rows 0..5 out of 0..5, columns 0..12 out of 0..12)
 2    1   -2   -1    2    0   -1   -1    0   -2    2   -2    2   
 1   -2    2    0    2    1   -2    2    1   -1   -1   -2    0   
-1    2   -1    2    0    0    0    0    0   -1    0   -1   -1   
 0    2    0   -1   -1    1   -2    1    0    1   -1   -2    0   
 1   -1    0    1    0    0    2    2   -1   -1   -1    1   -1   
 2    0   -2    1   -2   -2    2   -1    1    0   -2   -2    2   

right-hand side (rows 0..12 out of 0..12, columns 0..2 out of 0..2)
-2    0   -2   
-1   -1   -2   
 0   -2    2   
 1   -2    0   
 0   -1   -2   
 2    2    0   
 1    0   -1   
-1   -2   -1   
 0    0    0   
 2   -1    1   
 1   -2    2   
-1    0    2   
 1    1   -2   

input accumulator (rows 0..5 out of 0..5, columns 0..2 out of 0..2)
-3   -7    2   
-6    1    6   
-4    7   -7   
-5   -7    4   
 6   -6    2   
-4   -7   -6   

expected result (rows 0..5 out of 0..5, columns 0..2 out of 0..2)
 -7    -2🦄 -16🦄 
 -9    -2🦄   1🦄 
 -4     3🦄 -12🦄 
 -6    -5🦄  -2🦄 
  1🦄  -9🦄  -1🦄 
 -6🦄   1🦄 -23🦄 

actual result (rows 0..5 out of 0..5, columns 0..2 out of 0..2)
 -7   -12🐞   7🐞 
 -9    -1🐞   3🐞 
 -4     7🐞 -11🐞 
 -6    -7🐞   6🐞 
  6🐞  -6🐞   2🐞 
 -4🐞  -7🐞  -6🐞 

iree/tools/testing/e2e/iree-e2e-matmul-test.cc:532: ABORTED; while calling import; while invoking C++ function matmul_test.check_matmul_results; 
[ 1]   native matmul_test.check_matmul_results:0 -
[ 0] bytecode calls.matmul_accumulate_6x13xi8_times_13x3xi8_into_6x3xi32_6_13_3_acc_13:220 /home/runner/work/iree/iree/build-tests/tests/e2e/matmul/e2e_matmul_vmvx_dt_uk_i8_small_vmvx_local-task_calls.mlir:449:1