v1.0.0 causes crashes during execution

dibyendumajumdar commented 4 months ago

Ravi tests crash during execution, and I am not yet able to narrow it down. Initial investigation appears to show that if I isolate the code that fails, then it passes, but when run in the larger test program, it fails. This makes me think that its something to do with the size of the function being compiled.

I am flagging this for two reasons:

It is likely there is some bug in the code generation that needs to be fixed
I'd like some ideas on how to isolate the failure

The failure is basically causing runtime assertions to trigger in Ravi - that means some corruption somewhere.

Example of failure (there are other failures similar to this)

https://github.com/dibyendumajumdar/ravi/actions/runs/9374498129/job/25810662116

mingodad commented 4 months ago

Maybe if you have the code that was working before using this mir version and dump the generated asm and comparing then can give a clue ?

dibyendumajumdar commented 4 months ago

The master branch runs the previous version of MIR (v0.x) which works fine.

I could try dumping out the asm for both.

dibyendumajumdar commented 4 months ago

I am not sure how to generate assembly output.

I generated the MIR output from the two versions, along with the C input file. These are attached

test.mir_v1.0.txt test.mir_v0.0.txt test.c.txt

mingodad commented 4 months ago

After doing a search and replace with regular expressions on test.mir_v1.0.txt with this ):[^,]+ -> ) here is some recurrent differences I see in the diff between v0.0 and v1.0:

--- <v0.0>
+++ <v1.0>
@@ -2,16 +2,14 @@
    mov U0_base, u64:32(U0_ci)
    add U_10390, U0_base, 144
    mov U3277_ra, U_10390
-   mov i64:160(fp), 0
+   mov I3277_i, 0
    bnes    L6487, u16:8(U3277_ra), 19
 L6486:
-   mov U_10392, 160
-   add U_10392, U_10392, fp
+   addr    U_10392, I3277_i
    mov i64:(U_10392), i64:(U3277_ra)
    mov i_10393, 1
    jmp L6488
 L6487:
-   mov U_10395, 160
-   add U_10395, U_10395, fp
+   addr    U_10395, I3277_i
    call    proto15, luaV_tointegerns, i_10394, U3277_ra, U_10395, 0
    mov i_10393, i_10394

Also in a several functions the type of some variables have changed from v0.0 to v1.0:

--- <v0.0>
+++ <v1.0>
@@ -3,12 +3,14 @@
    local   i64:U0_k, i64:I_3, i64:U0_base, i64:U_4, i64:I_5, i64:U_6, i64:I_7, i64:U_8
    local   i64:I_9, i64:U_10, i64:I_11, i64:U_12, i64:I_13, i64:U_14, i64:I_15, i64:U_16
    local   i64:I_17, i64:U_18, i64:I_19, i64:U_20, i64:I_21, i64:U_22, i64:I_23, i64:U1_ra
-   local   i64:U_24, i64:i_25, i64:U_26, i64:i_27, i64:i_28, i64:U_29, i64:I_30, i64:U3_stackbase
-   local   i64:i3_wanted, i64:i_31, i64:i_32, i64:i3_available, i64:i_33, i64:i3_j, i64:i_34, i64:U5_src_reg
-   local   i64:U_35, i64:U5_dst_reg, i64:U_36, i64:I_37, i64:i_38, i64:i_39, i64:U_40, i64:U_41
-   local   i64:U_42, i64:I_43, i64:i_44, i64:i_45, i64:U_46, i64:U_47, i64:U_48, i64:U_49
-# 1 arg, 64 locals
-   alloca  fp, 176
+   local   i64:U_24, i64:I1_i, i64:i_25, i64:U_26, i64:i_27, i64:i_28, i64:U_29, i64:I_30
+   local   i64:U3_stackbase, i64:i3_wanted, i64:i_31, i64:i_32, i64:i3_available, i64:i_33, i64:i3_j, i64:i_34
+   local   i64:U5_src_reg, i64:U_35, i64:U5_dst_reg, i64:U_36, i64:I_37, i64:i_38, i64:i_39, i64:U_40
+   local   i64:U_41, i64:U_42, i64:I_43, i64:i_44, i64:i_45, i64:U_46, i64:U_47, i64:U_48
+   local   i64:U_49
+
+# 1 arg, 65 locals, 0 globals
+   alloca  fp, 160
    mov i0_raviX__error_code, 0
    mov i0_result, 0
    mov U0_ci, u64:32(U0_L)
@@ -61,17 +63,15 @@
 L15221:
    add U_24, U0_base, 0
    mov U1_ra, U_24
-   mov i64:160(fp), 0
+   mov I1_i, 0
    bnes    L15226, u16:8(U1_ra), 19
 L15225:
-   mov U_26, 160
-   add U_26, U_26, fp
+   addr    U_26, I1_i
    mov i64:(U_26), i64:(U1_ra)
    mov i_27, 1
    jmp L15227
 L15226:
-   mov U_29, 160
-   add U_29, U_29, fp
+   addr    U_29, I1_i
    call    proto15, luaV_tointegerns, i_28, U1_ra, U_29, 0
    mov i_27, i_28
 L15227:

dibyendumajumdar commented 4 months ago

thank you @mingodad

I need to find a reduced example of the issue - I will try to see if I can minimize failing tests; but if the issue is size related then it may be difficult to minimize.

mingodad commented 4 months ago

Probably what I said here Also in a several functions the type of some variables have changed from v0.0 to v1.0 is not correct but I noticed that in v0.0 there is less locals (64) than in v1.0 (65) but v0.0 call alloca fp, 176 and v1.0 call alloca fp, 160 (all the above are for __ravifunc_159: func i32, u64:U0_L).

mingodad commented 4 months ago

Running on Ubuntu-18.04 64bits with valgrind:

***** FILE 'errors.lua'*****
testing errors
==12900== Conditional jump or move depends on uninitialised value(s)
==12900==    at 0x543C8AA: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
==12900== Use of uninitialised value of size 8
==12900==    at 0x543883B: _itoa_word (_itoa.c:179)
==12900==    by 0x543BEDD: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
==12900==    at 0x5438845: _itoa_word (_itoa.c:179)
==12900==    by 0x543BEDD: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
==12900==    at 0x543BFE4: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
==12900==    at 0x543CB1C: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
+
testing stack overflow

mingodad commented 4 months ago

Here is my modified script to run the tests:

#!/bin/sh

LUA=$1
RUN_TRACEHOOK_TESTS=$2

if [ "$LUA" = "" ]
then
  echo "Please pass path to Ravi executable"
  exit 1
fi

run_lua53_tests() {
  arg=$1
  errmsg=$2
  cd lua53
  valgrind $LUA -e"$arg" all.lua
  rc=$?
  cd ..
  if [ $rc != 0 ]
  then
    echo "$errmsg"
    exit 1
  fi
}

run_ravi_tests() {
  dir=$1
  script=$2
  arg=$3
  cd $dir
  valgrind $LUA -e"$arg" $script
  rc=$?
  cd ..
  if [ $rc != 0 ]
  then
    echo "Test $script failed"
    exit 1
  fi
}

run_lua53_tests "_port=true ravi.jit(false)" "Lua53 interpreter test failed"
run_lua53_tests "_port=true ravi.auto(true)" "Lua53 auto JIT test failed"
run_lua53_tests "_port=true ravi.auto(true,1)" "Lua53 auto JIT all test failed"

run_ravi_tests language ravi_tests1.ravi "ravi.jit(false)"
run_ravi_tests language ravi_tests1.ravi "ravi.auto(true,1)"
run_ravi_tests language ravi_tests2.ravi "ravi.jit(false)"
run_ravi_tests language ravi_tests2.ravi "ravi.auto(true,1)"
run_ravi_tests language defer_tests.ravi "ravi.jit(false)"
run_ravi_tests language defer_tests.ravi "ravi.auto(true,1)"
run_ravi_tests language ravi_tests3.ravi "ravi.auto(true,1)"
run_ravi_tests language ravi_errors.ravi "ravi.auto(true,1)"
run_ravi_tests language basics.lua "ravi.auto(true,1)"
run_ravi_tests extra gaussian2.lua "ravi.auto(true,1)"
run_ravi_tests extra bittest.lua "ravi.auto(true,1)"
run_ravi_tests performance fannkuchen.ravi "ravi.auto(true,1)"
run_ravi_tests performance mandel1.ravi "ravi.auto(true,1)"
run_ravi_tests performance matmul1_ravi.lua "ravi.auto(true,1)"
run_ravi_tests performance sieve.ravi "ravi.auto(true,1)"
run_ravi_tests performance pisum.ravi "ravi.auto(true,1)"
run_ravi_tests performance md5test.lua "ravi.auto(true,1)"
run_ravi_tests comptests test.lua "ravi.jit(false)"
exit 0

dibyendumajumdar commented 4 months ago

thank you

dibyendumajumdar commented 3 months ago

One more piece of info - I wasn't setting an opt level, so it was using the default level.

Opt level 0 and 1 do not appear to have the bug. Opt level 2 fails.

dibyendumajumdar commented 3 months ago

It seems related to size somehow.

One of the failures is in a file containing many tests (the C code above is from that file); test 88 fails. If I remove first 10 tests then test 98 fails. If I remove first 20 tests then all tests pass.

dibyendumajumdar commented 3 months ago

Literally, if I remove one test, it fails after running 1 more test from the end.

dibyendumajumdar commented 3 months ago

I wonder if the source file is too large causing something to overflow; maybe an integer value is overflowing? Seems to be related to opt level 2.

rofl0r commented 3 months ago

for hard to debug stuff like this git bisect could be the right tool.

dibyendumajumdar commented 3 months ago

@rofl0r thanks for the suggestion; it may indeed by be a good idea to look back at the commits to see when it starts to fail. But the problem is that the bbv branch had many commits, and I not sure at what points in the commit history the branch was usable.

rofl0r commented 3 months ago

just go back to where it branched off. if it works there, you can tag it good, and start bisecting. the good thing about bisecting is that it needs only a few attempts to find the culprit even for hundreds or thousands of commits, because you always split the candidate commits in half.

vnmakarov commented 3 months ago

Thank you for working on this bug. Even providing the test case requires a lot of efforts. I really appreciate your involvement. Although I tested the release well but one person is not enough.

In order to work on this issue I need some standalone C program (the bug can be in C code, e.g. C undefined behavior can be used, the bug can be in c2mir compiler, although I don't see it in your comparison of different mir versions). The C program should be executable with some result which says that the program works or not. The C code can be very big. It does not matter for me. Having this program I can bisect optimizations and functions of the program and find the bug origin.

I understand that providing the standalone program requires even more efforts from you but without such program it is practically impossible to find the bug reason and fix the bug.

dibyendumajumdar commented 3 months ago

I guess the issue might be related to the size of a function; Lua code that is all part of a single script goes into a single top level function, making it large. It is hard to find such examples in regular C code.

Its impossible to create a standalone C program.

I think the best approach is as suggested by @rofl0r - but even this is a lot of effort, but I will have a go at it over the weekend. If we can narrow down the commit from when the bug started then it might help find the issue.

mingodad commented 3 months ago

But have you fixed the problems that valgrind found ?

dibyendumajumdar commented 3 months ago

@mingodad don't think those are real problems - its reporting issues with standard C functions.

I have ASAN enabled and no failures reported there - but of course I don't think ASAN covers JIT output.

dibyendumajumdar commented 3 months ago

Results from git bisect:

git bisect bad e489efb93d8199f8e6d410535005e51e198c03b6
e489efb93d8199f8e6d410535005e51e198c03b6 is the first bad commit
commit e489efb93d8199f8e6d410535005e51e198c03b6
Author: Vladimir N. Makarov <vmakarov@redhat.com>
Date:   Thu Mar 30 14:18:20 2023 -0400

    Implement semi-working (no-bootstrap) ssa combine:
      Remove assert for call results in target_machinize.
      Always check memory for non-strict in pattern_match_p.  Use op_ref instead of op in pattern_match_p and out_insn.
      Use const in get_ref_value.
      Add ssa_combine_ctx, free_ssa_edge, remove_ssa_edge_1, remove_disconnected_use_ssa_edge, merge_ssa_edge_lists.
      Remove assert change_ssa_edge_list_def.
      Remove edges only from definition in undo_build_ssa.  Add fixed_place_insn_p, use it for gvn_insn_p.
      Mave ssa code elimination and commutative_insn_code up.
      Add get_int_const, var_plus_const, var_mult_const, find_ssa_edge, update_op_combine_data, multiple_bb_def_p,
         no_mem_p, ssa_combine_op, processed_bb_insn_p, mark_bb_insn_as_processed, ssa_combine,
         init_ssa_combine, finish_ssa_combine
      Rename selection_ctx to combine_ctx.
      Call ssa_combine between conventional ssa and undo_build_ssa.

 mir-gen-x86_64.c | 145 +++++++++++-----------
 mir-gen.c        | 864 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------
 2 files changed, 752 insertions(+), 257 deletions(-)

I realize that this may be an interim issue that may have been fixed - but I would require patch to continue bisecting.

CORRECTED

dibyendumajumdar commented 3 months ago

I am going to do a different bisect run to see if I can find a later point with a different issue

dibyendumajumdar commented 3 months ago

okay so my second bisect run ignores other errors and focuses on the issue with "large" functions.

git bisect good
3f693641b509a14a1e527600ce848696ff1d626e is the first bad commit
commit 3f693641b509a14a1e527600ce848696ff1d626e
Author: Vladimir N. Makarov <vmakarov@redhat.com>
Date:   Tue Jun 13 10:23:56 2023 -0400

    Constrain conflict matrix size:
      Add MIR_MAX_COALESCE_VARS.  Add scan_collected_moves.  Call scan_collected_moves in consider_move_vars_only.
      Extract new func collect_moves from coalesce.

 mir-gen.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 50 insertions(+), 32 deletions(-)

https://github.com/vnmakarov/mir/commit/3f693641b509a14a1e527600ce848696ff1d626e

To me this looks like a bug::

  scan_vars_num = 0;
  scan_collected_moves (gen_ctx);
  return scan_vars_num > 0;

mingodad commented 3 months ago

It seems that you didn't understood the valgrind output:

***** FILE 'errors.lua'*****
testing errors
==12900== Conditional jump or move depends on uninitialised value(s)
...
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
==12900== Use of uninitialised value of size 8
...
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
...
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
==12900==    at 0x543BFE4: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
...
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
+
testing stack overflow

Valgrind is interpreting the machine code and detected use of not initialized memory in libc but that has the origin in your code just around formatt %a %A that ultimately probably is calling printf like in libc, I'm not confirm that this is the problem but they are highly correlated and I'll not be surprised if it's indeed the causing of the failure (and other not known yet bugs).

testing strings and string library
testing 'format %a %A'
............./home/runner/work/ravi/build/ravi: strings.lua:366: bad argument #1 to 'assert' (value expected)
>>> closing state <<<

stack traceback:
    [C]: in function 'assert'
    strings.lua:366: in main chunk
    [C]: in function 'dofile'
    all.lua:167: in main chunk
    [C]: in ?
Lua53 auto JIT test failed

dibyendumajumdar commented 3 months ago

It is possible that there is a separate bug (notice that the call stack has luaG_runerror - so it is already in the process of reporting an error). I will have a look at it, but right now I don't think it is related.

The code that is being reported is this:

    if (n == 0)
      luaG_runerror(L, "attempt to divide by zero");

runerror tries to get more info - and its looking for a line number; I think the bug is that this doesn't make sense when using the source to JIT compiler. I will fix this.

dibyendumajumdar commented 3 months ago

Well, I checked this, and actually what I said above does not apply in Lua tests, because they are compiled from Lua bytecodes. Because the ravicomp tests were not even run yet - I am unsure what valgrind is complaining about - but it may be that the Lua value union has some part of it uninitialized?

dibyendumajumdar commented 3 months ago

okay so my second bisect run ignores other errors and focuses on the issue with "large" functions.

git bisect good
3f693641b509a14a1e527600ce848696ff1d626e is the first bad commit
commit 3f693641b509a14a1e527600ce848696ff1d626e
Author: Vladimir N. Makarov <vmakarov@redhat.com>
Date:   Tue Jun 13 10:23:56 2023 -0400

    Constrain conflict matrix size:
      Add MIR_MAX_COALESCE_VARS.  Add scan_collected_moves.  Call scan_collected_moves in consider_move_vars_only.
      Extract new func collect_moves from coalesce.

 mir-gen.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 50 insertions(+), 32 deletions(-)

3f69364

To me this looks like a bug::

  scan_vars_num = 0;
  scan_collected_moves (gen_ctx);
  return scan_vars_num > 0;

I can confirm that reverting this commit fixes the issue.

vnmakarov commented 3 months ago

To me this looks like a bug::

  scan_vars_num = 0;
  scan_collected_moves (gen_ctx);
  return scan_vars_num > 0;

Yes, you are absolutely right. I fixed this by https://github.com/vnmakarov/mir/commit/19d4a622c525428048da221ef0282f903c4bd6f3 .

Although I have a question: how is important generated code speed for the test case on which the problem occurs?

W/o this optimization (coalescing), the generated code will be quite bad. It will have a lot of moves as for each phi result and operand a copy is added. When there are a lot vars (and consequently conflict matrix is big), I could do coalescing on live range basis. The current coalescing based on conflict graph has more quality than coalescing based on live-range uses.

dibyendumajumdar commented 3 months ago

I can confirm the fix worked, thank you for looking into it. I assume this was size related; how is this feature tested? Presumably you do not have a test case that has a large enough function to hit the limit?

Re impact, I do not know yet as I haven't measured, but its okay I think to penalize very large functions, rather than fail completely or fail at runtime. I can of course limit JIT compilation based on function size. Usually such large functions are the top level script and the performance of these are unlikely to be important.

Another question is that in v0 there was no limit, but this limit was added in v1, is that right? What was the reason for it?

vnmakarov commented 3 months ago

Another question is that in v0 there was no limit, but this limit was added in v1, is that right? What was the reason for it?

v0 finds conflicts based on live-ranges. It is difficult to evaluate complexity of the algorithm but I would say it is "close to" linear. Therefore there is no limit.

v1 finds conflicts based on conflict graph. It is more superior. For example, v0 considers p1 and p2 conflicting in the following case (it is a pretty frequent case after out-of-SSA pass):

p2 = p1
...
use p1 and p1 becomes dead
...
use p2

when v1 recognizes that there is no conflict. Complexity (in time and space) of conflict graph based algorithm can be quadratic in the worst case. Therefore I introduced a limit.

Using v0 algorithm as a backup when we hit the limit of v1 algorithm could be a solution. I'll work on it. It will be not a big addition to the MIR-generator code.

vnmakarov / mir

v1.0.0 causes crashes during execution #403