smlnj / legacy

This project is the old version of Standard ML of New Jersey that continues to support older systems (e.g., 32-bit machines).
BSD 3-Clause "New" or "Revised" License

MLRISC generates illegal AMD64 instruction #332

Open zbyrn opened 2 weeks ago

zbyrn commented 2 weeks ago

Version

110.99.6.1 (Latest)

Operating System

OS Version

No response

Processor

System Component

Core system

Severity

Major

Description

On AMD64, a movabsq instruction can only move a 64-bit immediate value into a register, not into a memory location (AMD64 Manual, Volume 3, page 234). When MLRISC selects instructions for a 64-bit literal assignment, it emits a mov instruction that moves the value into the destination pseudo-register (link to source). After register allocation, if the destination pseudo-register is spilled to the stack, the destination operand is replaced with an indirect stack access, which the instruction does not support.
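A toy model makes the failure mode concrete (the tuple-based instruction representation and names here are mine, not MLRISC's): spilling rewrites an operand without changing the opcode, which is exactly how a legal register-destination movabsq becomes an illegal memory-destination one.

```python
def apply_spill(instr, spill_map):
    """Substitute spilled pseudo-registers with their stack slots.

    Toy model of the register allocator's spill rewriting: operands are
    replaced, but the opcode is left unchanged, so a movabsq whose
    destination is spilled ends up with a memory destination.
    """
    op, src, dst = instr
    return (op, spill_map.get(src, src), spill_map.get(dst, dst))

# The movabsq from the transcript: legal before spilling...
before = ("movabsq", "$-9223372036854775808", "%r572")
# ...illegal after %r572 is spilled to 16(%rsp)
after = apply_spill(before, {"%r572": "16(%rsp)"})
```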

Transcript

Truncated to show only the relevant parts:

*********************************************** 
known_chk v749(v1448[PV],v1447[PV],v1446[PV],v1445[FN],v1444[C],v1443[C],v1442[PR5],v1441[I64],v1440[I],v1439[PV]) =
   if u64<(v1441,(I64)8) [v755] then
      addu63(v1440,(I63t)1) -> v313[I]
      if u64<=(v1441,(I64)4611686018427387903) [v900] then
         trunc_64_63(v1441) -> v898[I]
         (L)v897(v1448,v1447,v1446,v1445,v1444,v313,v1439,v1443,v1442,v898)
      else
         iadd64((I64)9223372036854775808,(I64)9223372036854775808) -> v899[I64]
         (L)v897(v1448,v1447,v1446,v1445,v1444,v313,v1439,v1443,v1442,v1441)
   else
      rshiftlu64(v1441,(I63t)3) -> v317[I64]
      addu63(v1440,(I63t)1) -> v318[I]
      andbu64(v1441,(I64)7) -> v321[I64]
      if u64<=(v321,(I64)4611686018427387903) [v904] then
         trunc_64_63(v321) -> v902[I]
         (L)v901(v1448,v1447,v1446,v1445,v1444,v317,v318,v1439,v1443,v1442,v902)
      else
         iadd64((I64)9223372036854775808,(I64)9223372036854775808) -> v903[I64]
         (L)v901(v1448,v1447,v1446,v1445,v1444,v317,v318,v1439,v1443,v1442,v321)
*********************************************** 
...
[ after instruction selection ]
...
BLOCK 9 [1.000000]
    succ:     10:false[0.000000], 25:true[0.000000]
    pred:     8:false[0.000000]
    movabsq $-9223372036854775808, %r572
    movabsq $-9223372036854775808, %r573
    movq    %r573, %r571
    addq    %r572, %r571
    /* branch(1/1000) */
    jo  trap36
...
[ after register allocation ]
...
BLOCK 9 [1.985834]
    succ:     10:false[1.985337], 25:true[0.000497]
    pred:     8:false[1.985834]
    movabsq $-9223372036854775808, 16(%rsp)
    movabsq $-9223372036854775808, %r11
    movq    %r11, 88(%rsp)
    movq    16(%rsp), %r11
    addq    %r11, 88(%rsp)
    /* branch(1/1000) */
    jo  trap36

Expected Behavior

The code after register allocation should move the immediate value into a temporary register, and then move that to the stack.

movabsq $-9223372036854775808, %rax
movq    %rax, 16(%rsp)

Steps to Reproduce

This problem only happens when using the new closure converter, which typically uses more registers than the current one. I created a snapshot at my development repository [link]. Use the following commands to reproduce the error:

cd base/system/
./cmb-make && ./makeml
echo 'CM.make "./Basis/basis.cm";' | ./testml

Additional Information

I have implemented the following temporary fix, which seems to solve the issue in the short term.

diff --git a/MLRISC/amd64/mltree/amd64-gen.sml b/MLRISC/amd64/mltree/amd64-gen.sml
index 2255bcc..5967473 100644
--- a/MLRISC/amd64/mltree/amd64-gen.sml
+++ b/MLRISC/amd64/mltree/amd64-gen.sml
@@ -207,7 +207,15 @@ functor AMD64Gen (
                else mark' (I.COPY {k=CB.GP, sz=ty, src=[s], dst=[d], tmp=NONE}, an)
          | move' (ty, I.Immed 0, dst as I.Direct _, an) =
              mark' (I.binary {binOp=O.xorOp ty, src=dst, dst=dst}, an)
-         | move' (ty, src as I.Immed64 n, dst, an) = move64' (src, dst, an)
+         | move' (ty, src as I.Immed64 n, dst, an) = let
+              (* MOVABSQ cannot take an indirect memory access dst. The
+               * immediate value is moved to a tmp in case dst is spilled. *)
+              val tmp = newReg ()
+              val tmpR = I.Direct (64, tmp)
+              in
+                move64' (src, tmpR, an);
+                emitInstr (I.move {mvOp=I.MOVQ, src=tmpR, dst=dst})
+              end
          | move' (ty, src as I.ImmedLabel _ , dst as I.Direct _, an) = move64' (src, dst, an)
          | move' (ty, src as I.ImmedLabel _ , dst, an) = let
              val tmp = newReg ()

Whenever a 64-bit value that cannot be represented as a sign-extended 32-bit immediate needs to be moved, it is first moved to a temporary pseudo-register (tmp) and then moved to dst. Since the lifetime of tmp is very short, it is unlikely that tmp would be spilled to the stack, satisfying the movabsq requirement. However, I don't think this is a permanent solution, since it is not guaranteed that tmp will be allocated a register either; it is just very likely to be the case. A more disciplined fix would be to implement some kind of normalization of movabsq instructions after register allocation.

Email address

byronzhong@cs.uchicago.edu

JohnReppy commented 1 week ago

Do you know what the SML source code is that produces this issue? It looks like it is just trying to force an overflow trap.

JohnReppy commented 1 week ago

I believe that the above fix has the right idea, but that we should make it part of the move64' function, since there are other places where it might be a problem. We also need a special case for when the destination is a register.

    fun move64' (src, dst as I.Direct _, an) = mark' (move64(src, dst), an)
      | move64' (src, dst, an) = let
          (* the `MOVABSQ` instruction can only move values into registers, so
           * we need to use a temporary register here.
           *)
          val tmpR = I.Direct (64, newReg ())
          in
            move64' (src, tmpR, an);
            emitInstr (I.move {mvOp=I.MOVQ, src=tmpR, dst=dst})
          end

We can then simplify the move' function's cases for Immed64 and ImmedLabel operands.

      | move' (ty, src as I.Immed64 n, dst, an) = move64' (src, dst, an)
      | move' (ty, src as I.ImmedLabel _ , dst, an) = move64' (src, dst, an)

I've pushed these changes into the CFA repository for testing.

zbyrn commented 1 week ago

The source code that produces this issue is the standard basis file num-format64.sml at base/system/Basis/Implementation/num-format64.sml. The particular assembly code snippet is from the function wordToDec although this pattern also shows up elsewhere.

val iadd = InlineT.Int.fast_add

fun mkDigit (w : Word64.word) =
      InlineT.CharVector.sub("0123456789ABCDEF", W.toInt w)

fun wordToDec w = let
      fun f (w, n, l) = if (w < 0w10)
            then (iadd(n, 1), (mkDigit w) :: l)
            else let val j = w div 0w10
                 in
                   f (j, iadd(n, 1), mkDigit(w - 0w10*j) :: l)
                 end
      in
        f (w, 0, [])
      end

The deliberate overflow trigger is generated by the cpsopt-code phase. In mkDigit, W.toInt needs to perform a checked conversion from a 64-bit word to a 63-bit tagged integer. The optimization pass seems to convert it into an explicit range check.

[Before lowering]

std wordToDec713(v761[C],v193[I64]) =
 known_tail f763(v373[I64],v374[I],v375[PV]) =
    if u64<(v373,(I64)10) [v769] then
       addu63(v374,(I63t)1) -> v342[I]
>      testu_64_63(v373) -> v548[I]
       numsubscriptvu8("0123456789ABCDEF",v548) -> v549[I]
       {v549,v375} -> v195
       v761(v342,v195)
    else
       quotu64(v373,(I64)10) -> j350[I64]
       addu63(v374,(I63t)1) -> v357[I]
       mulu64((I64)10,j350) -> v361[I64]
       subu64(v373,v361) -> v360[I64]
>      testu_64_63(v360) -> v551[I]
       numsubscriptvu8("0123456789ABCDEF",v551) -> v552[I]
       {v552,v375} -> v196
       f763(j350,v357,v196)
 f763(v193,(I63t)0,(I63t)0)

[After cpsopt-code ...]

std wordToDec713(v761[C],v193[I64]) =
 known_tail f763(v373[I64],v374[I],v375[PV]) =
    if u64<(v373,(I64)10) [v769] then
       addu63(v374,(I63t)1) -> v342[I]
       std_cont v905(v548[I]) =
          numsubscriptvu8("0123456789ABCDEF",v548) -> v549[I]
          {v549,v375} -> v195
          v761(v342,v195)
>      if u64<=(v373,(I64)4611686018427387903) [v908] then
>         trunc_64_63(v373) -> v906[I]
>         v905(v906)
>      else
>         iadd64((I64)9223372036854775808,(I64)9223372036854775808) -> v907[I64]
>         v905(v373)
    else
       quotu64(v373,(I64)10) -> j350[I64]
       addu63(v374,(I63t)1) -> v357[I]
       mulu64((I64)10,j350) -> v361[I64]
       subu64(v373,v361) -> v360[I64]
       std_cont v909(v551[I]) =
          numsubscriptvu8("0123456789ABCDEF",v551) -> v552[I]
          {v552,v375} -> v196
          f763(j350,v357,v196)
>      if u64<=(v360,(I64)4611686018427387903) [v912] then
>         trunc_64_63(v360) -> v910[I]
>         v909(v910)
>      else
>         iadd64((I64)9223372036854775808,(I64)9223372036854775808) -> v911[I64]
>         v909(v360)
 f763(v193,(I63t)0,(I63t)0)
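The semantics of the inserted check are simple; here is a sketch in Python (the function and constant names are mine, not the compiler's):

```python
# 2**62 - 1: the largest unsigned value representable in a 63-bit tagged int
TAGGED_MAX = 4611686018427387903

def checked_u64_to_63(w):
    """Model of the explicit check cpsopt-code emits for W.toInt:
    truncate when the value fits, otherwise force an overflow trap.
    The compiled code triggers the trap via iadd64(2^63, 2^63), which
    deliberately overflows a signed 64-bit add.
    """
    if w <= TAGGED_MAX:
        return w              # trunc_64_63: the value fits
    raise OverflowError       # stands in for the overflow trap
```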

zbyrn commented 1 week ago

The proposed fix doesn't solve the issue because register allocation occurs after instruction selection. During instruction selection, all intermediate values are represented as pseudo-registers, and all moves have their dst operands as I.Direct _. It is only after register allocation that we know which of these I.Direct _ stay as direct operands. I pushed a revised fix that doesn't have the special case for a register destination.

diff --git a/MLRISC/amd64/mltree/amd64-gen.sml b/MLRISC/amd64/mltree/amd64-gen.sml
index 8ba2417..8e8e0b1 100644
--- a/MLRISC/amd64/mltree/amd64-gen.sml
+++ b/MLRISC/amd64/mltree/amd64-gen.sml
@@ -199,16 +199,18 @@ functor AMD64Gen (
        (* annotate an expression and emit it *)
        fun mark (i, an) = mark' (I.INSTR i, an)
        (* annotated 64-bit move *)
-       fun move64' (src, dst as I.Direct _, an) = mark' (move64(src, dst), an)
-          | move64' (src, dst, an) = let
+       fun move64' (src as (I.Immed64 _ | I.ImmedLabel _), dst, an) = let
               (* the `MOVABSQ` instruction can only move values into registers, so
                * we need to use a temporary register here.
                *)
               val tmpR = I.Direct (64, newReg ())
               in
-                move64' (src, tmpR, an);
+                mark' (move64 (src, tmpR), an);
                 emitInstr (I.move {mvOp=I.MOVQ, src=tmpR, dst=dst})
               end
+          | move64' _ =
+              error "move64' works on immediate src. For other moves, use \
+                    \`move' instead"
        (* move with annotation *)
        fun move' (ty, dst as I.Direct (_, s), src as I.Direct (_, d), an) =
              if CB.sameColor (s, d)
JohnReppy commented 1 week ago

I think that the best solution is to modify the spilling code to use the scratch register (I.C.asmTmpR) when it encounters this situation. Really, the RA should not be spilling constants, since it can rematerialize them instead of loading from memory, but I do not remember how to make that work in MLRISC. I'll see if I can fix the spiller.

zbyrn commented 1 week ago

I agree that there is no good reason to spill a constant, and that's why I described the original patch as a temporary fix. If there is always a scratch register on x64, another approach could be to "normalize" the instruction right before it gets emitted: if the emitter is asked to emit movabsq imm, indirect, it can just emit two instructions, movabsq imm, %rax; movq %rax, indirect.
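That emitter-level normalization could look like this sketch (instructions modeled as plain tuples; MLRISC's real representation differs, and %rax stands in for the scratch register):

```python
def normalize_movabsq(instrs):
    """Rewrite `movabsq imm, mem` into `movabsq imm, %rax; movq %rax, mem`.

    Toy model of the proposed pre-emission pass: a destination is treated
    as a register iff it starts with '%'; any other destination is assumed
    to be a memory operand, which movabsq cannot take.
    """
    out = []
    for op, src, dst in instrs:
        if op == "movabsq" and not dst.startswith("%"):
            out.append(("movabsq", src, "%rax"))   # load imm64 into scratch
            out.append(("movq", "%rax", dst))      # store scratch to memory
        else:
            out.append((op, src, dst))
    return out
```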