`reduce` performance is poor

nicowilliams commented 1 year ago

See #2679.

Tiny example to see what reduction compiles into:

jq --debug-dump-disasm -cn '[0,1,2,3,4]|reduce .[] as $i (0; if $i > . then $i else . end)'
0000 TOP
0001 LOADK [0,1,2,3,4]
0003 DUP
0004 LOADK 0                     <-------- initial reduction state
0006 STOREV $reduce:0            <-------- store the initial reduction state
0009 FORK 0045                   <-------- begin execution of the update body at the next instruction; later go to 045 to end the reduction
0011 DUPN
0012 EACH                        <-------- .[]
0013 STOREV $i:1                 <-------- as $i
0016 LOADVN $reduce:0            <-------- put the reduction state at the top of the stack
0019 DUP                         <-------- begin update expression
0020 SUBEXP_BEGIN                <-------- conditional start
0021 DUP
0022 SUBEXP_BEGIN
0023 LOADV $i:1
0026 SUBEXP_END
0027 CALL_BUILTIN _greater
0030 SUBEXP_END                  <-------- conditional end
0031 POP
0032 JUMP_F 0040
0034 POP
0035 LOADV $i:1
0038 JUMP 0041
0040 POP
0041 STOREV $reduce:0            <-------- reduction update expression end
0044 BACKTRACK                   <-------- reduction update expression end
0045 LOADVN $reduce:0            <-------- reduction end
0048 RET                         <-------- end of the whole thing

4

That's a lot of bytecodes, and that might have something to do with it, but we need to profile this with a larger example and see where the slowness is.

nicowilliams commented 1 year ago

Adding a bit of code (https://github.com/jqlang/jq/commit/018716bf2053f47b467a2be810978ba177667fbe in a branch, nicowilliams/inst-timing) to the VM to use rtdsc to get cycle counts (roughly) and print them after each instruction (well, not RET and BACKTRACK when they return from jq_next()), I get some results, but... these are polluted by my enabling tracing, so jv_dump() is being called, which slows things down.

Still, here's a taste. I'll add a new debug mode that prints not details, and another that prints per-opcode stats (perhaps just count of execs of each opcode and total cycles consumed by each opcode) so as to remove the pollution by jv_dump() and even the pollution by printf(). The pollution in the numbers below makes them pretty meaningless.

TOP 274 cycles
LOADK 586 cycles
DUP 400 cycles
LOADK 234 cycles
STOREV 1994 cycles
FORK 102 cycles
DUPN 204 cycles
EACH 1856 cycles
STOREV 1660 cycles
LOADVN 1576 cycles
DUP 140 cycles
SUBEXP_BEGIN 356 cycles
DUP 132 cycles
SUBEXP_BEGIN 132 cycles
LOADV 1222 cycles
SUBEXP_END 184 cycles
CALL_BUILTIN 2692 cycles
SUBEXP_END 152 cycles
POP 134 cycles
JUMP_F 134 cycles
POP 108 cycles
STOREV 1640 cycles
BACKTRACK 206 cycles
EACH 784 cycles
STOREV 1628 cycles
LOADVN 1202 cycles
DUP 140 cycles
SUBEXP_BEGIN 134 cycles
DUP 134 cycles
SUBEXP_BEGIN 134 cycles
LOADV 1160 cycles
SUBEXP_END 144 cycles
CALL_BUILTIN 906 cycles
SUBEXP_END 142 cycles
POP 116 cycles
JUMP_F 132 cycles
POP 108 cycles
LOADV 1142 cycles
JUMP 46 cycles
STOREV 1618 cycles
BACKTRACK 110 cycles
EACH 544 cycles
STOREV 1612 cycles
LOADVN 1150 cycles
DUP 140 cycles
SUBEXP_BEGIN 146 cycles
DUP 132 cycles
SUBEXP_BEGIN 132 cycles
LOADV 1138 cycles
SUBEXP_END 146 cycles
CALL_BUILTIN 830 cycles
SUBEXP_END 144 cycles
POP 114 cycles
JUMP_F 106 cycles
POP 100 cycles
LOADV 1130 cycles
JUMP 42 cycles
STOREV 1652 cycles
BACKTRACK 110 cycles
EACH 528 cycles
STOREV 1654 cycles
LOADVN 1160 cycles
DUP 136 cycles
SUBEXP_BEGIN 134 cycles
DUP 134 cycles
SUBEXP_BEGIN 130 cycles
LOADV 1134 cycles
SUBEXP_END 142 cycles
CALL_BUILTIN 692 cycles
SUBEXP_END 142 cycles
POP 114 cycles
JUMP_F 106 cycles
POP 104 cycles
LOADV 1124 cycles
JUMP 42 cycles
STOREV 1644 cycles
BACKTRACK 114 cycles
EACH 476 cycles
STOREV 1648 cycles
LOADVN 1164 cycles
DUP 132 cycles
SUBEXP_BEGIN 134 cycles
DUP 132 cycles
SUBEXP_BEGIN 130 cycles
LOADV 1138 cycles
SUBEXP_END 146 cycles
CALL_BUILTIN 688 cycles
SUBEXP_END 144 cycles
POP 114 cycles
JUMP_F 106 cycles
POP 102 cycles
LOADV 1128 cycles
JUMP 42 cycles
STOREV 1636 cycles
BACKTRACK 112 cycles
FORK 56 cycles
LOADVN 1122 cycles

which, sorted by count and piped to uniq -c yields:

      3 42 JUMP
      1 46 JUMP
      1 56 FORK
      1 100 POP
      1 102 FORK
      1 102 POP
      1 104 POP
      3 106 JUMP_F
      2 108 POP
      2 110 BACKTRACK
      1 112 BACKTRACK
      1 114 BACKTRACK
      3 114 POP
      1 116 POP
      2 130 SUBEXP_BEGIN
      4 132 DUP
      1 132 JUMP_F
      2 132 SUBEXP_BEGIN
      2 134 DUP
      1 134 JUMP_F
      1 134 POP
      4 134 SUBEXP_BEGIN
      1 136 DUP
      3 140 DUP
      3 142 SUBEXP_END
      3 144 SUBEXP_END
      1 146 SUBEXP_BEGIN
      2 146 SUBEXP_END
      1 152 SUBEXP_END
      1 184 SUBEXP_END
      1 204 DUPN
      1 206 BACKTRACK
      1 234 LOADK
      1 274 TOP
      1 356 SUBEXP_BEGIN
      1 400 DUP
      1 476 EACH
      1 528 EACH
      1 544 EACH
      1 586 LOADK
      1 688 CALL_BUILTIN
      1 692 CALL_BUILTIN
      1 784 EACH
      1 830 CALL_BUILTIN
      1 906 CALL_BUILTIN
      1 1122 LOADVN
      1 1124 LOADV
      1 1128 LOADV
      1 1130 LOADV
      1 1134 LOADV
      2 1138 LOADV
      1 1142 LOADV
      1 1150 LOADVN
      1 1160 LOADV
      1 1160 LOADVN
      1 1164 LOADVN
      1 1202 LOADVN
      1 1222 LOADV
      1 1576 LOADVN
      1 1612 STOREV
      1 1618 STOREV
      1 1628 STOREV
      1 1636 STOREV
      1 1640 STOREV
      1 1644 STOREV
      1 1648 STOREV
      1 1652 STOREV
      1 1654 STOREV
      1 1660 STOREV
      1 1856 EACH
      1 1994 STOREV
      1 2692 CALL_BUILTIN

nicowilliams commented 1 year ago

~Still, it seems likely that frame_local_var() is partly to blame in the LOADVN/STOREV cases.~ It's ~also~ clear that jv_cmp() is pretty slow. We should build some benchmarks of various jv functions. We might want to make jv_free() and jv_copy() inlined, too.

jqlang / jq

`reduce` performance is poor #2707