Redundant renormalize_array causing slowdown

leokaplan commented 4 years ago

Today, the following loop

for i=1,n do
        xs[y] = xs[y] + 1
end

is compiled to

for(
        lua_Integer start01 = 1, limit01 = x1, step01 = 1;
        (step01 >= 0 ? start01 <= limit01 : start01 >= limit01);
        start01 = intop(+, start01, step01)
){
x4 = start01;
{
            pallene_renormalize_array(L, x2, x3, 3);
            TValue *slot = &x2->array[x3 - 1];
            if (PALLENE_UNLIKELY(!ttisinteger(slot))) {
                pallene_runtime_tag_check_error(L,
                    3, LUA_TNUMINT, rawtt(slot),
                    "array element");
            }
            x6 = ivalue(slot);
}
x5 = intop(+, x6, 1);
{
            pallene_renormalize_array(L, x2, x3, 3);
            TValue *slot = &x2->array[x3 - 1];
            setivalue(slot, x5);
}
}

When we remove the renormalize _array to outside the loop, we have performance gains of aprox. 5x (from 200 ms to 60 ms):

pallene_renormalize_array(L, x2, x3, 3);
TValue *slot = &x2->array[x3 - 1];
for(
        lua_Integer start01 = 1, limit01 = x1, step01 = 1;
        (step01 >= 0 ? start01 <= limit01 : start01 >= limit01);
        start01 = intop(+, start01, step01)
){
x4 = start01;
{
            if (PALLENE_UNLIKELY(!ttisinteger(slot))) {
                pallene_runtime_tag_check_error(L,
                    3, LUA_TNUMINT, rawtt(slot),
                    "array element");
            }
            x6 = ivalue(slot);
}
x5 = intop(+, x6, 1);
{
            setivalue(slot, x5);
}
}

(removing the runtime tag checking didn't had a any observable impact) This optimization can only be performed in cases in which the attribution to a slot is a result of a series of operations in that same slot.

Interesting benchmark suites for this optimization are the matmul and nbody

leokaplan commented 4 years ago

To be able to perform this optimization in pallene-compile time (not in C), we would have to change the form of array operations emission. A form of doing this is altering the coder, namely the gen_cmd["GetArr"] and gen_cmd["SetArr"] functions. But given the constraint cited in the last comment, it is necessary to give some context to these functions. This context may be given via the the modification of the IR to introduce slot acquisition and slot value operations (get/set) as separate commands.

hugomg commented 4 years ago

One thing that we have to be careful about is that the renormalize operation can't always be moved. If the array grows or shrinks, it may have its memory reallocated. If that happens, any pointers to slots in the array become invalid.

Do you know how we would be able to ensure the correctness of the optimizations involving table slots? In particular, we need to be careful about aliasing. If there are two variables referencing the same table, modifying the table via one variable might invalidate slots obtained via the second variable.

leokaplan commented 4 years ago

One thing that we have to be careful about is that the renormalize operation can't always be moved. If the array grows or shrinks, it may have its memory reallocated. If that happens, any pointers to slots in the array become invalid.

I think that for now it's better to just focus on the case where a[x] = ... a[x] (the exact same slot, without the risk of growing/shrinking). Maybe I am not seeing some case but i think that it would cover the problem.

In particular, we need to be careful about aliasing. If there are two variables referencing the same table, modifying the table via one variable might invalidate slots obtained via the second variable.

A dataflow analysis can track aliasing. Doing it inter-function and inter-module is kind of a hassle but doable. We can restrict the domain of the optimization for just simple aliasing (inside a function) for simplicity (and safeness) and then improve it.

hugomg commented 4 years ago

the exact same slot, without the risk of growing/shrinking

Even in that case we still need to be careful. For example, consider a[i] = a[i] + f(a). This is equivalent to

local x = a[i]
local y = f(a)
a[i] = x + y

and the call to f can potentially resize the array between the read and the write.

We can restrict the domain of the optimization for just simple aliasing (inside a function) for simplicity (and safeness) and then improve it.

Agreed. Restricting it to a single function sounds like a good idea.

A dataflow analysis can track aliasing

Sure. But at least to me it is not obvious how to use the results in this analysis to decide what optimizations can be performed. It might be helpful to have a clear plan of this before diving into the implementation.

leokaplan commented 4 years ago

I will try to consolidate some guidelines here and maybe change it if needed:

The optimization can be done in any assignment to an array indexing. Let's call the array in question X and with the indexing i
The right hand side of the assignment is an expression. Firstly let's only analyze the expression itself. Let's assume that there was a constant propagation and constant folding pass already performed. The expression then is a combination of operations on constants and :
1. only X[i] (trivial case)
2. X[constant] (has to perform some value-range analysis to see if constant < i (and maybe if constant > 1)) not sure if this is sufficient
3. f(X) (cancel the optimization for now, later we can have inter-functional analysis)
4. X[exp with vars] (cancel the optimization for now, later we can have a more sophisticated value range analysis)
Then we have to make sure of possible false positives such as:
1. aliasing (as mentioned), that is quite simple of finding. But then we have to track the uses and definitions of all aliases.
2. calling a locally defined function, that may use of the array as an upvalue, in which we cancel the optimization.

hugomg commented 4 years ago

Is this optimization going to be performed at the IR level, as a function that takes an IR as input and outputs an optimized IR? If so, then this is already after we have translated then the right hand side of a SetArr is an ir.Value, not an expression.

Let's consider a concrete example. Suppose that the IR is modified to have an ArrSlot operation that produces a pointer to an array slot, and a pair of GetSlot and SetSlot operations to read and write to a given array slot. In this version of the IR, the array renormalization would happen inside ArrSlot.

I think that this design would allow us to optimize xs[i] = xs[i] + 1 using common subexpression elimination. In this design, the increment to the array element would be converted to the following IR:

slot1 <- ArrSlot(xs, i)
t1    <- GetSlot(slot1)
t2    <- t1 + 1
slot2 <- ArrSlot(xs, i)
         SetSlot(slot2, t2)

and we would want to optimize that to remove the redundant ArrSlot:

slot1 <- ArrSlot(xs, i)
t1    <- GetSlot(slot1)
t2    <- t1 + 1
         SetSlot(slot1, t2)

This can be modeled as a Common Subespression Elimination optimization. According to the Tiger book, the building block for a CSE is a dataflow analysis of Available Expressions, which computes for each point in the control flow graph, what expressions are "in scope".

The basic idea of the available expressions analysis is that an instruction such as x <- y op z, adds x = y op z to the set of available expressions, and removes any expressions containing x from the set of available expressions.

We could also add a rule that says that we remove any array slots from the set of available expressions after a function call, or after a write to an array. This conservative approximation should be enough for xs[i] = xs[i] + 1 but might kill the optimization in other situations. (An improvement would be to use alias analysis and/or escape analysis to be less conservative, keeping more array slots in the set of available expressions).

Some questions:

Having the array renormalization happen inside the ArrSlot doesn't allow us to move the renormalization out of the loop, which is another optimization we are interested in. How would removing redundant renormalizes happen in a world where the renormalization is separate from ArrSlot?
Can the CSE-based optimization also remove redundant ArrSlot and/or renormalize operations in m[i][j] = m[i][j] + 1?

hugomg commented 4 years ago

Other question: is this available expressions analysis that I talked about here the same thing as the Reaching Definitions from the PR?

leokaplan commented 4 years ago

While I am thinking about your first comment, I can say that RD and AE aren't the same simply because their types are different: ---------------------------------------May----------------------------- Must ---------------- Forward ----------------Reaching definitions-------------Available expressions Backward------------------Live variables------------------Very busy expressions

reference slides, slide 21 Also, is the Tiger Book the one by Andrew W. Appel? I didn't knew it .

hugomg commented 4 years ago

Thanks, that's an interesting set of slides.

Yes, the Tiger book is the Appel one. I use the original version, with ML.

leokaplan commented 4 years ago

Coming back to the questions:

Having the array renormalization happen inside the ArrSlot doesn't allow us to move the renormalization out of the loop, which is another optimization we are interested in. How would removing redundant renormalizes happen in a world where the renormalization is separate from ArrSlot?

It is quite hard to identify slot1 <- ArrSlot(xs, i) as an invariant, since the i parameter change with the loop progress. One way is to separate ArrSlot(xs,i) in two: GetSlot(xs,i) and Validate(xs) and we make a RD-like analysis to check that Validate is invariant in respect to the loop (and thus hoisting it); another way of doing it, on the same idea is to add a boolean third parameter to ArrSlot, indicating if it should perform renormalize or not. We could then set this flag based on the result of the analysis; Then I thought of another option, somewhat more naive, in which we mark internally the array if the size has changed and decide to perform renormalize based on that, adding a check at runtime, that maybe gcc could ignore. This last option has several implications, so I am not taking it too seriously, but as always, moving a static analysis to the dynamic realm should be an option...

Can the CSE-based optimization also remove redundant ArrSlot and/or renormalize operations in m[i][j] = m[i][j] + 1?

It depends on how we capture the facts. We can opt for seeing m[i][j] as a single variable or seeing it as 3 separate vars (m,i and j) (or seeing it as m[i] and j...). In the first case we get the trivial cases for free, but it's impossible to check for more complex cases. As we get more information we would have to make analysis that consider killing and generating facts based on partial values. Maybe make a number of passes based on the number of indexings? A first pass dealing with m[i][j], then one dealing with m[i] and j, etc... I believe that I've already seen some literature on the topic. I will try to bring more concrete strategies to deal with that.

hugomg commented 4 years ago

There are two things that we currently do every time we access an array in Pallene: 1) we renormalize the table to ensure that the index we are looking for is in the array part and 2) we get the TValue * corresponding to the array slot we are interested in.

There are also two optimizations that we have been thinking about here, one for each of those two operations. The first is reusing array slots in operations such as xs[i] = xs[i] + 1. For this we need some sort of ArrSlot operation in the IR, instead of the GetArr and SetArr. This optimization can be seen as a variant of common subexpression elimination, with the restriction that slots are invalidated anytime the table is reallocated (a renormalize the grows the table

The second optimization we want to do is avoiding repeated renormalizes in a loop. For example, if we have

local xs = {}
for i = 1, N do
   xs[i] = 0
end

then in theory we can resize the array to have size N before the start of the loop and then the array doesn't need to be grown inside the loop. For this second optimization to happen we need to have a Renormalize operation in the IR that is separate from GetSlot.

In theory we could at first implement only the first optimization, leaving the renormalize functionality inside GetSlot and, The Renormalize operation would be left for a second step, when we would implement the second optimization. This was what I was alluding to in Question 1.

Nevertheless, it might be simpler to understand what is going on if we separate the Renormalize and the GetSlot right from the start.

Validate(xs)

The renormalize operation needs to receive an array index as a parameter. That is, it needs to be Renormalize(xs, i) and not Renormalize(xs). What the renormalize does is that it ensures that that all indexes up to the given index i are in the array part. For that you need to know the index.

It is quite hard to identify slot1 <- ArrSlot(xs, i) as an invariant

The optimization of moving the renormalize out of the loop isn't really a loop-invariant code motion. It might be better to see it as a variant of bounds-check elimination. What we want to do is replace a sequence of Renormalize(xs, 1), Renormalize(xs, 2), ... , Renormalize(xs, n) with a single Renormalize(xs, n) outside the loop.

The basic idea is that a Renormalize(xs, i) allows us to delete a subsequent Renormalize(xs, j) if i >= j and xs has not been renormalized in between. This is similar to how a bounds check of len(xs) >= i allows us to get rid of a subsequent check of len(xs) >= j if i >= j.

We can opt for seeing m[i][j] as a single variable or seeing it as 3 separate vars (m,i and j) (or seeing it as m[i] and j...).

The IR will already write m[i][j] as separate statements so I suspect that reasoning about multi-level array accesses may be swimming against the current

a <- m[i]
b <- a[j]

We also have to consider that in Lua it is technically possible to assign m[i] = m, meaning that a and m alias each other. This kind of thing makes it difficult to reason about m[i][j] as a single entity... It also inhibits some optimizations that would be possible in C, because C has stricter aliasing rules than Pallene. (For example, in C an int** cannot alias a int*).

That said, after thinking more about it I think the answer to my Question 2 is yes. Consider the example m[i][j] = m[i][j] + 1. And ignore the Renormalize for now, think of just the ArrSlot.

The example gets translated to this IR:

s1 <- ArrSlot(m, i) # from LHS
a  <- Get(s1)
s2 <- ArrSlot(m, 1) # from RHS
b  <- Get(s2)
s3 <- ArrSlot(b, j)
x  <- Get(s3)
y  <- x + 1
s4 <- ArrSlot(a, j)
      Set(s4, y)

Common subexpression elimination for slots lets us replace all uses of s2 with s1

s1 <- ArrSlot(m, i)
a  <- Get(s1)
b  <- Get(s1)
s3 <- ArrSlot(b, j)
x  <- Get(s3)
y  <- x + 1
s4 <- ArrSlot(a, j)
      Set(s4, y)

Common subexpression elimination lets replace all uses of b with a

s1 <- ArrSlot(m, i)
a  <- Get(s1)
s3 <- ArrSlot(a, j)
x  <- Get(s3)
y  <- x + 1
s4 <- ArrSlot(a, j)
      Set(s4, y)

Finally, common subexpression elimination for slots also us replace s4 with s3

s1 <- ArrSlot(m, i)
a  <- Get(s1)
s3 <- ArrSlot(a, j)
x  <- Get(s3)
y  <- x + 1
      Set(s3, y)

This is the optimized version that we wanted.

hugomg commented 4 years ago

On the topic of slot invalidation... I checked the Lua source code again to check when exactly a table can be reallocated.

The only time where the array part and the hash part may changes when we write to a key that does not currently exist in the table. If there is no room in the table to store that key then Lua calls the rehash function to create space for the key. This rehash operation grows the array part or the hash part, whichever is appropriate in that case. The other part that didn't grow may stay the same size or it may shrink.

pallene-lang / pallene

Redundant renormalize_array causing slowdown #197