Open mohamedadaly opened 6 years ago
It's going back and forth to memory because conv is scheduled compute_root. If you schedule it compute_at tiles of relu instead, such that the size of conv needed for one tile is small enough to fit in registers, it should get promoted into registers. See apps/linear_algebra for example gemm schedules.
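As a rough illustration of what that schedule is aiming for, here is the same loop structure written by hand in plain C++. This is not Halide code and not the code Halide would generate. The 4x4 tile size, the row-major layout, and the name `gemm_relu_tiled` are assumptions made for the sketch. The point is that the accumulator for one tile is a small local array that is written out only once, so a compiler is free to keep it in registers for the whole reduction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical hand-written analogue of computing conv at tiles of relu.
// A is M x K, B is K x N, the result is M x N, all row-major.
// Assumes M and N are multiples of the tile size T.
std::vector<float> gemm_relu_tiled(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int M, int N, int K) {
    constexpr int T = 4;  // assumed tile size
    assert(M % T == 0 && N % T == 0);
    std::vector<float> C(static_cast<std::size_t>(M) * N);
    for (int yo = 0; yo < M; yo += T) {
        for (int xo = 0; xo < N; xo += T) {
            // Per-tile accumulator: small and local, so it can live in registers.
            float acc[T][T] = {};
            for (int k = 0; k < K; ++k) {
                for (int yi = 0; yi < T; ++yi) {
                    for (int xi = 0; xi < T; ++xi) {
                        acc[yi][xi] += A[(yo + yi) * K + k] * B[k * N + xo + xi];
                    }
                }
            }
            // The consumer (relu) reads the finished tile once and stores it.
            for (int yi = 0; yi < T; ++yi) {
                for (int xi = 0; xi < T; ++xi) {
                    C[(yo + yi) * N + xo + xi] = std::max(acc[yi][xi], 0.0f);
                }
            }
        }
    }
    return C;
}
```

With a compute_root-style structure the accumulator would instead be the full M x N array in memory, and every multiply-accumulate would load from and store to it.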
Thanks!
But the problem is that relu doesn't know about the RDom in conv (the reduction over the cols of lhs = rows of rhs), and I want to split over that dimension using this order
.reorder(x, y, ri, xr, yr, yc, xc, r)
basically to achieve a blocking/tiling structure like this.
Is there a trick to achieve this ordering and at the same time promote the inner loop to registers?
So you want to do some of the summation in the innermost loop, and more in the outermost loop. I think you need to factor that reduction into two stages, using Func::rfactor, so that ri and r belong to two distinct Funcs.
Func::rfactor did the trick. Thanks a lot :)
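For readers unfamiliar with the idea, factoring a reduction into two stages can be sketched in plain C++ as below. This is not Halide code. The function name `dot_two_stage` and the chunk length `KI` are hypothetical, and a dot product stands in for the GEMM reduction. The inner stage sums over `ri` into a scalar that can stay in a register, and the outer stage sums those partial results over `r`. The factoring changes the order of the floating-point additions, so it relies on treating the reduction as associative.

```cpp
#include <cassert>
#include <vector>

// Two-stage reduction: the K-long sum is split into K / KI chunks of length KI.
// Assumes a and b have the same size and that size is a multiple of KI.
float dot_two_stage(const std::vector<float>& a,
                    const std::vector<float>& b,
                    int KI) {
    const int K = static_cast<int>(a.size());
    assert(a.size() == b.size() && KI > 0 && K % KI == 0);
    float total = 0.0f;  // second stage: reduction over the outer index r
    for (int r = 0; r < K / KI; ++r) {
        float partial = 0.0f;  // first stage: reduction over the inner index ri
        for (int ri = 0; ri < KI; ++ri) {
            partial += a[r * KI + ri] * b[r * KI + ri];
        }
        total += partial;
    }
    return total;
}
```

Because the two stages are separate, each one can be given its own place in the loop nest, which is what allows `ri` to sit innermost while `r` sits outermost.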
One more question: Is it possible to specialize the compute_at level of a Func? When I do that I get this error:
conv.specialize(split).compute_root();
error: ‘class Halide::Stage’ has no member named ‘compute_root’
I can do it with a GeneratorParam at compile time to fork two separate paths, but it would be easier if there is a way to do it conditionally at run time.
You can, but it's a bit counterintuitive. You need to make a compute_at on a single variable that's valid on both sides of the specialization, but has a different meaning by virtue of where that variable shows up in the loop nest. E.g.
Var x, y, dummy;
consumer(x, y) = producer(x, y);
// Make a dummy var of size 1, and then conditionally reorder it to be outermost
consumer.split(x, x, dummy, 1);
consumer.specialize(c).reorder(x, y, dummy); // if c is true, dummy is the outermost loop
consumer.reorder(dummy, x, y); // else it's the innermost loop
producer.compute_at(consumer, dummy);
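To make the effect of the trick concrete, the two loop nests it selects between can be sketched by hand in plain C++. This is an illustration only, not Halide output. `producer_at` and `run_consumer` are hypothetical names, and the producer's formula is arbitrary. When `c` is true the size-1 `dummy` loop is outermost, so the producer is computed for the whole region up front. Otherwise `dummy` is innermost and the producer is computed one point at a time where it is consumed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for producer(x, y).
inline int producer_at(int x, int y) { return x + 10 * y; }

// Returns the W x H consumer output in row-major order.
std::vector<int> run_consumer(int W, int H, bool c) {
    std::vector<int> out(static_cast<std::size_t>(W) * H);
    if (c) {
        // dummy outermost: everything the consumer needs is produced once.
        for (int dummy = 0; dummy < 1; ++dummy) {
            std::vector<int> prod(static_cast<std::size_t>(W) * H);
            for (int y = 0; y < H; ++y)
                for (int x = 0; x < W; ++x)
                    prod[y * W + x] = producer_at(x, y);
            for (int y = 0; y < H; ++y)
                for (int x = 0; x < W; ++x)
                    out[y * W + x] = prod[y * W + x];
        }
    } else {
        // dummy innermost: the producer is computed per point, at its use.
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                for (int dummy = 0; dummy < 1; ++dummy) {
                    int prod = producer_at(x, y);
                    out[y * W + x] = prod;
                }
    }
    return out;
}
```

Both branches produce the same values. Only where the producer is computed and how much of it is stored at once differ.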
Is there a similar trick to make it also work for bound_extent, store_in, etc., which are not defined for Halide::Stage? Or more generally, is there a way to have conditional branching at run time, similar to specialize but allowing arbitrary statements?
Thanks again!
I am trying to write a generator to perform a GEMM operation. The outline is below.
The problem with the generated code is that the inner kernel (the loop in conv inside the inner tile [xr, yr]) keeps loading and storing the accumulator from and to memory. This wastes a lot of CPU cycles and could be made much faster. For example, loading all 4x4 elements of the conv array into registers outside this loop, and then multiply-accumulating with vmla.f32 inside the loop, would be much faster than loading and storing inside the loop. Is there a way to do this using scheduling, i.e. to force the inner loop's accumulator to be kept in registers?
Thanks.