inducer / loopy

A code generator for array-based code on CPUs and GPUs
http://mathema.tician.de/software/loopy
MIT License

Millisecond loopy #117

Open inducer opened 4 years ago

mattwala commented 4 years ago

A set of transformation/code generation benchmarks might be useful.
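For illustration, a minimal sketch of what one such benchmark could look like (the toy kernel here is made up; a real suite would presumably use representative kernels like the sumpy/Firedrake ones discussed below):

```python
# Sketch of a transformation/codegen micro-benchmark. The toy kernel is
# illustrative only; a real benchmark would use representative kernels.
import time

import numpy as np
import loopy as lp

t0 = time.time()
knl = lp.make_kernel(
        "{[i]: 0<=i<n}",
        "out[i] = 2*a[i]",
        [lp.GlobalArg("out", np.float64, shape="n"),
         lp.GlobalArg("a", np.float64, shape="n"),
         lp.ValueArg("n", np.int32)])
t_make = time.time() - t0

t0 = time.time()
code = lp.generate_code_v2(knl).device_code()
t_codegen = time.time() - t0

print(f"make_kernel: {t_make:.3f}s  generate_code_v2: {t_codegen:.3f}s")
```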

isuruf commented 4 years ago

There's a pickled loopy kernel at https://gitlab.tiker.net/inducer/loopy/-/issues/213

isuruf commented 4 years ago

Anyone have any ideas for benchmarks? The medium-sized sumpy kernel takes way too much time at the moment (20 mins per benchmark run).

See http://koelsch.d.tiker.net:8000 (you need to be inside the UIUC firewall).

inducer commented 4 years ago

@isuruf, can you get the time down by dialing down the order? (I suspect you tried, asking just for completeness.)

@kaushikcfd, do you have some Firedrake kernels that we could use here?

kaushikcfd commented 4 years ago

I have one: big_kernel.py [6.5 MB], but on current master it takes around 50 minutes, with 15 GB of peak memory, on koelsch.

inducer commented 4 years ago

@wence- provided some stats here on a big kernel from Firedrake (after #136 and gitlab !408).

My initial read of the stats:

@wence-, if it is easy, could you run the same thing through https://github.com/joerick/pyinstrument/ and provide the .pyireport? This does a better job gathering the call graph, which might make it easier to draw conclusions.
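(A minimal sketch of how such a profile could be captured with pyinstrument's documented Profiler API; exactly how the .pyireport attachments below were actually rendered may differ:)

```python
# Sketch: profile loopy code generation under pyinstrument.
# Profiler.start/stop/output_text are documented pyinstrument API;
# the renderer used for the .pyireport attachments here may have differed.
import loopy as lp
from pyinstrument import Profiler

profiler = Profiler()
profiler.start()
lp.generate_code_v2(knl)  # knl: a large loopy kernel, assumed in scope
profiler.stop()
print(profiler.output_text())
```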

wence- commented 4 years ago

What kind of sizes do you want? This is an extreme case, but it is easy to provide real problems that take tens of seconds.

inducer commented 4 years ago

Give me the one(s) you feel are the least reasonable for their size, so that we can get those dealt with first.

wence- commented 4 years ago

> @wence-, if it is easy, could you run the same thing through https://github.com/joerick/pyinstrument/ and provide the .pyireport? This does a better job gathering the call graph, which might make it easier to draw conclusions.

Here we go (1 GB decompressed): 1451-kernel.pyireport.gz

inducer commented 4 years ago

Thanks! That helps quite a bit in breaking things down. Better breakdown:

Total time: 348s

With the checks turned off, I think 20s or so might be within reach with some more work, but we'll see. Thanks for providing this!

wence- commented 4 years ago

> ...of which about 30s is in type inference. This has bugs anyway (https://gitlab.tiker.net/inducer/loopy/-/issues/132) and probably needs a better algorithm.

Given that you topo-sort the assignments, you can infer in linear time by requiring that the roots are typed and propagating types through the topological order, no?

We do this in tsfc here: https://github.com/firedrakeproject/tsfc/blob/f14bb4f394a438e7f68d8f6598f11c3f7d4f87ee/tsfc/loopy.py#L86

Right now we only use this to give types to temporaries and arguments, but if you want the statements to be typed too we probably have that information and can hand it over.
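For concreteness, a toy sketch of that single-pass inference (all names and the promotion rule here are illustrative, not loopy's or tsfc's actual API; a real implementation also has to handle expressions, reductions, and callables):

```python
# Toy single-pass type inference over topologically sorted assignments:
# roots (arguments, pre-typed temporaries) come in typed, and each
# assignment's LHS gets the type unified from its already-typed operands.
import numpy as np

def unify(dtypes):
    # toy promotion rule; stands in for a real unification procedure
    return np.result_type(*dtypes)

def infer_types(assignments, root_types):
    """assignments: [(lhs_name, [rhs_names])] in topological order;
    root_types: {name: dtype} for the typed roots."""
    types = dict(root_types)
    for lhs, rhs_names in assignments:
        # topological order guarantees every RHS name is already typed
        types[lhs] = unify([types[name] for name in rhs_names])
    return types

# example: a, b are typed roots; t and out are inferred in one pass
print(infer_types(
        [("t", ["a", "b"]), ("out", ["t", "a"])],
        {"a": np.dtype(np.float32), "b": np.dtype(np.float64)}))
```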

I should also check how long the ast->C step takes...

wence- commented 4 years ago

> I should also check how long the ast->C step takes...

This is fast. However, I attach the profiler report for the first TSFC pass (UFL->loopy); you can see that make_kernel takes about 40% of the 40s here. tsfc-tetrahedron-1451.pyireport.gz

wence- commented 4 years ago

Here are some smaller ones: a curl-curl problem, and then a quad-curl problem (the latter is a bit contrived, but we should have some elements soon where you can do this in primal rather than mixed form).

TSFC takes ~3.5s and ~11s respectively (again, about 40% of this is in make_kernel); loopy then takes ~35s and ~125s respectively.

In comparison to the bigger kernel these have fewer instructions, but more complicated loop nests.

Thanks for the work on this; it's definitely appreciated at this end, since getting the compile times down is a big win for user experience.

curl-curl-loopy.py.gz curl-curl-loopy.pyireport.gz curl-curl-tsfc.pyireport.gz

quad-curl-loopy.py.gz quad-curl-loopy.pyireport.gz quad-curl-tsfc.pyireport.gz

inducer commented 4 years ago

See also #148.

isuruf commented 3 years ago

> All of loopy takes about 680s. 184s of that is in check_variable_access_ordered (which can be turned off, and which was the focus of !408).

With #281, this is 4s.

> 166s in codegen.

With #280, this is 15s now.

> 100s in preprocessing (pending breakdown).

With #270, #271, #274, this is 3s.

Overall it's 82s.
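(Side note for anyone hitting the check cost in the meantime: if I remember the option name right, the access-ordering check can be skipped per kernel; verify against loopy.Options for your version.)

```python
# Sketch: skip the expensive variable-access-ordering check.
# "enforce_variable_access_ordered" is the option name as I recall it
# from loopy's Options class; double-check against your loopy version.
import loopy as lp

knl = lp.set_options(knl, enforce_variable_access_ordered="no_check")
code = lp.generate_code_v2(knl)
```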

wence- commented 3 years ago

Returning to this again, just to keep track of what I should be doing (since we've had some more big kernels that take a while to compile).

I attach a pickled kernel that takes about 80s to do loopy.generate_code_v2(kernel). This is with the checking switched off (I think).

The kernel has about 14000 instructions. Generating code from a kernel that is less optimised (in terms of flop count) but does the same computation with ~1400 instructions takes about 8s. So most of the pipeline seems to scale roughly linearly, but at a cost of around 5-6ms per instruction to generate the IR (80s/14000 ≈ 8s/1400 ≈ 5.7ms per instruction).

I tried #372 (not sure whether that, or the other bits in it, is meant to address codegen speed). That gave me a warning: "UserWarning: Falling back to a slow scheduler implementation due to: cannot schedule kernels with priority dependencies between sibling loop nests", and it takes ~75s, so not really a big change.

Am I doing all the things currently in main to address these kinds of speed issues? LMK if you want any experiments run.

slow-kernel.pickle.gz
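(To reproduce: a sketch of loading the attached kernel and timing codegen; the file name matches the attachment above, and splitting out the preprocessing phase is an addition of mine, not something the comment specifies.)

```python
# Sketch: load the attached pickled kernel and time code generation,
# with preprocessing split out from the rest of codegen.
import gzip
import pickle
import time

import loopy as lp

with gzip.open("slow-kernel.pickle.gz", "rb") as f:
    knl = pickle.load(f)

t0 = time.time()
pknl = lp.preprocess_kernel(knl)
t1 = time.time()
code = lp.generate_code_v2(pknl)
t2 = time.time()

print(f"preprocess: {t1-t0:.1f}s  codegen: {t2-t1:.1f}s")
```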

kaushikcfd commented 3 years ago

Hi Lawrence, thanks for sharing the kernel. The flamegraph (download it for the interactive experience) on main for the entire lowering pipeline tells us these are the biggest offenders:

#372 was mainly targeted at lowering the code-generation time for kernels with many loop nests, so it might not be very helpful for kernels with a small number of loops but a large number of statements.