**inducer** opened this issue 4 years ago (status: Open)
There's a pickled loopy kernel at https://gitlab.tiker.net/inducer/loopy/-/issues/213
Anyone have any ideas for benchmarks? The medium-sized sumpy kernel takes way too much time at the moment (20 minutes per benchmark run).
See http://koelsch.d.tiker.net:8000 (you need to be inside the UIUC firewall).
@isuruf, can you get the time down by dialing down the order? (I suspect you tried, asking just for completeness.)
@kaushikcfd, do you have some Firedrake kernels that we could use here?
I have one: big_kernel.py [6.5MB], but on current master it takes around 50 minutes with 15GB of peak memory on koelsch.
@wence- provided some stats here on a big kernel from Firedrake. (after #136 and gitlab !408).
My initial read of the stats:

- `check_variable_access_ordered` (which can be turned off, and which was the focus of !408)
What kind of sizes do you want? This is an extreme case, but it is easy to provide real problems that take tens of seconds.
Give me the one(s) you feel are the least reasonable for their size, so that we can get those dealt with first.
@wence-, if it is easy, could you run the same thing through https://github.com/joerick/pyinstrument/ and provide the .pyireport? This does a better job gathering the call graph, which might make it easier to draw conclusions.
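For anyone reproducing this locally without pyinstrument, here is a minimal sketch of the same kind of profiling using only the standard library's `cProfile`. pyinstrument (used in this thread) samples the stack and renders a friendlier call tree, but the workflow is similar. The `build_kernel` workload below is a placeholder, not loopy's API:

```python
# Hypothetical stand-in: profile a kernel-build call with the stdlib
# cProfile module and print the functions sorted by cumulative time,
# so the biggest offenders appear first.
import cProfile
import io
import pstats


def build_kernel():
    # Placeholder for the real work, e.g. a loopy.make_kernel(...) call.
    return sum(i * i for i in range(10_000))


profiler = cProfile.Profile()
profiler.enable()
build_kernel()
profiler.disable()

# Render the stats into a string, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print("function calls" in report)
```

This gives a deterministic call-count-based breakdown; a sampling profiler like pyinstrument is usually easier to read for long pipelines because it preserves the call graph shape.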
Here we go (1GB decompressed) 1451-kernel.pyireport.gz
Thanks! That helps quite a bit in breaking things down. Better breakdown:

Total time: 348s

- a large chunk in `check_variable_access_ordered`, and 47s in bounds checking (we can make both of those optional, e.g. via options)
- time in `get_admissible_conditional_inames_for`, which @kaushikcfd's/@isuruf's next patch targets
- about 30s in type inference. This has bugs anyway (https://gitlab.tiker.net/inducer/loopy/-/issues/132) and probably needs a better algorithm.

If you turn off the checks, I think 20s or so might be within reach with some more work, but we'll see. Thanks for providing this!
Given that you topo-sort the assignments, you can infer types in linear time by requiring that the roots are typed and propagating through the topological order, no?
We do this in tsfc here: https://github.com/firedrakeproject/tsfc/blob/f14bb4f394a438e7f68d8f6598f11c3f7d4f87ee/tsfc/loopy.py#L86
Right now we only use this to give types to temporaries and arguments, but if you want the statements to be typed too we probably have that information and can hand it over.
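The propagation idea above can be sketched in a few lines. Everything here (the toy statement table, the `join` rule) is illustrative, not loopy's or tsfc's actual IR: given typed roots and a topological order over the assignments, one pass suffices.

```python
# Toy sketch of linear-time type inference over topologically sorted
# assignments: roots (arguments/temporaries) carry declared types, and
# each statement's type is joined from its operands in a single pass.
from graphlib import TopologicalSorter  # Python 3.9+

# statement -> names it reads; each statement writes its own name
reads = {
    "t0": ["a", "b"],   # t0 = a + b
    "t1": ["t0", "c"],  # t1 = t0 * c
    "out": ["t1"],      # out = t1
}
declared = {"a": "float32", "b": "float64", "c": "float32"}


def join(ty1, ty2):
    # Simplistic type join: widen to float64 if either side is float64.
    return "float64" if "float64" in (ty1, ty2) else ty1


types = dict(declared)
# TopologicalSorter wants node -> predecessors; declared roots need no
# entry of their own.
graph = {s: [r for r in rs if r in reads] for s, rs in reads.items()}
for stmt in TopologicalSorter(graph).static_order():
    operand_types = [types[name] for name in reads[stmt]]
    ty = operand_types[0]
    for other in operand_types[1:]:
        ty = join(ty, other)
    types[stmt] = ty

print(types["out"])  # -> float64 (widened via b)
```

Each statement is visited once and each operand looked at once, so the cost is linear in the number of reads, as opposed to fixed-point iteration.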
I should also check how long the ast->C step takes...
> I should also check how long the ast->C step takes...
This is fast. However, I attach the profiler report for the first TSFC pass (UFL->loopy). You can see that `make_kernel` takes about 40% of the 40s here.
tsfc-tetrahedron-1451.pyireport.gz
Here are some smaller ones: a curl-curl problem, and then a quad-curl problem (the latter is a bit contrived, but we should soon have some elements where you can do this in primal rather than mixed form).
TSFC takes ~3.5s and ~11s respectively (again, about 40% of this is in `make_kernel`); loopy then takes ~35s and ~125s respectively.
In comparison to the bigger kernel these have fewer instructions, but more complicated loop nests.
Thanks for the work on this; it's definitely appreciated at this end, since getting compile times down is a big win for user experience.

curl-curl-loopy.py.gz curl-curl-loopy.pyireport.gz curl-curl-tsfc.pyireport.gz
quad-curl-loopy.py.gz quad-curl-loopy.pyireport.gz quad-curl-tsfc.pyireport.gz
See also #148.
All of loopy takes about 680s:

- 184s of that in `check_variable_access_ordered` (which can be turned off, and which was the focus of !408). With #281, this is 4s.
- 166s in codegen. With #280, this is 15s now.
- 100s in preprocessing (pending breakdown). With #270, #271, #274, this is 3s.

Overall it's 82s.
Returning to this again, just to keep track of what I should be doing (since we've had some more big kernels that take a while to compile).
I attach a pickled kernel that takes about 80s to run `loopy.generate_code_v2(kernel)`. This is with the checking switched off (I think).
The kernel has about 14000 instructions. Generating code from a kernel that is less optimised (in terms of flop count) but does the same computation with ~1400 instructions takes about 8s. So it seems like most stuff is probably linear, at roughly 5ms per instruction to generate the IR.
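As a quick sanity check on the scaling claim above, both kernel sizes work out to essentially the same per-instruction cost:

```python
# Back-of-envelope check of the numbers reported in this comment:
# ~14000 instructions -> ~80s and ~1400 instructions -> ~8s.
per_instr_big = 80.0 / 14000 * 1000    # ms per instruction, big kernel
per_instr_small = 8.0 / 1400 * 1000    # ms per instruction, small kernel
print(round(per_instr_big, 2), round(per_instr_small, 2))  # -> 5.71 5.71
```

Identical per-instruction cost at a 10x size difference is consistent with the pipeline being linear in instruction count, with the constant (~5-6ms/instruction) being the thing to attack.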
I tried with #372 (not sure if this, or the other bits in it, is meant to address codegen speed). That gave me a warning: `UserWarning: Falling back to a slow scheduler implementation due to: cannot schedule kernels with priority dependencies between sibling loop nests`. And it takes ~75s, so not really a big change.
Am I doing all the things currently in `main` to address these kinds of speed issues? LMK if you want any experiments doing.
Hi Lawrence,
Thanks for sharing the kernel. The flamegraph (download for an interactive experience) on `main` for the entire lowering pipeline tells us these are the biggest offenders:

- `check_bounds`
A set of transformation/code generation benchmarks might be useful.
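A harness for such benchmarks could be built from the standard library alone. A minimal sketch, where the workload is a placeholder for e.g. a `generate_code_v2` call on a pickled kernel:

```python
# Minimal sketch of a transformation/codegen benchmark harness using
# only the stdlib. Each entry would, in practice, unpickle a kernel and
# time a loopy pass over it; here we time a trivial placeholder.
import timeit


def tiny_workload():
    return sum(i for i in range(1000))


benchmarks = {"tiny_workload": tiny_workload}

results = {}
for name, fn in benchmarks.items():
    # Best of 5 repeats of 100 runs each, reported per call in ms, to
    # reduce timer noise on short workloads.
    runs = 100
    best = min(timeit.repeat(fn, number=runs, repeat=5))
    results[name] = best / runs * 1000.0
    print(f"{name}: {results[name]:.3f} ms/call")
```

Tracking these numbers per commit (as the koelsch dashboard mentioned at the top of the thread does) would catch regressions like the ones discussed above before they land.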