rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License

reduce compile time of stack construction and map #471

Closed · rafaqz closed this 1 year ago

rafaqz commented 1 year ago

This PR reduces the compile time of DimStack construction and of functions relying on map by avoiding intermediate NamedTuples of DimArray. For whatever reason these are much slower to compile than Tuples of DimArray.

There is also some work to use Vector rather than Tuple for some methods, so that compilation only has to happen once and map is cheaper to compile.
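
As a rough sketch of the NamedTuple trick (illustrative only, not the actual internals; map_layers is a made-up name): strip the names off, map over a plain Tuple, and only reattach the names at the end.

using DimensionalData

# Hypothetical illustration of the NamedTuple -> Tuple approach
function map_layers(f, layers::NamedTuple{K}) where K
    vals = map(f, values(layers))  # map over a plain Tuple of DimArray
    return NamedTuple{K}(vals)     # reattach the layer names at the end
end

x, y = X(1:2), Y(1:3)
layers = (one=DimArray(rand(2, 3), (x, y)), two=DimArray(rand(2, 3), (x, y)))
map_layers(A -> 2 .* A, layers)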

Also some @nospecialize to reduce compilation (it helps even when the method is not called directly, oddly enough).
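
For reference, @nospecialize goes on individual arguments in the method signature; a generic illustration (not code from this PR):

# The compiler reuses one compiled body for any Tuple of dims instead of
# specialising on each concrete tuple type.
describe_dims(@nospecialize(dims::Tuple)) = string(length(dims), " dimensions")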

@sethaxen this has a lot less recompilation for stacks with different numbers of layers - for map and DimStack. It would be good to know if it helps with your specific workloads.

Now that precompiling "sticks" a bit, we could add some precompile statements to remove even the first compilation for some methods.

codecov-commenter commented 1 year ago

Codecov Report

Merging #471 (f315588) into main (efa31ed) will increase coverage by 0.25%. The diff coverage is 95.56%.

@@            Coverage Diff             @@
##             main     #471      +/-   ##
==========================================
+ Coverage   89.73%   89.99%   +0.25%     
==========================================
  Files          38       39       +1     
  Lines        2660     2928     +268     
==========================================
+ Hits         2387     2635     +248     
- Misses        273      293      +20     
Impacted Files                      Coverage Δ
src/DimensionalData.jl              100.00% <ø> (ø)
src/Dimensions/dimension.jl          88.52% <ø> (ø)
src/utils.jl                         85.71% <83.33%> (-9.12%) ↓
src/Dimensions/primitives.jl         93.30% <88.88%> (-0.37%) ↓
src/stack/methods.jl                 96.15% <93.75%> (+0.69%) ↑
src/stack/stack.jl                   91.66% <94.44%> (-1.39%) ↓
src/array/show.jl                    95.09% <95.31%> (-1.25%) ↓
src/Dimensions/show.jl               95.77% <100.00%> (+0.06%) ↑
src/LookupArrays/lookup_arrays.jl    82.77% <100.00%> (+0.09%) ↑
src/LookupArrays/show.jl             84.72% <100.00%> (+0.66%) ↑
... and 2 more

... and 1 file with indirect coverage changes


rafaqz commented 1 year ago

This ended up being quite rambling. Compile time of show was a big factor in REPL performance, so I fixed that too to some extent.

There's a proper SnoopPrecompile.jl block now that seems to help quite a bit with show.
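
For anyone following along, such a block looks roughly like this; it's a sketch of the general pattern only, the workloads actually precompiled in this PR differ:

using DimensionalData    # not needed inside the package module itself
using SnoopPrecompile

@precompile_all_calls begin
    da = DimArray(rand(2, 3), (X([:a, :b]), Y(10.0:10.0:30.0)); name=:layer)
    show(IOBuffer(), MIME"text/plain"(), da)  # cover the display path
    st = DimStack((one=da, two=2 .* da))      # cover stack construction
    map(sum, st)                              # cover map over a stack
end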

More work could be done with SnoopCompile to get the SnoopPrecompile block to cover more common use cases.

Other than that, the DimStack type could be simplified, but that would be a breaking change.

If all layers share the same dims we could skip layerdims (it could be nothing and layerdims would just return dims).

combinedims and rebuild_from_arrays could be further optimised for compile time; they're currently the worst offenders, since most operations on a stack that return another stack pass through them.

We may be able to also drop combinedims when all dims are shared between layers.
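
Roughly, the layerdims idea would be something like the following; purely a design sketch against a hypothetical field layout, not working code for the current internals:

# Hypothetical: if the layerdims field were allowed to be `nothing` whenever
# every layer shares the stack's dims, the accessor could just fall back to dims.
layerdims(s::DimStack, key::Symbol) =
    s.layerdims === nothing ? dims(s) : s.layerdims[key]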

sethaxen commented 1 year ago

@rafaqz I'm traveling right now but will try to review this weekend.

rafaqz commented 1 year ago

No rush

rafaqz commented 1 year ago

I also seem to have messed up the formatting a bit while optimising show.

rafaqz commented 1 year ago

The idea with the show optimisation is that in practice it was making everything else feel slower: often show was taking longer than the methods that led to it. So:

  1. precompile the show methods up front for some basic types (half of it is Base methods, like printstyled, that are not precompiled)
  2. reduce compile time for the inevitable custom dim and array type combinations to something like 0.2-0.3 seconds.

Getting it down to 0.1 seconds would be best, so you basically don't notice any lag, but it's hard to get there.
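
Concretely, point 1 amounts to precompile statements of roughly this shape (illustrated with Base types here; the real statements target DimensionalData's own show methods):

# Precompile some of the printing machinery up front; the concrete DimArray
# type parameters in the real package are much more involved than this.
precompile(Base.show, (IOContext{IOBuffer}, MIME"text/plain", Matrix{Float64}))
precompile(Base.printstyled, (IOContext{IOBuffer}, String))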

sethaxen commented 1 year ago

Indeed, I'm seeing show be much snappier. In particular the first display(::DimArray) for typical arrays used to take 0.52s but now takes 0.17s. display(::AbstractDimStack) went from 1.27s to 0.96s; do you think this could be sped up further?

With these changes we don't see much of a reduction in TTFX for I/O and common analyses. After performing the analysis once on one AbstractDimStack, the compile time for a different AbstractDimStack is substantially reduced, so precompilation could help us a lot here. But I wonder if the implementations are doing things that substantially increase compile time. How have you been going about identifying the compile time bottlenecks in DimensionalData?

rafaqz commented 1 year ago

I've been using @time and SnoopCompile's @snoopi_deep on the first run of various methods and visualising the output with ProfileView.jl, plus Cthulhu's @descend to understand what is being compiled. It's a little clunky as a workflow because you need to keep restarting Julia, but it works.
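
For the Cthulhu part, that's just something like this (a minimal example; the DimArray here is a stand-in for whatever you're investigating):

using Cthulhu, DimensionalData

da = DimArray(rand(2, 3), (X([:a, :b]), Y(1:3)); name=:layer)
# Interactively descend into what gets inferred when constructing a stack;
# useful for seeing where compile time actually goes.
@descend DimStack((one=da, two=da))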

You're right, we could precompile more things to make the first runs better too. Feel free to PR more, but we can keep working on this over time too. It's also a good exercise for understanding the tradeoffs to guide future design decisions.

rafaqz commented 1 year ago

I can post my ttfx benchmark script here when I get back to Copenhagen (tonight or tomorrow), if you want something to get started with.

rafaqz commented 1 year ago

using SnoopCompile
using ProfileView
@time using DimensionalData
# invalidations = @snoopr using DimensionalData  # (SnoopCompileCore; use in place of the line above)
# trees = invalidation_trees(invalidations)
# using Cthulhu

# Define dimarrays
A = [1.0 2.0 3.0; 4.0 5.0 6.0]
x, y, z = X([:a, :b]), Y(10.0:10.0:30.0; metadata=Dict()), Z()
dimz = x, y
da1 = permutedims(DimArray(A, (x, y); name=:one));
da2 = DimArray(Float16.(2A), (x, y); name=:two)
da3 = DimArray(Int.(3A), (x, y); name=:three);
da4 = DimArray(cat(4A, 5A, 6A, 7A; dims=3), (x, y, z); name=:exteradim);

# Snoop display
tinf = @snoopi_deep @time display(da1)
tinf = @snoopi_deep @time display(da2)
tinf = @snoopi_deep @time display(da3)
tinf = @snoopi_deep @time display(da4)

# View flamegraph of some snoop
fg = flamegraph(tinf)
ProfileView.view(fg)

# Some mixed type DimArrays to generate random stacks from
das = (da1, da2, da3, da4)
# Define random stacks
many_das1 = Tuple(rand(das, 14)); @time s1 = DimStack(many_das1) 
many_das2 = Tuple(rand(das, 14)); @time s2 = DimStack(many_das2) 

# Merging stacks
tinf = @snoopi_deep @time merge(s1, s2)

s = s1
tinf = @snoopi_deep @time display(s)

# Some functions over stacks
tinf = @snoopi_deep @time map(sum, s)
tinf = @snoopi_deep @time map(A -> A[X=1], s)
tinf = @snoopi_deep @time prod(s)

fg = flamegraph(tinf)
ProfileView.view(fg)