timholy / SnoopCompile.jl

Provide insights about latency (TTFX) for Julia packages
https://timholy.github.io/SnoopCompile.jl/dev/

inference/LLVM codegen/native codegen time split #251

Open bkamins opened 3 years ago

bkamins commented 3 years ago

If I understand things correctly, SnoopCompile.jl currently allows one to dig into inference timings. It would be great to be able to dig into LLVM codegen and native codegen timings similarly.

This is relevant for packages like DataFrames.jl, where in many cases LLVM codegen and/or native codegen dominates the running time of some functions (and this is unavoidable, since users pass anonymous functions as arguments to many functions).

An example of such a case is here: https://github.com/JuliaData/DataFrames.jl/issues/2806
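To illustrate why anonymous functions force repeated codegen: every anonymous-function literal in Julia gets its own type, so any method that specializes on its function argument must be compiled afresh for each closure a user writes. A minimal, self-contained sketch (toy functions, not DataFrames internals):

```julia
# Two anonymous functions with identical bodies still have distinct types,
# so code specialized on the function argument is recompiled for each one.
f1 = x -> x + 1
f2 = x -> x + 1

println(typeof(f1) === typeof(f2))  # false: distinct closure types
println(f1(1) == f2(1))             # true: identical behavior
```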

NHDaly commented 3 years ago

@bkamins you already currently can dig into LLVM optimization and native codegen timings, via @snoopl and @snoopc.

We haven't written big blog posts about those yet, I think, but there is at least some information in the documentation site:

Part of the difficulty, though, is that the signal you get from those two measurements is less useful than the signal you get from @snoopi_deep.

This is because Julia currently performs type inference via a depth-first traversal, so by measuring trace spans we get a nice flamegraph that you can look at to understand both inclusive and exclusive inference timings.

For LLVM optimization, on the other hand, we pass functions through LLVM optimization in a somewhat random order, batched together arbitrarily, which makes it more difficult to attribute optimization time to individual functions. @snoopl currently emits:

  1. An array of Pairs: the time spent compiling each "batch" (module), paired with the type signatures of the method instances in that module.
  2. A Dict of before/after statistics about each method instance in the module, mapping type signature to stats like number of instructions before/after optimization.

And finally, I believe @snoopc provides a mapping from each method instance to the time spent on codegen for it.


Hope that's helpful! :) If you find any of that useful, documentation PRs are always welcome ❤️ :)

bkamins commented 3 years ago

you already currently can dig into LLVM optimization and native codegen timings, via @snoopl and @snoopc.

The problem is that they start a new Julia session and I need to run them in an already running session so that I do not measure the cost of first compilation. I am interested only in consecutive compilation times. Is this possible?

bkamins commented 3 years ago

Reproducible example on Julia nightly:

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.0-DEV.262 (2021-07-24)
 _/ |\__'_|_|_|\__'_|  |  Commit 0cbb6850fb (0 days old master)
|__/                   |

julia> using DataFrames, SnoopCompile, Statistics

julia> d = DataFrame(n = 1:20, x = [3, 3, 3, 3, 1, 1, 1, 2, 1, 1,
                                    2, 1, 1, 2, 2, 2, 3, 1, 1, 2]);

julia> g = groupby(d, :x);

julia> subset(g,  :n => (n -> n .> mean(n))); # compile

julia> select(g,  :n => (n -> n .> mean(n))); # compile

julia> @snoopi_deep begin
           subset(g,  :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.117688/0.189961 on Core.Compiler.Timings.ROOT() with 15 direct children

julia> @snoopi_deep begin
           subset(g,  :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.111079/0.182713 on Core.Compiler.Timings.ROOT() with 15 direct children

julia> @snoopi_deep begin
           subset(g,  :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.113283/0.185833 on Core.Compiler.Timings.ROOT() with 15 direct children

julia> @snoopi_deep begin
           subset(g,  :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.112419/0.182522 on Core.Compiler.Timings.ROOT() with 15 direct children

julia> @snoopi_deep begin
           select(g,  :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.085624/0.144820 on Core.Compiler.Timings.ROOT() with 7 direct children

julia> @snoopi_deep begin
           select(g,  :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.082723/0.141071 on Core.Compiler.Timings.ROOT() with 7 direct children

julia> @snoopi_deep begin
           select(g,  :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.081279/0.139118 on Core.Compiler.Timings.ROOT() with 7 direct children

julia> @snoopi_deep begin
           select(g,  :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.081399/0.132495 on Core.Compiler.Timings.ROOT() with 7 direct children

And I am discussing the difference between ~0.11 for subset and ~0.08 for select (the first number reported; I do not care now about inference). I would like to know how to diagnose what causes the difference. Thank you!

NHDaly commented 3 years ago

Ah, I understand! Yes, I fully agree!! I actually have a branch on my laptop that adds support for this, but never pushed it up for some dumb reason...

I'm out for the day, but when I get home tomorrow I'll push the branch up and open a PR, and maybe you can help me wrap it up and get it over the finish line? :)

Basically it's just adding a flag to all those functions that controls whether to run the measurements in a separate process or in the current one.

And then I had a mega-function, I think I called it @snoo_all which runs all the analyses together in the same process. :)

Will find it and push it up tomorrow!

bkamins commented 3 years ago

Excellent. Thank you! Please ping me in the PR.

timholy commented 3 years ago

There's definitely some weird recompilation going on.

@snoopi_deep begin
              select(g,  :n => (n -> n .> mean(n)))
          end

seems to reinfer

mean(::SubArray{Int64, 1, Vector{Int64}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false})

every time. It doesn't seem to be constant-prop. However,

julia> function mymean(v)
           s = zero(eltype(v))
           for item in v
               s += item
           end
           return s/length(v)
       end

doesn't show exactly the same thing (the broadcasting part gets reinferred, but not mymean).

timholy commented 3 years ago

(You can display the result in a flamegraph and click on the bars in ProfileView.) You can also find it in print_tree(tinf).
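For readers following along, here is a sketch of that workflow, assuming the SnoopCompile and ProfileView packages are installed and `tinf` is the result of an `@snoopi_deep` run like the ones above (not runnable without those packages):

```julia
using SnoopCompile

# Collect the inference profile for one call (as in the transcripts above):
tinf = @snoopi_deep begin
    subset(g, :n => (n -> n .> mean(n)))
end

# Text view of the inference tree:
print_tree(tinf)

# Interactive flamegraph; clicking a bar shows the corresponding MethodInstance:
using ProfileView
ProfileView.view(flamegraph(tinf))
```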

timholy commented 3 years ago

I still don't know what's going on at the level of the compiler, but if I make this change:

~/.julia/dev/DataFrames$ git diff
diff --git a/src/groupeddataframe/callprocessing.jl b/src/groupeddataframe/callprocessing.jl
index 740a9df7..d0c079b4 100644
--- a/src/groupeddataframe/callprocessing.jl
+++ b/src/groupeddataframe/callprocessing.jl
@@ -87,7 +87,7 @@ function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
     end
 end

-function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
+function do_call(@nospecialize(f::Base.Callable), idx::AbstractVector{<:Integer},
                  starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                  gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer)
     idx = view(idx, starts[i]:ends[i])

then I cut the time roughly in half and I don't recompile mean. Inference profile looks like this:

(screenshot: inference-profile flamegraph)

which is a case in which inference is no longer a substantial bottleneck.

bkamins commented 3 years ago

But if you do @nospecialize here we will have a dynamic dispatch inside this function (and do_call is the only compute intensive part of the whole process).

I think it is better to benchmark the decision about adding @nospecialize using this source data frame:

d = DataFrame(n = 1:100_000_000, x = repeat(1:1_000_000, 100));

since with it we call do_call 10^6 times.

bkamins commented 3 years ago

In this case this indeed improves performance (which is strange): released version timings

julia> @time select(g,  :n => (n -> n .> mean(n)));
  3.737252 seconds (9.80 M allocations: 2.932 GiB, 19.78% gc time, 5.48% compilation time)

julia> @time select(g,  :n => (n -> n .> mean(n)));
  3.757369 seconds (9.80 M allocations: 2.932 GiB, 18.96% gc time, 5.96% compilation time)

your proposal with @nospecialize timings:

julia> @time select(g,  :n => (n -> n .> mean(n)));
  3.694405 seconds (10.07 M allocations: 2.971 GiB, 16.48% gc time, 3.56% compilation time)

julia> @time select(g,  :n => (n -> n .> mean(n)));
  3.699847 seconds (10.07 M allocations: 2.971 GiB, 19.24% gc time, 3.44% compilation time)

But if we use a simpler function the situation is reversed: released version timings:

julia> @time select(g,  :n => x -> first(x));
  2.427769 seconds (187.73 k allocations: 3.751 GiB, 18.02% gc time, 6.28% compilation time)

julia> @time select(g,  :n => x -> first(x));
  2.475404 seconds (187.72 k allocations: 3.751 GiB, 22.28% gc time, 5.05% compilation time)

your proposal with @nospecialize timings:

julia> @time select(g,  :n => x -> first(x));
  2.563807 seconds (2.19 M allocations: 3.840 GiB, 20.53% gc time, 4.47% compilation time)

julia> @time select(g,  :n => x -> first(x));
  2.664046 seconds (2.19 M allocations: 3.840 GiB, 20.78% gc time, 4.52% compilation time)

timholy commented 3 years ago

Thanks for specifying a case where the performance might matter. I don't think do_call is such a performance hotspot: it is getting called by dynamic dispatch from within _combine_process_pair, so I don't think it matters if you just defer that runtime dispatch to something within do_call. The ideal circumstance is that you delay runtime dispatch until something that is intrinsically expensive; from what I can tell, do_call just calls f(column_view) and so it's not doing anything element-wise and so shouldn't be performance-sensitive.

I hadn't noticed, until I found the caller by inserting an error, that you're using tasks. That may help me figure out an MWE for the compiler issue.

bkamins commented 3 years ago

_combine_process_pair calls it only once so it does not matter, but it gets called in a hot loop in two other places:

and I would assume that in these two other places it matters to avoid dynamic dispatch, as in both of them do_call should be called by static dispatch (and this is what I understand happens).

The call in _combine_process_pair is only needed to determine the type of result returned by do_call, so that we can choose a proper code-processing path later. And these two other call sites of do_call actually relate to these paths (one is for processing cases when do_call returns data with multiple rows, and the other for all cases when do_call returns one row of data).

If you start Julia with multiple threads these operations are supporting multi-threading indeed.

timholy commented 3 years ago

Can you give me a workload that is guaranteed to call those? Maybe this one does, but if so:

julia> using MethodAnalysis

julia> include("select_subset_test.jl")

julia> mctwf = methods(DataFrames._combine_tables_with_first!).ms[1]
_combine_tables_with_first!(first::Union{AbstractDataFrame, NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}, outcols::Tuple{Vararg{AbstractVector, N}}, idx::Vector{Int64}, rowstart::Integer, colstart::Integer, f::Union{Function, Type}, gd::GroupedDataFrame, incols::Union{Nothing, Tuple, AbstractVector, NamedTuple}, colnames::Tuple{Vararg{Symbol, N}}, firstmulticol::DataFrames.FirstColCount) where N in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:354

julia> methodinstances(mctwf)
3-element Vector{Core.MethodInstance}:
 MethodInstance for DataFrames._combine_tables_with_first!(::NamedTuple{(:x1,), Tuple{BitVector}}, ::Tuple{Vector{Bool}}, ::Vector{Int64}, ::Int64, ::Int64, ::Function, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Tuple{Symbol}, ::DataFrames.FirstSingleCol)
 MethodInstance for DataFrames._combine_tables_with_first!(::NamedTuple{(:x1,), _A} where _A<:Tuple{AbstractVector}, ::Tuple{AbstractVector}, ::Vector{Int64}, ::Int64, ::Int64, ::Function, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Tuple{Symbol}, ::DataFrames.FirstSingleCol)
 MethodInstance for DataFrames._combine_tables_with_first!(::Union{AbstractDataFrame, NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}, ::Tuple{Vararg{AbstractVector, N}}, ::Vector{Int64}, ::Int64, ::Int64, ::Union{Function, Type}, ::GroupedDataFrame{DataFrame}, ::Union{Nothing, Tuple, NamedTuple, AbstractVector}, ::Tuple{Vararg{Symbol, N}}, ::Union{DataFrames.FirstMultiCol, DataFrames.FirstSingleCol}) where N

so it doesn't seem to specialize on f (easiest to find as the argument after the two Ints) despite the absence of @nospecialize (Julia's heuristics are probably at play). But if I'm not really calling it then this could be misleading. I did comment out DataFrames' precompilation directives just to ensure that all specialization happens by natural dispatch.

UPDATE: I ran DataFrames' "grouping.jl" test script and then got this:

julia> utyps = Base.IdSet(); for mi in methodinstances(mctwf)
           push!(utyps, Base.unwrap_unionall(mi.specTypes).parameters[7])
       end

julia> utyps
Base.IdSet{Any} with 2 elements:
  Function
  Union{Function, Type}

There were 153 MethodInstances, so this seems likely to be all the type-diversity for that argument.

For the other, it's definitely not being called in this workload (there are no MethodInstances). UPDATE: after running the tests there are 1343 MethodInstances (wow!) with 317 different types for f. So this one is being specialized.

Bottom line, if it's not specialized on the specific function, then runtime dispatch gets used. In such cases there should be no cost to deferring it until you get to actual work performed by the function.

In your benchmarking, be careful of potential confusion from world-splitting. If you have 3 or fewer applicable MethodInstances Julia will use world-splitting, but define a 4th method and it will use runtime dispatch. That can distort your estimate of performance hotspots. You may have to decide whether you want to optimize the very first use or the use case where the user has supplied many different functions in a single session. In the demo above do_call is now over the threshold:

julia> m = methods(DataFrames.do_call).ms[2]
do_call(f::Union{Function, Type}, idx::AbstractVector{<:Integer}, starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/callprocessing.jl:90

julia> methodinstances(m)
5-element Vector{Core.MethodInstance}:
 MethodInstance for DataFrames.do_call(::DataFrames.var"#496#497"{var"#11#12"}, ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
 MethodInstance for DataFrames.do_call(::Function, ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
 MethodInstance for DataFrames.do_call(::var"#13#14", ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
 MethodInstance for DataFrames.do_call(::DataFrames.var"#496#497"{var"#15#16"}, ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
 MethodInstance for DataFrames.do_call(::var"#17#18", ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)

but isn't if you run fewer tests. However, apply my @nospecialize, restart your session, and do it again and then this is what you get:

julia> methodinstances(m)
2-element Vector{Core.MethodInstance}:
 MethodInstance for DataFrames.do_call(::Function, ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
 MethodInstance for DataFrames.do_call(::Union{Function, Type}, ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)

This is few enough to ensure that this method of do_call will be called with world-splitting and might actually give some performance advantages.
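The world-splitting threshold is easy to see in a toy example (hypothetical types; the default limit of 3 applicable methods is Julia's `max_methods` heuristic):

```julia
# With 3 or fewer applicable methods, Julia can compile an abstractly typed
# call site into a static branch over the known methods ("world-splitting").
# Defining a 4th applicable method pushes the site over the default limit,
# and the call falls back to runtime dispatch.
abstract type Shape end
struct Circle <: Shape end
struct Square <: Shape end
area(::Circle) = 1.0
area(::Square) = 2.0

# `area(s)` inside `sum` sees an abstract `Shape`, but only two methods
# are applicable, so it can be devirtualized at compile time:
total(shapes::Vector{Shape}) = sum(area, shapes)

println(total(Shape[Circle(), Square()]))  # 3.0
```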

timholy commented 3 years ago

By the way, when you might want to do "pass through" specialization (outer(f, args...) just passes f down to inner(f, otherargs...) without calling f directly), one way to reduce latency would be to take something like this:

function outer(f, args...)
    for i = 1:n
        # do lots of work (independent of f)
        inner(f, otherargs...)
    end
end

and move the `# do lots of work` part into a separate @noinline function that doesn't receive f as an argument. That allows all that work to be compiled many fewer times (no need to specialize on f, which it doesn't use), and keeps outer short and quick to compile, so that specializing it is less painful.
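A runnable toy version of that refactor (all names here are made up for illustration; `prepare` stands in for the f-independent bulk work):

```julia
# The f-independent work lives in a @noinline helper that never sees f, so it
# is compiled once no matter how many distinct closures reach `outer`; only
# the thin loop body gets re-specialized per function.
@noinline prepare(xs) = sum(abs2, xs)   # stands in for "lots of work"

inner(f, s) = f(s)

function outer(f, xs)
    acc = 0
    for _ in 1:3
        s = prepare(xs)                 # compiled once, independent of f
        acc += inner(f, s)
    end
    return acc
end

println(outer(s -> s + 1, [1, 2, 3]))   # 45: three iterations of (14 + 1)
```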

timholy commented 3 years ago

Running this script:

using DataFrames, CategoricalArrays
using SnoopCompile, Profile
using PyPlot: PyPlot, plt

delay = 0.1
Profile.init(n=10^7, delay=delay)

tinf = @snoopi_deep include("grouping.jl")
@profile include("grouping.jl")

mref, ax = pgdsgui(tinf; by=exclusive, t0=delay)

from within DataFrames/test I get this:

(screenshot: pgdsgui scatter plot of run time vs. inference time)

You can also do some analysis:

# To see a list of DataFrames methods that have run times (within our
# poor temporal resolution of 0.1s) less than 1/10th of their inference time
# and an inference time of at least 0.1s:
ridata = runtime_inferencetime(tinf; by=exclusive)
filter(ridata) do pr
    m, rid = pr
    (rid.trun < rid.tinf/10 && rid.tinf > delay) || return false
    return isa(m, Method) ? m.module === DataFrames : occursin("DataFrames", m.file)
end

which yields

29-element Vector{Pair{Union{Method, SnoopCompile.MethodLoc}, SnoopCompile.PGDSData}}:
                                                                                                                                                  transform!(gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s180"} where var"#s180"<:Pair}...; ungroup, renamecols) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:769 => SnoopCompile.PGDSData(0.0, 0.0, 0.32326337400000005, 5)
                                                                                                      var"#transform!#638"(ungroup::Bool, renamecols::Bool, ::typeof(transform!), gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s284"} where var"#s284"<:Pair}...) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:769 => SnoopCompile.PGDSData(0.0, 0.0, 0.124490108, 19)
                                                                                                                                                                                                                                                                                                         (agg::DataFrames.Aggregate{typeof(Statistics.var), C} where C)(incol::AbstractVector{T} where T, gd::GroupedDataFrame) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/fastaggregates.jl:263 => SnoopCompile.PGDSData(0.0, 0.0, 0.17673247900000003, 39)
                                                                                                                                                                                                                                                                                                                                                                                                                         wrap(x) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/callprocessing.jl:25 => SnoopCompile.PGDSData(0.0, 0.0, 0.107918367, 69)
                                                                                                                                                                                                                                                                                                                                      hashrows_col!(h::Vector{UInt64}, n::Vector{Bool}, v::AbstractVector{T} where T, rp, firstcol::Bool) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/utils.jl:21 => SnoopCompile.PGDSData(0.0, 0.0, 0.22342205600000004, 39)
                                                                                                                                                                                                                                                                                                                                           (::DataFrames.var"#select##kw")(::Any, ::typeof(select), df::AbstractDataFrame, args...) in DataFrames at /home/tim/.julia/dev/DataFrames/src/abstractdataframe/selection.jl:940 => SnoopCompile.PGDSData(0.0, 0.0, 0.103290706, 126)
                                                                                                                                                                                                                                                                                                                                                                           (::DataFrames.var"#397#414")(col, refpool, missingind, nmind) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/utils.jl:339 => SnoopCompile.PGDSData(0.0, 0.0, 0.27375607799999996, 51)
                                                                                                                                                                                                                                                                                                                                                                getindex(df::DataFrame, row_ind::Integer, col_ind::Union{Signed, Unsigned}) in DataFrames at /home/tim/.julia/dev/DataFrames/src/dataframe/dataframe.jl:462 => SnoopCompile.PGDSData(0.0, 0.0, 0.206627946, 81)
                                                                                                                                                                                                                                                                                                                                                                                             (::DataFrames.var"#667#669")(k) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:145 => SnoopCompile.PGDSData(0.0, 0.0, 0.135824883, 131)
                                                                                                                                                                                                                                                          var"#fillfirst!#581"(rev::Bool, ::typeof(DataFrames.fillfirst!), condf, outcol::AbstractVector{T} where T, incol::AbstractVector{T} where T, gd::GroupedDataFrame) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:131 => SnoopCompile.PGDSData(0.0, 0.0, 0.29025164800000014, 68)
                                                                                                                                                                                                                                                                                                                                     groupreduce_init(op, condf, adjust, incol::AbstractVector{U}, gd::GroupedDataFrame) where U in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/fastaggregates.jl:77 => SnoopCompile.PGDSData(0.0, 0.0, 0.147357272, 163)
                                                                                                                                    (::DataFrames.var"#select##kw")(::Any, ::typeof(select), gd::GroupedDataFrame, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s179"} where var"#s179"<:Pair}...) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:722 => SnoopCompile.PGDSData(0.0, 0.0, 0.282880664, 98)
                                                                                         var"#select#632"(copycols::Bool, keepkeys::Bool, ungroup::Bool, renamecols::Bool, ::typeof(select), gd::GroupedDataFrame, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s284"} where var"#s284"<:Pair}...) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:722 => SnoopCompile.PGDSData(0.0, 0.0, 0.24843195300000004, 127)
                                                                                                                                                                                                                                                             row_group_slots(cols::Tuple{Vararg{AbstractVector{T} where T, N} where N}, hash::Val, groups::Union{Nothing, Vector{Int64}}, skipmissing::Bool, sort::Union{Nothing, Bool}) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/utils.jl:184 => SnoopCompile.PGDSData(0.0, 0.0, 0.19911604800000016, 177)
                                                                                                                                                                                                                                                                                                                                                   hashrows(cols::Tuple{Vararg{AbstractVector{T} where T, N} where N}, skipmissing::Bool) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/utils.jl:59 => SnoopCompile.PGDSData(0.0, 0.0, 0.2374651519999999, 151)
                                                                                                                                                                                                                                                                                                    view(adf::AbstractDataFrame, rowinds, colinds::Union{Colon, Regex, AbstractVector{T} where T, All, Between, Cols, InvertedIndex}) in DataFrames at /home/tim/.julia/dev/DataFrames/src/subdataframe/subdataframe.jl:136 => SnoopCompile.PGDSData(0.0, 0.0, 0.13630385300000003, 263)
                                                                                                                                                                                                                                                                                                                                                                        _names_match(r1, t::Tuple{Vararg{Symbol, N}}) where N in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:84 => SnoopCompile.PGDSData(0.0, 0.0, 0.18915692899999992, 201)
                                                                                                                                                      do_call(f::Union{Function, Type}, idx::AbstractVector{var"#s285"} where var"#s285"<:Integer, starts::AbstractVector{var"#s284"} where var"#s284"<:Integer, ends::AbstractVector{var"#s180"} where var"#s180"<:Integer, gd::GroupedDataFrame, incols::Nothing, i::Integer) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/callprocessing.jl:137 => SnoopCompile.PGDSData(0.0, 0.0, 0.196004527, 233)
                                                                                                                                                                                                                                                                                                                                                                    SubDataFrame(parent::DataFrame, rows::AbstractVector{Int64}, cols) in DataFrames at /home/tim/.julia/dev/DataFrames/src/subdataframe/subdataframe.jl:76 => SnoopCompile.PGDSData(0.0, 0.0, 0.21583261100000004, 263)
                                                                                                                                                                                       isagg(::Pair{var"#s284", var"#s180"} where {var"#s284"<:Union{AbstractString, Signed, Symbol, Unsigned}, var"#s180"<:(Pair{var"#s179", var"#s55"} where {var"#s179", var"#s55"<:Union{AbstractString, Symbol}})}, gdf::GroupedDataFrame) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/fastaggregates.jl:310 => SnoopCompile.PGDSData(0.0, 0.0, 0.21159961600000013, 310)
                                                                                                                                                                                                                                                                                                                                                                                                    lookupname(l::Dict{Symbol, Int64}, idx::Symbol) in DataFrames at /home/tim/.julia/dev/DataFrames/src/other/index.jl:283 => SnoopCompile.PGDSData(0.0, 0.0, 0.155423984, 473)
(::Core.var"#Type##kw")(::Any, ::Type{DataFrame}, columns::Union{Vector{Any}, Vector{AbstractVector{T} where T}}, colindex::DataFrames.Index) in DataFrames at /home/tim/.julia/dev/DataFrames/src/dataframe/dataframe.jl:170 => SnoopCompile.PGDSData(0.0, 0.0, 0.23239490200000001, 362)
getproperty(df::AbstractDataFrame, col_ind::Symbol) in DataFrames at /home/tim/.julia/dev/DataFrames/src/abstractdataframe/abstractdataframe.jl:356 => SnoopCompile.PGDSData(0.0, 0.0, 0.12287589699999993, 703)
row_group_slots(cols::Tuple{Vararg{AbstractVector{T} where T, N}}, refpools::Tuple{Vararg{AbstractVector{T} where T, N}}, refarrays::Tuple{Vararg{Union{Missings.EachReplaceMissing{var"#s284", U} where {var"#s284"<:(AbstractVector{var"#s180"} where var"#s180"<:Union{Missing, Real}), U}, AbstractVector{var"#s285"} where var"#s285"<:Real}, N}}, hash::Val{false}, groups::Vector{Int64}, skipmissing::Bool, sort::Bool) where N in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/utils.jl:254 => SnoopCompile.PGDSData(0.0, 0.0, 1.061802504, 92)
fill_row!(row, outcols::Tuple{Vararg{AbstractVector{T} where T, N}}, i::Integer, colstart::Integer, colnames::Tuple{Vararg{Symbol, N}}) where N in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:93 => SnoopCompile.PGDSData(0.0, 0.0, 0.480759434, 323)
(::DataFrames.var"#618#624")() in DataFrames at threadingconstructs.jl:169 => SnoopCompile.PGDSData(0.0, 0.0, 0.34025604699999995, 681)
groupreduce!_helper(res::AbstractVector{T} where T, f, op, condf, adjust, checkempty::Bool, incol::AbstractVector{T} where T, groups::Vector{Int64}, counts::Vector{Int64}, batches) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/fastaggregates.jl:160 => SnoopCompile.PGDSData(0.0, 0.0, 1.133820778, 628)
var"#DataFrame#171"(kwargs, ::Type{DataFrame}) in DataFrames at /home/tim/.julia/dev/DataFrames/src/dataframe/dataframe.jl:288 => SnoopCompile.PGDSData(0.0, 0.0, 1.4798948850000004, 725)
do_call(f::Union{Function, Type}, idx::AbstractVector{var"#s285"} where var"#s285"<:Integer, starts::AbstractVector{var"#s284"} where var"#s284"<:Integer, ends::AbstractVector{var"#s180"} where var"#s180"<:Integer, gd::GroupedDataFrame, incols::Tuple{AbstractVector{T} where T}, i::Integer) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/callprocessing.jl:90 => SnoopCompile.PGDSData(0.0, 0.0, 1.1230623329999998, 996)

These may be your best candidates for reduced specialization, with the important caveat that a test suite seems unlikely to be a particularly good example of the kind of workload you'd like to optimize for. (It would be better to re-run this with something more realistic.)
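For reference, the standard tool for reducing specialization is @nospecialize. Here is a minimal sketch of its effect (the roundtrip_* functions are made up for illustration and are not part of DataFrames):

```julia
using MethodAnalysis  # for methodinstances

# Specialized: Julia compiles one MethodInstance per concrete argument type.
roundtrip_spec(x) = x

# Despecialized: @nospecialize asks the compiler to compile a single method
# body for the declared (::Any) signature and reuse it for all inputs.
roundtrip_nospec(@nospecialize(x)) = x

for v in (1, 1.0, "a", 1//2)
    roundtrip_spec(v)
    roundtrip_nospec(v)
end

length(methodinstances(roundtrip_spec))    # one instance per argument type
length(methodinstances(roundtrip_nospec))  # a single ::Any instance
```

Whether this actually reduces latency depends on how much of the downstream work depends on the argument type.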

Interestingly, the dot with the highest inference time (but also significant runtime, nearly 10s of each) is _combine_rows_with_first_task!, so https://github.com/timholy/SnoopCompile.jl/issues/251#issuecomment-886207826 may be relevant.

NHDaly commented 3 years ago

Excellent. Thank you! Please ping me in the PR.

@bkamins: I've opened a PR for it!: https://github.com/timholy/SnoopCompile.jl/pull/252

:) I haven't been working on this kind of stuff in a while, so I would appreciate any help! :)

bkamins commented 3 years ago

Thank you for your comments and tips. I will think about it.

Just two quick remarks:

Can you give me a workload that is guaranteed to call those?

The examples I gave:

select(g,  :n => x -> first(x)); # calls _combine_rows_with_first_task!
select(g,  :n => (n -> n .> mean(n))); # calls _combine_tables_with_first!

function outer(f, args...)
    for i = 1:n
        # do lots of work (independent of f)
        inner(f, otherargs...)
    end
end

Yes, I have thought about it. The problem is that we have:

function outer(f, args...)
   for i = 1:n
       val = inner(f, otherargs...)
       # do lots of work that depends on type of val, which if we moved it
       # to another function would require dynamic dispatch AFAICT
   end
end
timholy commented 3 years ago

to another function would require dynamic dispatch AFAICT

If outer is specialized on f and inner(f, otherargs...) is inferrable, you shouldn't need dynamic dispatch. The key issue is whether the type-diversity of val is lower than the type-diversity of f. Since the number of functions is limitless but the number of value types typically isn't, this might still be worthwhile.
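A sketch of that idea (all names here are hypothetical): keep the f-specialized caller short and move the val-dependent work behind a function barrier, so the expensive code is compiled once per value type rather than once per passed function:

```julia
@noinline inner(f::F, x) where {F} = f(x)  # ::F forces specialization on f

# The heavy work depends only on typeof(val), so it is compiled once per
# value type, no matter how many different functions f are passed in.
function heavy_work(val)
    s = val
    for _ in 1:3
        s += val   # stand-in for "lots of work that depends on typeof(val)"
    end
    return s
end

# outer is specialized per f, but its body is short and cheap to compile;
# the call to heavy_work is an ordinary dispatch on typeof(val), not on f.
function outer(f::F, x) where {F}
    val = inner(f, x)
    return heavy_work(val)
end

outer(sin, 1.0)  # compiles heavy_work(::Float64)
outer(cos, 1.0)  # reuses heavy_work(::Float64); only the tiny outer recompiles
```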

bkamins commented 3 years ago

is inferrable, you shouldn't need dynamic dispatch.

But when compiling outer, the compiler does not know the type of val, since it is the result of calling inner, whose return type depends on f, on which outer is not specialized. So how would this be possible?

I tried to test your reasoning against mine, but I hit the following problem (it seems I cannot use the standard macros to check what happens):

julia> using MethodAnalysis

julia> @noinline inner(f, x) = f(x)
inner (generic function with 1 method)

julia> post(val) = val
post (generic function with 1 method)

julia> function outer(f, x)
           val = inner(f, x)
           return post(val)
       end
outer (generic function with 1 method)

julia> outer(sin, 1)
0.8414709848078965

julia> outer(identity, "a")
"a"

julia> outer(cos, 1.0)
0.5403023058681398

julia> outer(-, 1//1)
-1//1

julia> methodinstances(outer) # UP TO THIS POINT - INCLUSIVE - I UNDERSTAND WHAT IS GOING ON
4-element Vector{Core.MethodInstance}:
 MethodInstance for outer(::Function, ::Int64)
 MethodInstance for outer(::Function, ::String)
 MethodInstance for outer(::Function, ::Float64)
 MethodInstance for outer(::Function, ::Rational{Int64})

julia> @code_warntype outer(sin, 1)
Variables
  #self#::Core.Const(outer)
  f::Core.Const(sin)
  x::Int64
  val::Float64      

Body::Float64       
1 ─      (val = Main.inner(f, x))
│   %2 = Main.post(val)::Float64
└──      return %2

julia> methodinstances(outer) # I DO NOT UNDERSTAND THIS
5-element Vector{Core.MethodInstance}:
 MethodInstance for outer(::Function, ::Int64)
 MethodInstance for outer(::Function, ::String)
 MethodInstance for outer(::Function, ::Float64)
 MethodInstance for outer(::Function, ::Rational{Int64})
 MethodInstance for outer(::typeof(sin), ::Int64)

julia> @time outer(cos, 1)
  0.000001 seconds
0.5403023058681398

julia> methodinstances(outer) # I DO NOT UNDERSTAND THIS EITHER
6-element Vector{Core.MethodInstance}:
 MethodInstance for outer(::Function, ::Int64)
 MethodInstance for outer(::Function, ::String)
 MethodInstance for outer(::Function, ::Float64)
 MethodInstance for outer(::Function, ::Rational{Int64})
 MethodInstance for outer(::typeof(sin), ::Int64)
 MethodInstance for outer(::typeof(cos), ::Int64)

So we seem to have an observer effect: merely using the @code_warntype or @time macro seems to change how compilation is performed.

Interestingly, in the H2O benchmarks (https://h2oai.github.io/db-benchmark/) macros are of course used to measure performance, so the question is whether using the macro affects the benchmark results :) (this is not a super important question, but it just occurred to me, so I thought I would write it down). CC @nalimilan

Thank you!

timholy commented 3 years ago

whose return value depends on f, on which outer is not specialized

No, my point is that this is a technique for allowing outer to be specialized on f with less compile cost. Short functions are quick to compile; long ones are not. So keep the body short on anything that just "passes f through" and it should still be OK to let it be specialized.

If you want to force specialization, beware of Julia's heuristics: all those outer(::Function, ...) instances are unspecialized with respect to the specific function; only the typeof(sin) variants are specialized. To force specialization, do

@noinline inner(f::F, x) where F = f(x)

function outer(f::F, x) where F
...

Adding an apparently-spurious type parameter is kind of like the opposite of adding @nospecialize. For your first methodinstances call I got this:

julia> methodinstances(outer)
4-element Vector{Core.MethodInstance}:
 MethodInstance for outer(::typeof(sin), ::Int64)
 MethodInstance for outer(::typeof(identity), ::String)
 MethodInstance for outer(::typeof(cos), ::Float64)
 MethodInstance for outer(::typeof(-), ::Rational{Int64})

Interesting about the observer effect. Cthulhu shows the same pattern. I guess we're explicitly asking for that specialized form.
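Condensing the experiment above into a minimal reproduction of the observer effect (assuming MethodAnalysis is available; I believe the mechanism is that reflection macros look up, and thereby create, the fully concrete MethodInstance):

```julia
using MethodAnalysis, InteractiveUtils

g(f, x) = f(x)   # no ::F annotation, so Julia's heuristics despecialize on f

g(sin, 1)
methodinstances(g)   # only the despecialized g(::Function, ::Int64)

# @code_warntype requests code for the fully concrete signature
# Tuple{typeof(g), typeof(sin), Int64}, creating a new MethodInstance:
@code_warntype g(sin, 1)
methodinstances(g)   # now also contains g(::typeof(sin), ::Int64)
```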

bkamins commented 3 years ago

So keep the body short on anything that just "passes f through" and it should still be OK to let it be specialized.

Yes, this is clear. I had somehow forgotten that this was the point.

function outer(f::F, x) where F

This is also clear; I wanted to know about the case where specialization is not wanted.

I guess we're explicitly asking for that specialized form.

The strangest thing is that @time also forces specialization.

Thank you again :).