bkamins opened this issue 3 years ago
@bkamins you can already dig into LLVM optimization and native codegen timings, via `@snoopl` and `@snoopc`.
We haven't written big blog posts about those yet, I think, but there is at least some information in the documentation site:
Part of the difficulty, though, is that the signal you get from those two measurements is less useful than the signal you get from `@snoopi_deep`.
This is because Julia currently performs type inference via a depth-first traversal, so by measuring trace spans we get a nice flamegraph that you can look at to understand both inclusive and exclusive inference timings.
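As a mental model of that accounting (a toy sketch, nothing SnoopCompile-specific): in a depth-first trace, a span's exclusive time is its inclusive time minus the inclusive times of its direct children.

```julia
# Toy model of inclusive/exclusive timing in a depth-first trace tree.
struct Span
    inclusive::Float64     # total time spent in this span, children included
    children::Vector{Span}
end

# Exclusive time = inclusive time minus the time attributed to children.
exclusive(s::Span) = s.inclusive - sum(c.inclusive for c in s.children; init=0.0)

root = Span(1.0, [Span(0.4, Span[]), Span(0.3, Span[])])
exclusive(root)  # time spent in the root itself, not in its children
```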
For LLVM optimization, on the other hand, we pass functions through LLVM optimization in a somewhat random order, batched together arbitrarily, which makes it more difficult to attribute optimization time to individual functions. `@snoopl` currently emits timings per module, and the type signatures for the method instances in that module. And finally, I believe `@snoopc` provides a mapping of the time to codegen each method instance.
Hope that's helpful! :) If you find any of that useful, documentation PRs are always welcome ❤️ :)
> you already currently can dig into LLVM optimization and native codegen timings, via `@snoopl` and `@snoopc`.
The problem is that they start a new Julia session and I need to run them in an already running session so that I do not measure the cost of first compilation. I am interested only in consecutive compilation times. Is this possible?
Reproducible example on Julia nightly:
```
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.0-DEV.262 (2021-07-24)
 _/ |\__'_|_|_|\__'_|  |  Commit 0cbb6850fb (0 days old master)
|__/                   |
```
```julia
julia> using DataFrames, SnoopCompile, Statistics

julia> d = DataFrame(n = 1:20, x = [3, 3, 3, 3, 1, 1, 1, 2, 1, 1,
                                    2, 1, 1, 2, 2, 2, 3, 1, 1, 2]);

julia> g = groupby(d, :x);

julia> subset(g, :n => (n -> n .> mean(n))); # compile

julia> select(g, :n => (n -> n .> mean(n))); # compile

julia> @snoopi_deep begin
           subset(g, :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.117688/0.189961 on Core.Compiler.Timings.ROOT() with 15 direct children

julia> @snoopi_deep begin
           subset(g, :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.111079/0.182713 on Core.Compiler.Timings.ROOT() with 15 direct children

julia> @snoopi_deep begin
           subset(g, :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.113283/0.185833 on Core.Compiler.Timings.ROOT() with 15 direct children

julia> @snoopi_deep begin
           subset(g, :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.112419/0.182522 on Core.Compiler.Timings.ROOT() with 15 direct children

julia> @snoopi_deep begin
           select(g, :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.085624/0.144820 on Core.Compiler.Timings.ROOT() with 7 direct children

julia> @snoopi_deep begin
           select(g, :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.082723/0.141071 on Core.Compiler.Timings.ROOT() with 7 direct children

julia> @snoopi_deep begin
           select(g, :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.081279/0.139118 on Core.Compiler.Timings.ROOT() with 7 direct children

julia> @snoopi_deep begin
           select(g, :n => (n -> n .> mean(n)))
       end
InferenceTimingNode: 0.081399/0.132495 on Core.Compiler.Timings.ROOT() with 7 direct children
```
And I am discussing the difference between ~0.11 for `subset` and ~0.08 for `select` (the first number reported; I do not care about inference now). I would like to know how to diagnose what causes the difference. Thank you!
Ah, I understand! Yes, I fully agree!! I actually have a branch on my laptop that adds support for this, but never pushed it up for some dumb reason...
I'm out for the day, but when I get home tomorrow I'll push the branch up and open a PR, and maybe you can help me wrap it up and get it over the finish line? :)
Basically it's just adding a flag to all those functions that controls whether to run the measurements in a separate process or in the current one.
And then I had a mega-function, I think I called it `@snoo_all`, which runs all the analyses together in the same process. :)
Will find it and push it up tomorrow!
Excellent. Thank you! Please ping me in the PR.
There's definitely some weird recompilation going on.

```julia
@snoopi_deep begin
    select(g, :n => (n -> n .> mean(n)))
end
```

seems to reinfer

```julia
mean(::SubArray{Int64, 1, Vector{Int64}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false})
```

every time. It doesn't seem to be constant-prop. However,
```julia
julia> function mymean(v)
           s = zero(eltype(v))
           for item in v
               s += item
           end
           return s/length(v)
       end
```

doesn't show exactly the same thing (the broadcasting part gets reinferred, but not `mymean`).
(You can display the result in a flamegraph and click on the bars in ProfileView.) You can also find it in `print_tree(tinf)`.
I still don't know what's going on at the level of the compiler, but if I make this change:
```diff
~/.julia/dev/DataFrames$ git diff
diff --git a/src/groupeddataframe/callprocessing.jl b/src/groupeddataframe/callprocessing.jl
index 740a9df7..d0c079b4 100644
--- a/src/groupeddataframe/callprocessing.jl
+++ b/src/groupeddataframe/callprocessing.jl
@@ -87,7 +87,7 @@ function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
     end
 end

-function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
+function do_call(@nospecialize(f::Base.Callable), idx::AbstractVector{<:Integer},
                  starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                  gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer)
     idx = view(idx, starts[i]:ends[i])
```
then I cut the time roughly in half and I don't recompile `mean`. The resulting inference profile shows that inference is no longer a substantial bottleneck.
But if you do `@nospecialize` here, we will have a dynamic dispatch inside this function (and `do_call` is the only compute-intensive part of the whole process).
I think it is better to benchmark the decision about adding `@nospecialize` using this source data frame:

```julia
d = DataFrame(n = 1:100_000_000, x = repeat(1:1_000_000, 100));
```

as with it we call `do_call` 10^6 times.
In this case this indeed improves performance (which is strange). Released version timings:

```julia
julia> @time select(g, :n => (n -> n .> mean(n)));
  3.737252 seconds (9.80 M allocations: 2.932 GiB, 19.78% gc time, 5.48% compilation time)

julia> @time select(g, :n => (n -> n .> mean(n)));
  3.757369 seconds (9.80 M allocations: 2.932 GiB, 18.96% gc time, 5.96% compilation time)
```

Your proposal with `@nospecialize` timings:

```julia
julia> @time select(g, :n => (n -> n .> mean(n)));
  3.694405 seconds (10.07 M allocations: 2.971 GiB, 16.48% gc time, 3.56% compilation time)

julia> @time select(g, :n => (n -> n .> mean(n)));
  3.699847 seconds (10.07 M allocations: 2.971 GiB, 19.24% gc time, 3.44% compilation time)
```
But if we use a simpler function the situation is reversed. Released version timings:

```julia
julia> @time select(g, :n => x -> first(x));
  2.427769 seconds (187.73 k allocations: 3.751 GiB, 18.02% gc time, 6.28% compilation time)

julia> @time select(g, :n => x -> first(x));
  2.475404 seconds (187.72 k allocations: 3.751 GiB, 22.28% gc time, 5.05% compilation time)
```

Your proposal with `@nospecialize` timings:

```julia
julia> @time select(g, :n => x -> first(x));
  2.563807 seconds (2.19 M allocations: 3.840 GiB, 20.53% gc time, 4.47% compilation time)

julia> @time select(g, :n => x -> first(x));
  2.664046 seconds (2.19 M allocations: 3.840 GiB, 20.78% gc time, 4.52% compilation time)
```
Thanks for specifying a case where the performance might matter. I don't think `do_call` is such a performance hotspot: it is getting called by dynamic dispatch from within `_combine_process_pair`, so I don't think it matters if you just defer that runtime dispatch to something within `do_call`. The ideal circumstance is that you delay runtime dispatch until something that is intrinsically expensive; from what I can tell, `do_call` just calls `f(column_view)`, so it's not doing anything element-wise and shouldn't be performance-sensitive.
I hadn't noticed, until I found the caller by inserting an error, that you're using tasks. That may help me figure out an MWE for the compiler issue.
`_combine_process_pair` calls it only once so it does not matter, but it gets called in a hot loop in two other places, and I would assume that in those two places it matters to avoid dynamic dispatch, as in both of them `do_call` should be called via static dispatch (and this is what I understand happens).
The call in `_combine_process_pair` is only needed to determine the type of the result returned by `do_call`, so that we can choose a proper code-processing path later. These two other call sites of `do_call` actually relate to these paths (one is for processing cases when data with multiple rows is returned by `do_call`, and the other for all cases when `do_call` returns one row of data).
If you start Julia with multiple threads, these operations do indeed support multi-threading.
Can you give me a workload that is guaranteed to call those? Maybe this one does, but if so:
```julia
julia> using MethodAnalysis

julia> include("select_subset_test.jl")

julia> mctwf = methods(DataFrames._combine_tables_with_first!).ms[1]
_combine_tables_with_first!(first::Union{AbstractDataFrame, NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}, outcols::Tuple{Vararg{AbstractVector, N}}, idx::Vector{Int64}, rowstart::Integer, colstart::Integer, f::Union{Function, Type}, gd::GroupedDataFrame, incols::Union{Nothing, Tuple, AbstractVector, NamedTuple}, colnames::Tuple{Vararg{Symbol, N}}, firstmulticol::DataFrames.FirstColCount) where N in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:354

julia> methodinstances(mctwf)
3-element Vector{Core.MethodInstance}:
 MethodInstance for DataFrames._combine_tables_with_first!(::NamedTuple{(:x1,), Tuple{BitVector}}, ::Tuple{Vector{Bool}}, ::Vector{Int64}, ::Int64, ::Int64, ::Function, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Tuple{Symbol}, ::DataFrames.FirstSingleCol)
 MethodInstance for DataFrames._combine_tables_with_first!(::NamedTuple{(:x1,), _A} where _A<:Tuple{AbstractVector}, ::Tuple{AbstractVector}, ::Vector{Int64}, ::Int64, ::Int64, ::Function, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Tuple{Symbol}, ::DataFrames.FirstSingleCol)
 MethodInstance for DataFrames._combine_tables_with_first!(::Union{AbstractDataFrame, NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}, ::Tuple{Vararg{AbstractVector, N}}, ::Vector{Int64}, ::Int64, ::Int64, ::Union{Function, Type}, ::GroupedDataFrame{DataFrame}, ::Union{Nothing, Tuple, NamedTuple, AbstractVector}, ::Tuple{Vararg{Symbol, N}}, ::Union{DataFrames.FirstMultiCol, DataFrames.FirstSingleCol}) where N
```
so it doesn't seem to specialize on `f` (easiest to find as the argument after the two `Int`s) despite the absence of `@nospecialize` (Julia's heuristics are probably at play). But if I'm not really calling it, then this could be misleading. I did comment out DataFrames' precompilation directives just to ensure that all specialization happens by natural dispatch. UPDATE: I ran the DataFrames "grouping.jl" test script and then got this:
```julia
julia> utyps = Base.IdSet(); for mi in methodinstances(mctwf)
           push!(utyps, Base.unwrap_unionall(mi.specTypes).parameters[7])
       end

julia> utyps
Base.IdSet{Any} with 2 elements:
  Function
  Union{Function, Type}
```
There were 153 MethodInstances, so this seems likely to be all the type-diversity for that argument.
For the other, it's definitely not being called in this workload (there are no MethodInstances). UPDATE: after running the tests there are 1343 MethodInstances (wow!) with 317 different types for `f`. So this one is being specialized.
Bottom line: if it's not specialized on the specific function, then runtime dispatch gets used, and in such cases there should be no cost to deferring that dispatch until you get to the actual work performed by the function.
In your benchmarking, be careful of potential confusion from world-splitting. If you have 3 or fewer applicable MethodInstances, Julia will use world-splitting, but define a 4th method and it will use runtime dispatch. That can distort your estimate of performance hotspots. You may have to decide whether you want to optimize the very first use, or the use case where the user has supplied many different functions in a single session. In the demo above, `do_call` is now over the threshold:
```julia
julia> m = methods(DataFrames.do_call).ms[2]
do_call(f::Union{Function, Type}, idx::AbstractVector{<:Integer}, starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/callprocessing.jl:90

julia> methodinstances(m)
5-element Vector{Core.MethodInstance}:
 MethodInstance for DataFrames.do_call(::DataFrames.var"#496#497"{var"#11#12"}, ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
 MethodInstance for DataFrames.do_call(::Function, ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
 MethodInstance for DataFrames.do_call(::var"#13#14", ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
 MethodInstance for DataFrames.do_call(::DataFrames.var"#496#497"{var"#15#16"}, ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
 MethodInstance for DataFrames.do_call(::var"#17#18", ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
```
but isn't if you run fewer tests. However, apply my `@nospecialize`, restart your session, do it again, and then this is what you get:
```julia
julia> methodinstances(m)
2-element Vector{Core.MethodInstance}:
 MethodInstance for DataFrames.do_call(::Function, ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
 MethodInstance for DataFrames.do_call(::Union{Function, Type}, ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::GroupedDataFrame{DataFrame}, ::Tuple{Vector{Int64}}, ::Int64)
```
This is few enough to ensure that this method of `do_call` will be called with world-splitting, which might actually give some performance advantages.
By the way, when you might want to do "pass-through" specialization (`outer(f, args...)` just passes `f` down to `inner(f, otherargs...)` without calling `f` directly), one way to reduce latency would be to take something like this:
```julia
function outer(f, args...)
    for i = 1:n
        # do lots of work (independent of f)
        inner(f, otherargs...)
    end
end
```
and move the `# do lots of work` part to a separate `@noinline` function that doesn't receive `f` as an argument. That will allow all that work to be compiled many fewer times (without needing to specialize on `f`, which it doesn't use), and keep `outer` short and quick to compile, so that specialization is less painful.
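A minimal sketch of that refactor (hypothetical names throughout; `expensive_work!` stands in for the f-independent body, and is not a DataFrames function):

```julia
# Sketch: hoist the f-independent work behind a @noinline barrier so it is
# compiled once, while `outer` stays short and cheap to specialize on f.
@noinline function expensive_work!(acc, i)
    # lots of work independent of f: compiled once, no matter how many
    # different functions `outer` is called with
    push!(acc, i^2)
    return acc
end

@noinline inner(f, x) = f(x)

function outer(f, xs)                # short body, so specializing on f is cheap
    acc = Int[]
    out = Vector{Any}(undef, length(xs))
    for (i, x) in enumerate(xs)
        expensive_work!(acc, i)      # bulky, f-independent part
        out[i] = inner(f, x)         # only this thin wrapper depends on f
    end
    return acc, out
end
```

For example, `outer(abs, [-1, -2, 3])` returns `([1, 4, 9], Any[1, 2, 3])`; recompiling `outer` for a new `f` no longer recompiles the expensive part.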
Running this script:

```julia
using DataFrames, CategoricalArrays
using SnoopCompile, Profile
using PyPlot: PyPlot, plt

delay = 0.1
Profile.init(n = 10^7, delay = delay)
tinf = @snoopi_deep include("grouping.jl")
@profile include("grouping.jl")
mref, ax = pgdsgui(tinf; by=exclusive, t0=delay)
```

from within `DataFrames/test` I get this:
You can also do some analysis:

```julia
# To see a list of DataFrames methods that have run times (within our
# poor temporal resolution of 0.1s) less than 1/10th of their inference time
# and an inference time of at least 0.1s:
ridata = runtime_inferencetime(tinf; by=exclusive)
filter(ridata) do pr
    m, rid = pr
    (rid.trun < rid.tinf/10 && rid.tinf > delay) || return false
    return isa(m, Method) ? m.module === DataFrames : occursin("DataFrames", m.file)
end
```
which yields
```julia
29-element Vector{Pair{Union{Method, SnoopCompile.MethodLoc}, SnoopCompile.PGDSData}}:
 transform!(gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s180"} where var"#s180"<:Pair}...; ungroup, renamecols) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:769 => SnoopCompile.PGDSData(0.0, 0.0, 0.32326337400000005, 5)
 var"#transform!#638"(ungroup::Bool, renamecols::Bool, ::typeof(transform!), gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s284"} where var"#s284"<:Pair}...) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:769 => SnoopCompile.PGDSData(0.0, 0.0, 0.124490108, 19)
 (agg::DataFrames.Aggregate{typeof(Statistics.var), C} where C)(incol::AbstractVector{T} where T, gd::GroupedDataFrame) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/fastaggregates.jl:263 => SnoopCompile.PGDSData(0.0, 0.0, 0.17673247900000003, 39)
 wrap(x) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/callprocessing.jl:25 => SnoopCompile.PGDSData(0.0, 0.0, 0.107918367, 69)
 hashrows_col!(h::Vector{UInt64}, n::Vector{Bool}, v::AbstractVector{T} where T, rp, firstcol::Bool) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/utils.jl:21 => SnoopCompile.PGDSData(0.0, 0.0, 0.22342205600000004, 39)
 (::DataFrames.var"#select##kw")(::Any, ::typeof(select), df::AbstractDataFrame, args...) in DataFrames at /home/tim/.julia/dev/DataFrames/src/abstractdataframe/selection.jl:940 => SnoopCompile.PGDSData(0.0, 0.0, 0.103290706, 126)
 (::DataFrames.var"#397#414")(col, refpool, missingind, nmind) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/utils.jl:339 => SnoopCompile.PGDSData(0.0, 0.0, 0.27375607799999996, 51)
 getindex(df::DataFrame, row_ind::Integer, col_ind::Union{Signed, Unsigned}) in DataFrames at /home/tim/.julia/dev/DataFrames/src/dataframe/dataframe.jl:462 => SnoopCompile.PGDSData(0.0, 0.0, 0.206627946, 81)
 (::DataFrames.var"#667#669")(k) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:145 => SnoopCompile.PGDSData(0.0, 0.0, 0.135824883, 131)
 var"#fillfirst!#581"(rev::Bool, ::typeof(DataFrames.fillfirst!), condf, outcol::AbstractVector{T} where T, incol::AbstractVector{T} where T, gd::GroupedDataFrame) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:131 => SnoopCompile.PGDSData(0.0, 0.0, 0.29025164800000014, 68)
 groupreduce_init(op, condf, adjust, incol::AbstractVector{U}, gd::GroupedDataFrame) where U in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/fastaggregates.jl:77 => SnoopCompile.PGDSData(0.0, 0.0, 0.147357272, 163)
 (::DataFrames.var"#select##kw")(::Any, ::typeof(select), gd::GroupedDataFrame, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s179"} where var"#s179"<:Pair}...) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:722 => SnoopCompile.PGDSData(0.0, 0.0, 0.282880664, 98)
 var"#select#632"(copycols::Bool, keepkeys::Bool, ungroup::Bool, renamecols::Bool, ::typeof(select), gd::GroupedDataFrame, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s284"} where var"#s284"<:Pair}...) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/splitapplycombine.jl:722 => SnoopCompile.PGDSData(0.0, 0.0, 0.24843195300000004, 127)
 row_group_slots(cols::Tuple{Vararg{AbstractVector{T} where T, N} where N}, hash::Val, groups::Union{Nothing, Vector{Int64}}, skipmissing::Bool, sort::Union{Nothing, Bool}) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/utils.jl:184 => SnoopCompile.PGDSData(0.0, 0.0, 0.19911604800000016, 177)
 hashrows(cols::Tuple{Vararg{AbstractVector{T} where T, N} where N}, skipmissing::Bool) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/utils.jl:59 => SnoopCompile.PGDSData(0.0, 0.0, 0.2374651519999999, 151)
 view(adf::AbstractDataFrame, rowinds, colinds::Union{Colon, Regex, AbstractVector{T} where T, All, Between, Cols, InvertedIndex}) in DataFrames at /home/tim/.julia/dev/DataFrames/src/subdataframe/subdataframe.jl:136 => SnoopCompile.PGDSData(0.0, 0.0, 0.13630385300000003, 263)
 _names_match(r1, t::Tuple{Vararg{Symbol, N}}) where N in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:84 => SnoopCompile.PGDSData(0.0, 0.0, 0.18915692899999992, 201)
 do_call(f::Union{Function, Type}, idx::AbstractVector{var"#s285"} where var"#s285"<:Integer, starts::AbstractVector{var"#s284"} where var"#s284"<:Integer, ends::AbstractVector{var"#s180"} where var"#s180"<:Integer, gd::GroupedDataFrame, incols::Nothing, i::Integer) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/callprocessing.jl:137 => SnoopCompile.PGDSData(0.0, 0.0, 0.196004527, 233)
 SubDataFrame(parent::DataFrame, rows::AbstractVector{Int64}, cols) in DataFrames at /home/tim/.julia/dev/DataFrames/src/subdataframe/subdataframe.jl:76 => SnoopCompile.PGDSData(0.0, 0.0, 0.21583261100000004, 263)
 isagg(::Pair{var"#s284", var"#s180"} where {var"#s284"<:Union{AbstractString, Signed, Symbol, Unsigned}, var"#s180"<:(Pair{var"#s179", var"#s55"} where {var"#s179", var"#s55"<:Union{AbstractString, Symbol}})}, gdf::GroupedDataFrame) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/fastaggregates.jl:310 => SnoopCompile.PGDSData(0.0, 0.0, 0.21159961600000013, 310)
 lookupname(l::Dict{Symbol, Int64}, idx::Symbol) in DataFrames at /home/tim/.julia/dev/DataFrames/src/other/index.jl:283 => SnoopCompile.PGDSData(0.0, 0.0, 0.155423984, 473)
 (::Core.var"#Type##kw")(::Any, ::Type{DataFrame}, columns::Union{Vector{Any}, Vector{AbstractVector{T} where T}}, colindex::DataFrames.Index) in DataFrames at /home/tim/.julia/dev/DataFrames/src/dataframe/dataframe.jl:170 => SnoopCompile.PGDSData(0.0, 0.0, 0.23239490200000001, 362)
 getproperty(df::AbstractDataFrame, col_ind::Symbol) in DataFrames at /home/tim/.julia/dev/DataFrames/src/abstractdataframe/abstractdataframe.jl:356 => SnoopCompile.PGDSData(0.0, 0.0, 0.12287589699999993, 703)
 row_group_slots(cols::Tuple{Vararg{AbstractVector{T} where T, N}}, refpools::Tuple{Vararg{AbstractVector{T} where T, N}}, refarrays::Tuple{Vararg{Union{Missings.EachReplaceMissing{var"#s284", U} where {var"#s284"<:(AbstractVector{var"#s180"} where var"#s180"<:Union{Missing, Real}), U}, AbstractVector{var"#s285"} where var"#s285"<:Real}, N}}, hash::Val{false}, groups::Vector{Int64}, skipmissing::Bool, sort::Bool) where N in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/utils.jl:254 => SnoopCompile.PGDSData(0.0, 0.0, 1.061802504, 92)
 fill_row!(row, outcols::Tuple{Vararg{AbstractVector{T} where T, N}}, i::Integer, colstart::Integer, colnames::Tuple{Vararg{Symbol, N}}) where N in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/complextransforms.jl:93 => SnoopCompile.PGDSData(0.0, 0.0, 0.480759434, 323)
 (::DataFrames.var"#618#624")() in DataFrames at threadingconstructs.jl:169 => SnoopCompile.PGDSData(0.0, 0.0, 0.34025604699999995, 681)
 groupreduce!_helper(res::AbstractVector{T} where T, f, op, condf, adjust, checkempty::Bool, incol::AbstractVector{T} where T, groups::Vector{Int64}, counts::Vector{Int64}, batches) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/fastaggregates.jl:160 => SnoopCompile.PGDSData(0.0, 0.0, 1.133820778, 628)
 var"#DataFrame#171"(kwargs, ::Type{DataFrame}) in DataFrames at /home/tim/.julia/dev/DataFrames/src/dataframe/dataframe.jl:288 => SnoopCompile.PGDSData(0.0, 0.0, 1.4798948850000004, 725)
 do_call(f::Union{Function, Type}, idx::AbstractVector{var"#s285"} where var"#s285"<:Integer, starts::AbstractVector{var"#s284"} where var"#s284"<:Integer, ends::AbstractVector{var"#s180"} where var"#s180"<:Integer, gd::GroupedDataFrame, incols::Tuple{AbstractVector{T} where T}, i::Integer) in DataFrames at /home/tim/.julia/dev/DataFrames/src/groupeddataframe/callprocessing.jl:90 => SnoopCompile.PGDSData(0.0, 0.0, 1.1230623329999998, 996)
```
These may be your best candidates for reduced specialization, with the important caveat that a test suite seems unlikely to be a particularly good example of the kind of workload you'd like to optimize for. (It would be better to re-run this with something more realistic.)
Interestingly, the dot with the highest inference time (but also significant runtime, nearly 10s of each) is `_combine_rows_with_first_task!`, so https://github.com/timholy/SnoopCompile.jl/issues/251#issuecomment-886207826 may be relevant.
> Excellent. Thank you! Please ping me in the PR.
@bkamins: I've opened a PR for it: https://github.com/timholy/SnoopCompile.jl/pull/252
:) I haven't been working on this kind of stuff in a while, so I would appreciate any help! :)
Thank you for your comments and tips. I will think about it.
Just two quick remarks:
> Can you give me a workload that is guaranteed to call those?

The examples I gave:

```julia
select(g, :n => x -> first(x));          # calls _combine_rows_with_first_task!
select(g, :n => (n -> n .> mean(n)));    # calls _combine_tables_with_first!
```
```julia
function outer(f, args...)
    for i = 1:n
        # do lots of work (independent of f)
        inner(f, otherargs...)
    end
end
```
Yes, I have thought about it. The problem is that we have:

```julia
function outer(f, args...)
    for i = 1:n
        val = inner(f, otherargs...)
        # do lots of work that depends on the type of val, which if we moved it
        # to another function would require dynamic dispatch AFAICT
    end
end
```
> to another function would require dynamic dispatch AFAICT
If `outer` is specialized on `f` and `inner(f, otherargs...)` is inferrable, you shouldn't need dynamic dispatch. The key issue is whether the type-diversity of `val` is lower than the type-diversity of `f`. Since the number of functions is limitless but the number of value types typically isn't, this might still be worthwhile.
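A sketch of this idea (hypothetical `inner`/`process` names): even when `f` itself is reached via runtime dispatch inside `inner`, the follow-up work dispatches only on `typeof(val)`, and the set of value types is usually small.

```julia
# Hypothetical sketch: f's type-diversity is unbounded, val's is not.
@noinline inner(f, x) = f(x)     # any runtime dispatch on f happens here

process(val::Int)     = val + 1  # the "lots of work", specialized on the
process(val::Float64) = 2val     # value type only: a handful of methods

function outer(f, x)
    val = inner(f, x)            # val's type may be unknown at compile time...
    return process(val)          # ...so this is one dynamic dispatch per call,
end                              # but `process` compiles once per val type
```

For example, `outer(identity, 3)` gives `4` and `outer(float, 2)` gives `4.0`; adding a brand-new `f` never forces `process` to be recompiled.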
> is inferrable, you shouldn't need dynamic dispatch.
But the compiler, when compiling `outer`, does not know the type of `val`, as it is the result of calling `inner`, whose return value depends on `f`, on which `outer` is not specialized. So how would it be possible?
I tried to verify your thinking against mine, but I hit the following problem (i.e. it seems I cannot use standard macros to check what happens):
```julia
julia> using MethodAnalysis

julia> @noinline inner(f, x) = f(x)
inner (generic function with 1 method)

julia> post(val) = val
post (generic function with 1 method)

julia> function outer(f, x)
           val = inner(f, x)
           return post(val)
       end
outer (generic function with 1 method)

julia> outer(sin, 1)
0.8414709848078965

julia> outer(identity, "a")
"a"

julia> outer(cos, 1.0)
0.5403023058681398

julia> outer(-, 1//1)
-1//1

julia> methodinstances(outer) # UP TO THIS POINT - INCLUSIVE - I UNDERSTAND WHAT IS GOING ON
4-element Vector{Core.MethodInstance}:
 MethodInstance for outer(::Function, ::Int64)
 MethodInstance for outer(::Function, ::String)
 MethodInstance for outer(::Function, ::Float64)
 MethodInstance for outer(::Function, ::Rational{Int64})

julia> @code_warntype outer(sin, 1)
Variables
  #self#::Core.Const(outer)
  f::Core.Const(sin)
  x::Int64
  val::Float64

Body::Float64
1 ─      (val = Main.inner(f, x))
│   %2 = Main.post(val)::Float64
└──      return %2

julia> methodinstances(outer) # I DO NOT UNDERSTAND THIS
5-element Vector{Core.MethodInstance}:
 MethodInstance for outer(::Function, ::Int64)
 MethodInstance for outer(::Function, ::String)
 MethodInstance for outer(::Function, ::Float64)
 MethodInstance for outer(::Function, ::Rational{Int64})
 MethodInstance for outer(::typeof(sin), ::Int64)

julia> @time outer(cos, 1)
  0.000001 seconds
0.5403023058681398

julia> methodinstances(outer) # I DO NOT UNDERSTAND THIS EITHER
6-element Vector{Core.MethodInstance}:
 MethodInstance for outer(::Function, ::Int64)
 MethodInstance for outer(::Function, ::String)
 MethodInstance for outer(::Function, ::Float64)
 MethodInstance for outer(::Function, ::Rational{Int64})
 MethodInstance for outer(::typeof(sin), ::Int64)
 MethodInstance for outer(::typeof(cos), ::Int64)
```
So we seem to have an observer effect: the fact that I used the `@code_warntype` or `@time` macros seems to change how compilation is performed.
Interestingly, in the H2O benchmarks (https://h2oai.github.io/db-benchmark/) macros are of course used to measure performance, so the question is whether using the macro affects the benchmarks :) (this is not a super important question, but it just occurred to me, so I thought of writing it down). CC @nalimilan
Thank you!
> whose return value depends on f, on which outer is not specialized
No, my point is that this is a technique for allowing `outer` to be specialized on `f` with less compile cost. Short functions are quick to compile; long ones are not. So keep the body short in anything that just "passes `f` through" and it should still be OK to let it be specialized.
If you want to force specialization, beware of Julia's heuristics; all those `outer(::Function, ...)` are unspecialized on the specific function. It's only the `typeof(sin)` variants that are specialized. When you want to force specialization, do
```julia
@noinline inner(f::F, x) where F = f(x)

function outer(f::F, x) where F
    ...
```
Adding an apparently-spurious type parameter is kind of like the opposite of adding `@nospecialize`. For your first `methodinstances` call I got this:
```julia
julia> methodinstances(outer)
4-element Vector{Core.MethodInstance}:
 MethodInstance for outer(::typeof(sin), ::Int64)
 MethodInstance for outer(::typeof(identity), ::String)
 MethodInstance for outer(::typeof(cos), ::Float64)
 MethodInstance for outer(::typeof(-), ::Rational{Int64})
```
Interesting about the observer effect. Cthulhu shows the same pattern. I guess we're explicitly asking for that specialized form.
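For reference, the two variants can be written side by side (a sketch with hypothetical `_plain`/`_forced` names; how many specialized instances you actually see depends on your Julia version's heuristics, which you can inspect with `MethodAnalysis.methodinstances`):

```julia
# Pass-through without a type parameter: Julia's heuristics may compile a
# single `::Function` instance, since f is never called in `outer_plain`.
@noinline inner_plain(f, x) = f(x)
outer_plain(f, x) = inner_plain(f, x)

# Adding `::F where F` opts back in to one specialization per function type.
@noinline inner_forced(f::F, x) where {F} = f(x)
outer_forced(f::F, x) where {F} = inner_forced(f, x)

# Both variants compute identical results; only compilation strategy differs.
```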
> So keep the body short on anything that just "passes f through" and it should still be OK to let it be specialized.
Yes, this is clear. I had somehow forgotten that this was the point.
> `function outer(f::F, x) where F`
This is also clear; I wanted to know about the case when specialization is not wanted.
> I guess we're explicitly asking for that specialized form.
The strangest thing is that `@time` also forces specialization.
Thank you again :).
If I understand things correctly, SnoopCompile.jl currently allows one to dig into inference timings. It would be great to be able to similarly dig into LLVM codegen and native codegen timings.
This is relevant for packages like DataFrames.jl, where in many cases LLVM codegen and/or native codegen is the chief cost of running some functions (and this is unavoidable, since for many functions users pass anonymous functions as an argument).
An example of such a case is here: https://github.com/JuliaData/DataFrames.jl/issues/2806