jeremiedb opened 1 year ago
Here are some other functions to benchmark, using DataFramesMeta.jl, which compiles down to the same code as DataFrames.jl but has nicer syntax (more in line with the tidyverse):
```julia
julia> using DataFramesMeta, Statistics  # Statistics provides mean

julia> function clean_tt(movies)
           @chain tidytable(movies) begin
               @filter(Year >= 2000)
               @group_by(Year)
               @summarize(Budget = mean(Budget, na.rm = TRUE))
               @mutate(Budget = Budget / 1e6)
               collect()
           end
       end;

julia> function clean_df(movies)
           @chain movies begin
               subset(:Year => (x -> x .>= 2000))
               groupby(:Year)
               combine(:Budget => (x -> mean(skipmissing(x))) => :Budget)
               transform(:Budget => (x -> x / 1e6) => :Budget)
           end
       end;

julia> function clean_dfm(movies)
           @chain movies begin
               @rsubset :Year >= 2000
               groupby(:Year)
               @combine :Budget = mean(skipmissing(:Budget))
               @rtransform :Budget = :Budget / 1e6
           end
       end;
```
Thanks! I'll review and fix. As I said on Twitter, this package shouldn't be taken too seriously, because it's more of a learning project for me. That said, I appreciate the advice on wrapping things in a function for benchmarking purposes.
Is the first run of the function any faster when wrapped in a function, or only subsequent runs, because of precompilation? I have read the Julia docs on optimization, so I would think the first run is still the same speed.
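To answer that question directly: wrapping the pipeline in a function does not speed up the first call, which still pays the compilation cost; only subsequent calls reuse the compiled code. A minimal sketch with a synthetic frame (the data and the `clean` helper are illustrative stand-ins, not the benchmark's actual code):

```julia
using DataFrames, Statistics

# Synthetic stand-in for the movies data (illustrative only)
df = DataFrame(Year = rand(1990:2005, 10_000), Budget = rand(10_000) .* 1e8)

# Hypothetical helper mirroring the pipeline discussed in this thread
clean(df) = combine(groupby(subset(df, :Year => ByRow(>=(2000))), :Year),
                    :Budget => mean => :Budget)

@time clean(df)   # first call: includes compiling clean and its specializations
@time clean(df)   # later calls: runtime only, so this is the number to compare
```

The second `@time` is the one that reflects the library's actual performance; the first mixes in one-off compilation work.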
I like both the DataFramesMeta and DataFrameMacros syntax. I wish there were a way to run a devectorized version of the code first and then vectorize only when required. Most of the time I want @rtransform, but if I'm standardizing a variable, as in x - mean(x), then I need to remember to use @transform. That's more of a Julia style thing (borrowed from Matlab) than anything specific to DataFrames.jl. Loving playing with Julia.
Thanks again.
Julia has no way to detect whether a transformation should be done column-wise or row-wise, unfortunately. You can't vectorize a function the way you do in R. But it might make sense to make row-wise the default and introduce @ctransform.
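The distinction can be illustrated on a toy frame: under @rtransform each column reference is a scalar per row, while under @transform it is the whole column, which is what a standardization like x - mean(x) needs. A small sketch (data and result names are illustrative):

```julia
using DataFramesMeta, Statistics

df = DataFrame(x = [1.0, 2.0, 3.0])

# Row-wise: :x refers to a single value on each row, so no broadcasting is needed
df2 = @rtransform df :y = :x / 1e6

# Column-wise: :x refers to the whole column, so aggregates like mean(:x) work
df3 = @transform df :z = :x .- mean(:x)
```

Forgetting the distinction fails loudly: `mean(:x)` inside @rtransform would try to take the mean of a single scalar per row rather than of the column.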
Let me add one more comment. This is the compiler-friendly way to write this transformation with pure DataFrames.jl (compilation was the major time cost in the README.md example):
```julia
julia> @time @chain movies begin
           subset(:Year => ByRow(>=(2000)), view=true)
           groupby(:Year)
           combine(:Budget => mean∘skipmissing => :Budget)
           transform!(:Budget => Base.Fix2(/, 1e6) => :Budget)
       end
  0.000462 seconds (670 allocations: 275.789 KiB)
6×2 DataFrame
 Row │ Year   Budget
     │ Int32  Float64
─────┼────────────────
   1 │  2000  23.9477
   2 │  2001  19.2356
   3 │  2002  19.3971
   4 │  2003  15.8683
   5 │  2004  13.9057
   6 │  2005  16.4682
```
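The reason this version is compiler-friendly: mean∘skipmissing and Base.Fix2(/, 1e6) are concrete callables built from existing functions, whereas each anonymous function like `x -> x / 1e6` is a brand-new type that must be compiled on first use. A small demonstration of the two callables in isolation:

```julia
using Statistics

# Composition yields a concrete Base.ComposedFunction, so DataFrames can hit
# already-compiled method instances instead of compiling a fresh closure
f = mean ∘ skipmissing
f([1.0, missing, 3.0])  # 2.0

# Base.Fix2(/, 1e6) is a concrete callable equivalent to x -> x / 1e6
g = Base.Fix2(/, 1e6)
g(2.0e6)                # 2.0
```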
Thank you so much! This is helpful for my learning, and it's very kind of you to take the time to share!
Also, as was commented on Slack, using filter is faster than subset (but I assume you wanted to use subset).
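For reference, the two spellings of the same row filter, on a toy frame (illustrative data):

```julia
using DataFrames

df = DataFrame(Year = [1999, 2001], Budget = [1.0, 2.0])

# filter takes the predicate first and iterates row by row
filter(:Year => >=(2000), df)

# subset is column-oriented; ByRow lifts the scalar predicate,
# and view=true avoids copying the selected rows
subset(df, :Year => ByRow(>=(2000)), view=true)
```

Both keep only the 2001 row; they differ in argument order, in how the predicate is applied, and in subset's extra options such as view=true.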
Following a conversation on Julia's Slack (https://julialang.slack.com/archives/C674VR0HH/p1674245762657489), it was pointed out that there were caveats in how the benchmark against DataFrames.jl was conducted.
By wrapping the operations into functions, it can be seen that DataFrames.jl actually significantly outperforms tidytable.
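One way to make that comparison robust is BenchmarkTools.jl (assuming it is installed), which runs the function many times after compilation and interpolates globals with $ so that global-variable overhead is excluded; the frame and helper below are synthetic stand-ins for the real benchmark:

```julia
using BenchmarkTools, DataFrames, Statistics

# Synthetic stand-in for the movies data (illustrative only)
movies = DataFrame(Year = rand(1990:2005, 58_000),
                   Budget = rand(58_000) .* 1e8)

# Hypothetical helper mirroring the DataFrames.jl pipeline in this thread
clean_df(df) = combine(groupby(subset(df, :Year => ByRow(>=(2000))), :Year),
                       :Budget => mean∘skipmissing => :Budget)

# $ interpolation keeps global-variable lookup out of the measurement
@btime clean_df($movies);
```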