Review benchmarks - Githubissues

jeremiedb commented 1 year ago

Following a conversation on Julia's Slack (https://julialang.slack.com/archives/C674VR0HH/p1674245762657489), it was raised that there might be caveats about how the benchmark against DataFrames.jl was conducted.

By wrapping operations into functions, it can be seen that DataFrames.jl is actually significantly outperforming tidytable.


using DataFrames, Chain, Statistics, BenchmarkTools  # f2 below also needs TidyTable

# base DataFrames.jl
function f0(df)
    _df = subset(df, :Year => (x -> x .>= 2000))
    _df = groupby(_df, :Year)
    _df = combine(_df, :Budget => (x -> mean(skipmissing(x))) => :Budget)
    _df = transform!(_df, :Budget => (x -> x / 1e6) => :Budget)
    return _df
end
# 1.257 ms (903 allocations: 1.78 MiB)
@btime f0($movies)

# chained DataFrames.jl
function f1(df)
    @chain df begin
        subset(:Year => (x -> x .>= 2000))
        groupby(:Year)
        combine(:Budget => (x -> mean(skipmissing(x))) => :Budget)
        transform(:Budget => (x -> x / 1e6) => :Budget)
    end
end
# 1.279 ms (905 allocations: 1.78 MiB)
@btime f1($movies)

# tidytable
function f2(df)
    @chain tidytable(df) begin
        @filter(Year >= 2000)
        @group_by(Year)
        @summarize(Budget = mean(Budget, na.rm = TRUE))
        @mutate(Budget = Budget / 1e6)
        collect()
    end
end
# 26.660 ms (118073 allocations: 5.40 MiB)
@btime f2($movies)

pdeffebach commented 1 year ago

Here are some other functions to benchmark, using DataFramesMeta.jl, which compiles to the same calls as DataFrames.jl but has nicer syntax (more in line with the tidyverse):

julia> using DataFramesMeta 

julia> function clean_tt(movies)
           @chain tidytable(movies) begin
               @filter(Year >= 2000)
               @group_by(Year)
               @summarize(Budget = mean(Budget, na.rm = TRUE))
               @mutate(Budget = Budget/1e6)
               collect()
           end
       end;

julia> function clean_df(movies)
           @chain movies begin
               subset(:Year => (x -> x .>= 2000))
               groupby(:Year)
               combine(:Budget => (x -> mean(skipmissing(x))) => :Budget)
               transform(:Budget => (x -> x/1e6) => :Budget)
           end
       end;

julia> function clean_dfm(movies)
           @chain movies begin
               @rsubset :Year >= 2000
               groupby(:Year)
               @combine :Budget = mean(skipmissing(:Budget))
               @rtransform :Budget = :Budget / 1e6
           end
       end;

kdpsingh commented 1 year ago

Thanks! I'll review and will fix. As I said on Twitter, this package shouldn't be taken too seriously because it's more of a learning project for me. That said, appreciate the advice on wrapping inside a function for benchmarking purposes.

Is the first run of the function any faster when it is wrapped in a function? Or only subsequent runs, because of precompilation?

I have read the Julia docs on optimization, so I would think the first run is still the same speed.
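
One way to check this directly is to time the first and the second call of a wrapped function. The following is only a minimal sketch using Base and Statistics; `budget_mean` and `data` are hypothetical stand-ins, not the movies pipeline from the README:

```julia
using Statistics

# Hypothetical stand-in for the benchmarked pipeline (not the movies data).
budget_mean(xs) = mean(skipmissing(xs)) / 1e6

data = [1.0e6, missing, 3.0e6]

# The first call includes JIT compilation of budget_mean for this argument type.
t_first = @elapsed budget_mean(data)

# The second call reuses the compiled method, so it measures run time only.
t_second = @elapsed budget_mean(data)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

The first call still pays the compilation cost whether or not the code is wrapped in a function. What wrapping changes is that later calls can reuse the compiled code. `@btime` reports the minimum over many samples, so its numbers exclude that one-time cost.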

I like both the DataFramesMeta and DataFrameMacros syntax. I wish there were a way to try to run a devectorized version of code first and then vectorize when required. While most times I want @rtransform, if I'm standardizing a variable, as in x - mean(x), then I need to remember to use @transform. That's more of a Julia style thing (borrowed from Matlab) than anything specific to do with DataFrames.jl. Loving playing with Julia.

Thanks again.

pdeffebach commented 1 year ago

Julia has no way to detect whether a transformation should be done column-wise or row-wise, unfortunately. You can't vectorize a function the way you do in R. But it might make sense to make row-wise the default and introduce @ctransform.
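
To illustrate the ambiguity, here is a small sketch with plain vectors, using only Base and Statistics. The names `standardize`, `scale`, and `budget` are hypothetical and chosen for illustration. The same function is valid both on a whole column and on a single element, so nothing in the code tells Julia which one was meant:

```julia
using Statistics

budget = [10.0, 20.0, 30.0]

# Column-wise: the function sees the whole vector, so mean(x) is the column mean.
standardize(x) = x .- mean(x)
col_result = standardize(budget)

# Row-wise: the function is broadcast over the elements one at a time.
scale(x) = x / 1e6
row_result = scale.(budget)

# Broadcasting the column-wise function is also valid Julia, but each element
# is compared with its own mean, so the result is silently all zeros.
wrong = standardize.(budget)

println(col_result)
println(wrong)
```

This corresponds to the x - mean(x) case mentioned above: the row-wise form runs without error and returns a different answer, which is why the choice has to be made explicitly by the user.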

bkamins commented 1 year ago

Let me add one more comment. This is the way to write this transformation with pure DataFrames.jl that is compiler-friendly (compilation was the major time cost in the README.md example):

julia> @time @chain movies begin
         subset(:Year => ByRow(>=(2000)), view=true)
         groupby(:Year)
         combine(:Budget => mean∘skipmissing => :Budget)
         transform!(:Budget => Base.Fix2(/, 1e6) => :Budget)
       end
  0.000462 seconds (670 allocations: 275.789 KiB)
6×2 DataFrame
 Row │ Year   Budget
     │ Int32  Float64
─────┼────────────────
   1 │  2000  23.9477
   2 │  2001  19.2356
   3 │  2002  19.3971
   4 │  2003  15.8683
   5 │  2004  13.9057
   6 │  2005  16.4682
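
One plausible reading of "compiler-friendly" here, offered as an assumption rather than a statement of the author's intent: every anonymous function written in the source has its own type, so code compiled for one is not reused for another, whereas `Base.Fix2`, `>=(2000)`, and `mean∘skipmissing` always have the same type. A minimal sketch using only Base and Statistics:

```julia
using Statistics

# Two textually identical anonymous functions still have distinct types,
# so methods specialized on f1 are not reused for f2.
f1 = x -> x / 1e6
f2 = x -> x / 1e6
println(typeof(f1) == typeof(f2))

# Fix2 objects built from the same function and value share one type.
g1 = Base.Fix2(/, 1e6)
g2 = Base.Fix2(/, 1e6)
println(typeof(g1) == typeof(g2))

# The same holds for partially applied comparisons and composed functions.
p = >=(2000)
c = mean ∘ skipmissing
println(typeof(p))
println(typeof(c))
```

At the REPL, re-running a pipeline that contains `x -> ...` creates fresh anonymous functions each time, which would trigger compilation again; the forms above avoid that.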

kdpsingh commented 1 year ago

Thank you so much! This is helpful for my learning, and it's very kind of you to take the time to share!

bkamins commented 1 year ago

Also, as was noted on Slack, using filter is faster than subset (but I assume you wanted to use subset).
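
For comparison, the two calls can be sketched side by side. The toy data frame below is hypothetical and stands in for the movies table; no timings are claimed for it:

```julia
using DataFrames

# Hypothetical toy data standing in for the movies table.
df = DataFrame(Year = [1999, 2000, 2003], Budget = [1.0e6, 2.0e6, 3.0e6])

# filter takes the predicate first and applies it to each row's :Year value.
by_filter = filter(:Year => >=(2000), df)

# subset takes the data frame first; the condition must yield a Bool vector,
# hence the ByRow wrapper.
by_subset = subset(df, :Year => ByRow(>=(2000)))

println(by_filter == by_subset)
```

Both return the same rows here. One difference to keep in mind is missing values: filter requires the predicate to return a Bool, while subset accepts a skipmissing keyword argument to treat missing as false.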

kdpsingh / TidyTable.jl

Review benchmarks #3