jkrumbiegel / DataFrameMacros.jl

Macros that simplify working with DataFrames.jl
MIT License
61 stars 4 forks source link

DataFrameMacros.jl

CI Testing

Intro

DataFrames.jl has a special mini-language for data transformations, which is powerful but often verbose. Here's an example:

transform(df, :A => ByRow(x -> x + 1) => :B)

With DataFrameMacros.jl, you don't have to separately specify the input columns, the transformation function, and the output columns. You can just write it as a normal expression which is transformed to the mini-language for you:

@transform(df, :B = :A + 1)

DataFrameMacros.jl also helps you when you have to transform many columns in a similar way. Every expression in DataFrameMacros.jl is automatically executed once for every column in a multi-column specifier, such as All, Between, Not or regular expressions like r"cat|dog".

Here's how you could divide all columns of a DataFrame that start with "a" by 10 and add the suffix _div_10 to each new name:

@select(df, "{}_div_10" = {r"^a"} / 10)

To execute more complex name transformations on the left-hand side, you can use {n} expressions. For example, to concatenate each first column name with only the last split off part of each second column name in a multi-column transformation, you could write:

@select(df, {1} * split({2})[end] = {first_set_of_cols} + {second_set_of_cols})

You can also use multiple columns together as a Tuple with the double brace syntax, which is useful when you need to run an aggregation over those columns in an expression. In this example we keep all rows where the value in the :January column is larger than the median from :February to :December:

@subset(df, :January > median({{ Between(:February, :December) }}))

API

DataFrameMacros.jl exports these macros:

DataFrameMacros.jl compared to DataFramesMeta.jl

DataFramesMeta.jl is another package that provides macros for DataFrames.jl. The syntax looks similar in many cases, but here are some noteworthy differences:

If any of these points have changed, please open an issue.

Tip

You can use the separate package Chain.jl for writing chains of transformations, this way you don't have to repeat the DataFrame argument every time. This is similar to the tidyverse piping syntax that you might know from R:

@chain df begin
    @subset(:A > 50)
    @transform(:B = :A + :C)
    @groupby(iseven(:B), :D)
    @combine(:mean_E = mean(:E))
    @sort(abs(:F))
end