Volcanito is an attempt to start standardizing the user-facing API that tables expose in Julia. Because that task is too ambitious for one person writing code in spurts every few months, the project is starting with something less ambitious:
Standardize on a set of user-facing macros that define primitive operations on tables:
    @select
    @where
    @group_by
    @aggregate_vector
    @order_by
    @limit
    @inner_join
    @left_join
    @right_join
    @outer_join
Lower those user-facing macros to objects that lazily represent those operations and can be used to build a simplified logical plan:
    Select
    Where
    GroupBy
    AggregateVector
    OrderBy
    Limit
    Join
Define a basic implementation of how to carry out the logical plan in terms of primitive operations on DataFrames from DataFrames.jl.
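The idea of lazy operation objects that compose into a plan, plus a separate step that carries the plan out, can be sketched in plain Julia. This is only an illustration under simplifying assumptions, not Volcanito's implementation: the names SketchNode, SketchSource, SketchWhere, SketchLimit, and run_plan are hypothetical, and a Vector of NamedTuples stands in for a DataFrame.

```julia
# Hypothetical sketch: lazy nodes that describe operations, plus a tiny executor.
# None of these names come from Volcanito itself.
abstract type SketchNode end

# Leaf node: wraps the source rows (a Vector of NamedTuples stands in for a table).
struct SketchSource <: SketchNode
    rows::Vector
end

# Lazily records a row filter; nothing is evaluated at construction time.
struct SketchWhere <: SketchNode
    source::SketchNode
    predicate::Function
end

# Lazily records a row limit.
struct SketchLimit <: SketchNode
    source::SketchNode
    n::Int
end

# Executor: walk the plan from the leaf upward, applying each operation.
run_plan(node::SketchSource) = node.rows
run_plan(node::SketchWhere) = filter(node.predicate, run_plan(node.source))
function run_plan(node::SketchLimit)
    rows = run_plan(node.source)
    return rows[1:min(node.n, length(rows))]
end

rows = [(a = 1, b = 2), (a = 3, b = 1), (a = 5, b = 4), (a = 0, b = 9)]

# Building the plan does no work; run_plan does all of it.
plan = SketchLimit(SketchWhere(SketchSource(rows), r -> r.a > r.b), 1)
result = run_plan(plan)
```

In this sketch the plan is built by calling constructors directly. A macro layer would sit in front of the constructors and translate expressions such as a > b into the predicate.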
For more details, see docs/architecture.md.
Volcanito is a project that I started to explore a few areas in the Julia data tools design space:
import Pkg
Pkg.activate(".")
import DataFrames: DataFrame
import Statistics: mean
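import Volcanito:
    @select,
    @where,
    @group_by,
    @aggregate_vector,
    @order_by,
    @limit,
    @inner_join
df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
    c = rand(Bool, 10_000),
)
# Keep columns a and b and add a derived column d.
@select(df, a, b, d = a + b)
# Keep only the rows where a exceeds b.
@where(df, a > b)
# Group by the negation of c, then compute per-group means and counts.
@aggregate_vector(
    @group_by(df, !c),
    m_a = mean(a),
    m_b = mean(b),
    n_a = length(a),
    n_b = length(b),
)
# Sort rows by the value of a + b.
@order_by(df, a + b)
# Keep at most 10 rows.
@limit(df, 10)
# Join df against its own per-group aggregates on the shared column c.
@inner_join(
    a = df,
    b = @aggregate_vector(
        @group_by(df, c),
        m_a = mean(a),
        m_b = mean(b),
        n_a = length(a),
        n_b = length(b),
    ),
    a.c == b.c,
)
# Aggregate over the whole table without grouping.
@aggregate_vector(df, m = mean(a))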
To make it easier to understand how things work, the examples above all exploit
the fact that Volcanito's user-facing macros construct LogicalNode objects that
automatically materialize the result of a query whenever Base.show is called.
This makes it seem as if the user-facing macros operate eagerly, but the truth
is that they operate lazily and produce LogicalNode objects rather than
DataFrames. If you want to transform a LogicalNode object into a full
DataFrame, you should explicitly call Volcanito.materialize.
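The "looks eager, is actually lazy" behavior can be reproduced in a few lines of plain Julia. The sketch below is an assumption-laden illustration rather than Volcanito's code: LazySum and evaluate are hypothetical names, with evaluate playing a role loosely analogous to Volcanito.materialize.

```julia
# Hypothetical sketch of a lazy node whose display triggers evaluation.
struct LazySum
    xs::Vector{Int}
end

# Explicit evaluation step, analogous in spirit to Volcanito.materialize.
evaluate(node::LazySum) = sum(node.xs)

# Displaying the node evaluates it first, so output at the REPL looks eager.
Base.show(io::IO, node::LazySum) = print(io, evaluate(node))

node = LazySum([1, 2, 3])
shown = sprint(show, node)   # the text that displaying the node would print
value = evaluate(node)       # the computed value itself
```

Here node is still a LazySum after being displayed; only the call to evaluate returns an Int. The same distinction applies to a LogicalNode and the DataFrame produced by materializing it.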
import Pkg
Pkg.activate(".")
import DataFrames: DataFrame
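import Volcanito:
    @select,
    materialize
df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
    c = rand(Bool, 10_000),
)
# The macro returns a lazy logical-plan node, not a DataFrame.
plan = @select(df, a, b, d = a + b)
typeof(plan)
# materialize runs the plan and returns a DataFrame.
df = materialize(plan)
typeof(df)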
To simplify working with data, the macros apply rewrite passes that automate several tedious tasks users would otherwise do manually.
Three-valued logic works even with short-circuiting Boolean operators:
import Pkg
Pkg.activate(".")
import DataFrames: DataFrame
import Volcanito: @where
df = DataFrame(
    a = [missing, 0.25, 0.5, 0.75],
    b = [missing, 0.75, 0.5, 0.25],
)
function f(x)
    println("Calling f(x) on x = $x")
    x + 1
end
@where(df, f(a) > 1.5 && f(b) >= 1.25)
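For context, plain Julia's && operator rejects missing operands, which is why a rewrite is needed for an expression like the one above. The sketch below uses only Base Julia, not Volcanito. The helper names and3 and counted are hypothetical, and the code shows one way a short-circuiting three-valued "and" could behave, not necessarily how Volcanito implements it.

```julia
# Plain Julia: `&&` requires Bool operands, so a `missing` condition throws.
threw = try
    missing && true
    false
catch err
    err isa TypeError
end

# Hypothetical helper: three-valued AND that still short-circuits.
# `rhs` is a zero-argument function so that it only runs when needed.
# Both operands are assumed to be `true`, `false`, or `missing`.
function and3(lhs, rhs::Function)
    lhs === false && return false   # false AND anything is false; skip rhs
    r = rhs()
    r === false && return false     # anything AND false is false
    return (ismissing(lhs) || ismissing(r)) ? missing : true
end

# Count how often the right-hand side is actually evaluated.
calls = Ref(0)
counted(x) = (calls[] += 1; x)

r1 = and3(false, () -> counted(true))     # short-circuits; rhs never runs
r2 = and3(missing, () -> counted(true))   # unknown AND true is unknown
r3 = and3(missing, () -> counted(false))  # unknown AND false is false
```

Wrapping the right-hand side in a closure is what keeps the short-circuit: when the left-hand side is false, the closure is never called.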
Local scalar variables can be interpolated/spliced into expressions:
import Pkg
Pkg.activate(".")
import DataFrames: DataFrame
import Volcanito: @where
df = DataFrame(
    a = [missing, 0.25, 0.5, 0.75],
    b = [missing, 0.75, 0.5, 0.25],
)
let x = 0.5
    @where(df, a >= $x)
end
As in SQL, Volcanito allows backticks to be used to indicate that an otherwise invalid identifier is a column name. This can be used when column names are derived from an expression without an alias:
import Pkg
Pkg.activate(".")
import DataFrames: DataFrame
import Volcanito: @select, @aggregate_vector
import Statistics: mean
df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
)
@select(
    @aggregate_vector(df, mean(a)),
    `mean(a)` + 1,
)
This trick means that the normal Julia syntax for generating a Cmd object is
not available: use the @cmd macro instead to achieve the same effect.
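The relationship between backticks and @cmd can be checked with Base Julia alone, outside of any Volcanito macro. The snippet is a small illustration; the closing remark about how a macro could repurpose backticks is an assumption about the general technique, not a description of Volcanito's internals.

```julia
# Outside of Volcanito's macros, backticks build a Cmd object as usual.
from_backticks = `echo hello`

# The @cmd macro from Base builds the same kind of object from a string.
from_macro = @cmd "echo hello"

# Both forms carry the same parsed command words.
same_words = from_backticks.exec == from_macro.exec

# The parser turns a backtick literal into a macro call whose last argument
# is the raw text between the backticks.
ex = Meta.parse("`mean(a)` + 1")
backtick_part = ex.args[2]
raw_text = backtick_part.args[end]
```

Because the text between the backticks arrives in the syntax tree as an ordinary string, a macro that receives such an expression is free to treat that string as a column name instead of a shell command.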
One challenge with metaprogramming approaches like the one Volcanito employs is that they can be difficult to use in functions where the column names to be computed against are not known statically. To address this, Volcanito further co-opts backtick syntax and combines it with interpolation syntax to indicate that column names are dynamic and known only at runtime. An example of using this capability in a function is shown below:
import Pkg
Pkg.activate(".")
import DataFrames: DataFrame
import Volcanito: @select, materialize
df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
)
function add_columns(df, x, y)
    @select(df, new_col = `$x` + `$y`)
end
add_columns(df, :a, :b)
isequal(
    materialize(@select(df, new_col = a + b)),
    materialize(add_columns(df, :a, :b)),
)
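The distinction between statically known and runtime column names also exists in plain Julia, independent of Volcanito. As a rough analogy only, the sketch below uses a NamedTuple as a stand-in for a table row; add_fields is a hypothetical helper.

```julia
# A NamedTuple stands in for one row of a table.
row = (a = 1.5, b = 2.0)

# Static access: the field names are written directly in the source code.
static_sum = row.a + row.b

# Dynamic access: the field names arrive as Symbols at runtime.
# `add_fields` is a hypothetical helper, not part of Volcanito.
add_fields(row, x::Symbol, y::Symbol) = getproperty(row, x) + getproperty(row, y)

dynamic_sum = add_fields(row, :a, :b)
```

Both forms compute the same value. The dynamic form is the one that works when the names are passed in as function arguments, which is the situation add_columns handles above.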