Douglass.jl is a package for manipulating DataFrames in Julia using a syntax that is very similar to Stata.
Note: Douglass.jl is in alpha, and may contain bugs. Please do try it out and report your experience. When using it in production, please check that the output is correct.
Douglass is not registered. To install, type ] at the Julia prompt to enter package mode, followed by
add https://github.com/jmboehm/Douglass.jl.git
using Douglass, RDatasets, DataFrames, DataFramesMeta
df = dataset("datasets", "iris")
# set the active DataFrame
Douglass.set_active_df(:df)
# create a variable `z` that is the sum of `SepalLength` and `SepalWidth`, for each row
d"gen :z = :SepalLength + :SepalWidth"
# replace `z` by the row index for the first 10 observations
d"replace :z = _n if _n <= 10"
# drop a variable
d"drop :z"
# construct the within-group sum for a subset of the observations
d"bysort :Species : egen :z = sum(:SepalLength) if :SepalWidth .> 3.0"
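For readers more familiar with DataFrames.jl than with Stata, the following is a rough sketch of what the four commands above do, written in plain DataFrames.jl rather than Douglass syntax. The toy data and its values are made up for illustration. The treatment of rows that fail the `if` condition (left as `missing`) follows Stata's convention and is an assumption about Douglass's behaviour, not something confirmed here.

```julia
using DataFrames

# Hypothetical stand-in for the iris data (values invented for illustration)
df = DataFrame(
    Species     = ["setosa", "setosa", "virginica", "virginica"],
    SepalLength = [5.1, 4.9, 6.3, 5.8],
    SepalWidth  = [3.5, 3.0, 3.3, 2.7],
)

# gen :z = :SepalLength + :SepalWidth
df.z = df.SepalLength .+ df.SepalWidth

# replace :z = _n if _n <= 2   (row index for the first rows; the column stays Float64)
df.z[1:2] .= 1:2

# drop :z
select!(df, Not(:z))

# bysort :Species : egen :z = sum(:SepalLength) if :SepalWidth .> 3.0
# Assumption: rows that fail the condition are left as missing, as in Stata.
df.z = Vector{Union{Missing, Float64}}(missing, nrow(df))
for sub in groupby(df, :Species)
    keep = sub.SepalWidth .> 3.0                  # rows of this group passing the condition
    sub.z[keep] .= sum(sub.SepalLength[keep])     # within-group sum over those rows only
end
```

The Douglass versions express each step in one line and track the active DataFrame for you; the sketch only spells out the semantics.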
- generate -- Creates a new variable and assigns the output from an expression to it.
- replace -- Replaces the content of a variable, but does not change the type.
- egenerate (or egen for short) -- Creates a new variable. Operates on vectors.
- ereplace (or erep for short) -- Analogous to egen, but replaces the values of existing variables.
- drop -- Drops the specified observations (if used in conjunction with if) or variables (without if).
- rename -- Renames a variable.
- sort -- Sorts the rows of the active DataFrame by the specified columns.
- reshape -- Reshapes the active DataFrame between wide and long format (reshape_long, reshape_wide).
- merge -- Merges the active DataFrame with another one in the local scope (merge_m1, merge_1m, merge_11).
- duplicates_drop -- Deletes duplicate rows, optionally considering only a subset of columns.

See the commands documentation page for more details on the syntax of these commands.
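As a point of reference for the data-management commands listed above, here is a minimal sketch of the corresponding operations in plain DataFrames.jl. This is not Douglass syntax. The mapping is an assumption based on the Stata commands of the same names, and the tables `df` and `info` are hypothetical.

```julia
using DataFrames

# Hypothetical toy data with one duplicated row
df = DataFrame(id = [2, 1, 1], y2019 = [20.0, 10.0, 10.0], y2020 = [21.0, 11.0, 11.0])

# duplicates_drop: keep one copy of each identical row
df = unique(df)

# sort: order the rows by the specified column
sort!(df, :id)

# rename: change a column name
rename!(df, :y2019 => :base)

# reshape, wide to long: one row per (id, variable) pair
long = stack(df, [:base, :y2020], :id)

# reshape, long to wide: back to one row per id
wide = unstack(long, :id, :variable, :value)

# merge, many-to-one: attach id-level attributes to every matching row
info = DataFrame(id = [1, 2], region = ["north", "south"])
merged = leftjoin(long, info, on = :id)
```

The exact Douglass syntax for each command is described on the commands documentation page.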
Press the backtick (`) to switch between the normal Julia REPL and the Douglass REPL mode.
Douglass supports multiline input on the active dataframe:
d"""
gen :x = 5
gen :y = 6
"""
The @douglass macro allows subsequent operations to be performed on one particular DataFrame:
using RDatasets
iris = dataset("datasets", "iris")
Douglass.@douglass iris """
gen :x = :SepalWidth + :PetalWidth
gen :y = 42
"""
These benchmarks were run on a synthetic dataset with 1 million observations, on my MacBook Pro (Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz, Julia 1.9.0, Stata/MP 17.0).
Please file bug reports as issues.
If you find the package useful or the idea promising, please consider giving it a star (at the top of the page).
Douglass.jl is named in honour of the economic historian Douglass North.