Ruigi

Ruigi is a pipeline specialist, much like his Python counterpart, Luigi.

How to use

Ruigi is built on two simple concepts that you need to understand in order to use it: targets and tasks. The first is the target.

A target is an abstraction of the output of a computation that also provides methods for reading and writing. For example, a .csv file is a valid target, and Ruigi provides the CSVtarget class for it. To use this target, create it like this:

target <- CSVtarget$new("~/Desktop/output.csv")
target$exists() # [1] FALSE
target$write(iris)
target$exists() # [1] TRUE
identical(iris, target$read()) # [1] TRUE
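
To make the target contract concrete, here is a minimal base-R sketch of an object that exposes the same three methods used above (exists, write, read). It is not Ruigi's implementation: make_rds_target is a hypothetical helper, and it assumes storage through saveRDS/readRDS rather than CSV.

```r
# Hypothetical helper, not part of Ruigi: a closure-based "target"
# that mimics the exists/write/read interface using base R only.
make_rds_target <- function(path) {
  list(
    exists = function() file.exists(path),       # has the output been produced?
    write  = function(obj) saveRDS(obj, path),   # persist an R object
    read   = function() readRDS(path)            # load it back
  )
}

# tempfile() only returns a path; nothing is created until write() is called.
target <- make_rds_target(tempfile(fileext = ".rds"))
before <- target$exists()
target$write(iris)
after <- target$exists()
round_trip_ok <- identical(iris, target$read())
```

Any storage backend fits the same shape as long as it can answer "does the output exist?" and can read and write it.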

The second abstraction is a task. A task is a self-contained unit of computation that can later become part of a larger pipeline. When you define a pipeline, Ruigi automatically discovers the dependencies between tasks, determines the order of execution, and checks for cyclic dependencies.
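
The ordering and cycle check described above can be sketched as a simple topological sort in base R. This is an illustration, not Ruigi's code: order_tasks and the deps structure are hypothetical, and the sketch assumes the dependencies are already known by task name and that every dependency is itself listed as a task.

```r
# Hypothetical sketch, not Ruigi's code: order tasks so that each one
# comes after the tasks it depends on, and fail on cyclic dependencies.
# `deps` is a named list; each name is a task and each value is a
# character vector naming the tasks it depends on.
order_tasks <- function(deps) {
  remaining <- deps
  ordered <- character(0)
  while (length(remaining) > 0) {
    # A task is ready once all of its dependencies have been ordered.
    is_ready <- vapply(remaining, function(d) all(d %in% ordered), logical(1))
    ready <- names(remaining)[is_ready]
    if (length(ready) == 0) {
      # Nothing can make progress: the remaining tasks wait on each other.
      stop("Cyclic dependency among: ",
           paste(names(remaining), collapse = ", "))
    }
    ordered <- c(ordered, ready)
    remaining <- remaining[setdiff(names(remaining), ready)]
  }
  ordered
}

deps <- list(
  writer = "reader",       # writer consumes what reader produces
  reader = character(0)    # reader has no upstream tasks
)
run_order <- order_tasks(deps)
```

Here run_order puts reader before writer even though writer is listed first, and a pair of tasks that depend on each other makes order_tasks stop with an error.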

A task is defined by its input targets, the output target, and the computation that needs to be performed on those targets. Note that a task can have 0, 1 or many inputs, but it has to have exactly one output. If the output target for a task exists, the computation will not be run again, saving you time.
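
The skip-if-output-exists behaviour can be illustrated with a few lines of base R. This is a sketch under assumptions, not Ruigi's code: run_if_missing is a hypothetical helper that uses a file on disk as the output and saveRDS to write it.

```r
# Hypothetical sketch, not Ruigi's code: run a computation only when
# its output file does not exist yet.
run_if_missing <- function(output_path, compute) {
  if (file.exists(output_path)) {
    return("skipped")                # output is already there, do nothing
  }
  saveRDS(compute(), output_path)    # compute and persist the result
  "ran"
}

out <- tempfile(fileext = ".rds")
first  <- run_if_missing(out, function() summary(iris))
# The second call never evaluates its computation, because the output exists.
second <- run_if_missing(out, function() stop("never called"))
```

The first call returns "ran" and the second returns "skipped". Deleting the output file is what forces the computation to run again.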

By defining separate abstractions for computation modules (tasks) and inspectable outputs (targets), you can build your own library of data processing steps and combine them into pipelines for different use cases.

Example

# Prepare an example dataset called titanic.csv
download.file("https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv",
              destfile = "./titanic.csv")

# Task definitions can be logically organized into folders and then
# `source`'d prior to defining the pipeline.
reader <- ruigi_task$new(
  requires = list(CSVtarget$new("./titanic.csv")),
  target = Rtarget$new("titanic_data"),
  name = "I will read a .csv file and store it on .ruigi_env",
  runner = function(requires, target) {
    out <- requires[[1]]$read()
    target$write(out)
  }
)

writer <- ruigi_task$new(
  requires = list(Rtarget$new("titanic_data")),
  target = CSVtarget$new("./output.csv"),
  name = "I will read a file from RAM and store it in a .csv",
  runner = function(requires, target) {
    out <- requires[[1]]$read()
    target$write(out)
  }
)

# Dependencies will be determined and the tasks will be run in the
# correct order, even though writer is listed first.
ruigi::pipeline(list(writer, reader))

# Running task:  I will read a .csv file and store it on .ruigi_env ✓
# Running task:  I will read a file from RAM and store it in a .csv ✓

# No need to run the tasks again, the results already exist
ruigi::pipeline(list(reader, writer))
# Skipping:  I will read a .csv file and store it on .ruigi_env
# Skipping:  I will read a file from RAM and store it in a .csv

Installation

if (!requireNamespace("devtools", quietly = TRUE)) { install.packages("devtools") }
devtools::install_github("avantcredit/AWS.tools")
devtools::install_github("kirillseva/cacher")
devtools::install_github("robertzk/s3mpi")
devtools::install_github("kirillseva/ruigi")
library(ruigi)

Inspiration

  1. Luigi. A very powerful and widely used Python package.
  2. Make. Classic.
  3. Remake. A make alternative for R. If you prefer to write R code as opposed to one-liners in .yml configs, you might enjoy using Ruigi!