Comparison with Netflix's Metaflow

wlandau commented 6 years ago

Metaflow is a new workflow tool developed by a team at Netflix. From @bcgalvin's presentation at useR 2018, it looks promising, powerful, fast, elegant, and sophisticated. From what I gather, it is not open source yet, but is likely to be down the road.

Some promising things I noticed about Metaflow that drake does not do natively:

Containerization. From what I gather, Metaflow targets run in containers automatically. drake can do this with the future backends, but it is not automatic.
Coupling with Amazon's cloud. For example, targets persist on S3, and this setup is designed to retain multiple runs of the same pipeline.
Collaboration on results. This is an explicit focus of Metaflow, which can apparently work with different instances of the same pipeline run by different people.
The ability to expose the pipeline as a REST API. I am not sure if drake + plumber can do this, but I do not expect it to be easy.
The ability to specify unique computing resources for each target (related: #169). I do not believe this will be possible in drake unless we have a parallel backend with transient clustermq workers, expose the template argument to clustermq functions, and skip clustermq's load balancing to manually assign targets to workers. Unfortunately, this is not likely to happen any time soon. (Edit: now possible, https://books.ropensci.org/drake/hpc.html#the-resources-column-for-transient-workers).

I am eager to try out Metaflow, and I am looking forward to finding answers to some of my questions.

1. How does containerization work in Metaflow? How do I reuse containers across multiple targets and multiple runs of the same pipeline?
2. How do I generate a Metaflow project with 10000+ targets? I assume there is a way to condense large chains of step()s with a variety of r_functions down to something manageable.
3. What kinds of graph visuals does Metaflow support?

I am closing this issue until Metaflow is released to the public. When that happens, we will need to rethink drake's place in the landscape of R-focused pipeline management.

wlandau commented 6 years ago

When Metaflow is released, we could also consider interacting with it from drake somehow. At the very least, we could convert drake_plan()s into Metaflow step() %>% step() %>% run() pipelines. Eventually, we might even go as far as make(parallelism = "Metaflow"), but the API converter should probably come first.

harryprince commented 6 years ago

@wlandau when will Metaflow be released?

wlandau commented 6 years ago

I do not know. I think that is a question for @bcgalvin and his team.

atronchi commented 6 years ago

Re Q1, I'm not sure of the details, but you might guess that titus and/or genie are involved in the container management:

https://netflix.github.io/titus/overview/
https://netflix.github.io/genie/about/

harryprince commented 5 years ago

@atronchi genie looks more like Airflow than like drake. Airflow is pretty powerful and convenient.

bcgalvin commented 4 years ago

Sorry for the very late response here, but I have some happy news: Metaflow has now been open sourced. repo docs.

I'm no longer working at Netflix, but I wrote the original R interface and will be involved in the open source release of the R API. I would love to know if any drake folks would be interested in this effort. @wlandau

wlandau commented 4 years ago

That's awesome! Congratulations on the release!

From a quick check of Google and social media, it looks like the team needs no help getting the word out, but it still might be worth submitting a PR to https://github.com/pditommaso/awesome-pipeline.

I am trying out some of the tutorials, and even though it has been a long time since I have used Python, the DSL is very easy to understand. Kudos for making it so smooth to install and get started.

My primary goals are to understand Metaflow and to be able to recommend when people should use Metaflow vs when they should use drake. I also want to look for opportunities to use Metaflow and drake together in the same workflow. After reading the docs, I feel acquainted enough now to begin thinking about this, but I will need to learn the R bindings and try out some Metaflow-powered distributed workflows before I can speak with confidence. I look forward to the release of the R API, and I will re-open this issue when it happens.

It seems like Metaflow and drake have different ways of thinking about workflows. From the tutorials, it looks like Metaflow intends to manage multiple complete runs of a project. Each call to python metaflow_script.py run executes all the steps from start to end, and resume clones a previous run and begins from the failed steps. However, resume does not seem to re-execute steps when I change dependencies, e.g. movies.csv in Episode 2. Am I missing something?

drake is primarily designed to manage changes. drake::make() automatically scans R code, targets, data, and configuration info, not only to build the DAG implicitly, but also to decide which targets to skip and which targets to (re)run. In other words, drake always works on the same "run" (with the added capability to track historical output on a target-by-target basis).
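
To make the "manage changes" idea concrete, here is a rough Python sketch of Make-style skipping. It is not drake's implementation and uses no drake or Metaflow API: the `make()`, `dependencies()`, and `digest()` helpers, the dict-based plan, and the in-memory cache are all hypothetical names invented for illustration. It assumes commands are plain Python expressions, that target values have a stable `repr()`, and that the plan has no cycles.

```python
import ast
import hashlib


def digest(text):
    """Hash a string so a change in it can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dependencies(command, plan):
    """Find the targets a command mentions by scanning its code (hypothetical helper)."""
    tree = ast.parse(command, mode="eval")
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(names & plan.keys())


def make(plan, cache):
    """Build only outdated targets and return the names that were (re)built.

    plan:  dict mapping target name -> Python expression given as a string
    cache: dict mapping target name -> (fingerprint, value), reused across calls
    """
    built = []

    def build(name):
        command = plan[name]
        deps = dependencies(command, plan)
        for dep in deps:
            build(dep)  # upstream targets are brought up to date first
        # The fingerprint covers the command text and the upstream values.
        upstream = [(dep, cache[dep][1]) for dep in deps]
        key = digest(repr((command, upstream)))
        if name in cache and cache[name][0] == key:
            return  # nothing relevant changed, so skip this target
        env = {dep: cache[dep][1] for dep in deps}
        cache[name] = (key, eval(command, {}, env))
        built.append(name)

    for name in plan:
        build(name)
    return built


plan = {"raw": "[1, 2, 3]", "total": "sum(raw)", "report": "total * 2"}
cache = {}

first = make(plan, cache)    # every target is new, so all three are built
second = make(plan, cache)   # nothing changed, so everything is skipped

plan["total"] = "sum(raw) + 1"
third = make(plan, cache)    # only `total` and its downstream `report` rerun
```

In this sketch there is only ever one "run": the cache is updated in place, and a second call does work only where a command or an upstream value changed. A tool oriented around complete runs would instead execute every step again and store the results as a new run.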

Another difference is that drake is not designed for cloud storage. Even in the distributed computing scenarios for which it was designed, it works best when the data store is local. This is one way in which drake and its users could benefit from working in tandem with Metaflow somehow if possible.

wlandau commented 4 years ago

A couple more notes. Re https://github.com/ropensci/drake/issues/472#issuecomment-458410783, a TL;DR for my previous comment might be that Metaflow looks more like Airflow and drake looks more like Make.

Also, re S3 storage, I would like to handle this in drake itself via https://github.com/ropensci/drake/issues/1112.
