Closed. wlandau closed this issue 6 years ago.
When Metaflow is released, we could also consider interacting with it from `drake` somehow. At the very least, we could convert `drake_plan()`s into Metaflow `step() %>% step() %>% run()` pipelines. Eventually, we might even go as far as `make(parallelism = "Metaflow")`, but the API converter should probably come first.
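A rough sketch of what such a converter might target. The `flow()`, `step()`, and `run()` verbs below are assumed names for a hypothetical Metaflow R API, not a released interface; only the `drake` half is real:

```r
library(drake)

# A small drake plan we might want to translate.
plan <- drake_plan(
  raw   = read.csv("movies.csv"),
  model = fit_model(raw)
)

# Hypothetical output of a plan-to-Metaflow converter.
# flow(), step(), and run() are assumed names, not a real API:
# flow("movies") %>%
#   step("raw",   function() read.csv("movies.csv")) %>%
#   step("model", function(raw) fit_model(raw)) %>%
#   run()
```

The converter would mostly need to walk the plan's dependency graph and emit one `step()` per target in topological order.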
@wlandau when will Metaflow be released?
I do not know. I think that is a question for @bcgalvin and his team.
Re Q1, I'm not sure of the details, but you might guess that Titus and/or Genie are involved in the container management:

- https://netflix.github.io/titus/overview/
- https://netflix.github.io/genie/about/
@atronchi Genie looks like Airflow rather than `drake`. Airflow is pretty powerful and convenient.
Sorry for the very late response here, but I have some happy news: Metaflow has now been open sourced: repo, docs.
I'm no longer working at Netflix, but I wrote the original R interface and will be involved in the open source release of the R API. I would love to know if any `drake` folks would be interested in this effort. @wlandau
That's awesome! Congratulations on the release!
From a quick check of Google and social media, it looks like the team needs no help getting the word out, but it still might be worth submitting a PR to https://github.com/pditommaso/awesome-pipeline.
I am trying out some of the tutorials, and even though it has been a long time since I have used Python, the DSL is very easy to understand. Kudos for making it so smooth to install and get started.
My primary goals are to understand Metaflow and to be able to recommend when people should use Metaflow vs. when they should use `drake`. I also want to look for opportunities to use Metaflow and `drake` together in the same workflow. After reading the docs, I feel acquainted enough now to begin thinking about this, but I will need to learn the R bindings and try out some Metaflow-powered distributed workflows before I can speak with confidence. I look forward to the release of the R API, and I will re-open this issue when it happens.
It seems like Metaflow and `drake` have different ways of thinking about workflows. From the tutorials, it looks like Metaflow intends to manage multiple complete runs of a project. Each call to `python metaflow_script.py run` executes all the steps from start to end, and `resume` clones a previous run and begins from the failed steps. However, `resume` does not seem to re-execute steps when I change dependencies, e.g. `movies.csv` in Episode 2. Am I missing something?
`drake` is primarily designed to manage changes. `drake::make()` automatically scans R code, targets, data, and configuration info, not only to build the DAG implicitly, but also to decide which targets to skip and which targets to (re)run. In other words, `drake` always works on the same "run" (with the added capability to track historical output on a target-by-target basis).
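A minimal illustration of that change tracking, using the real `drake` API (the toy targets here are made up for the example):

```r
library(drake)

plan <- drake_plan(
  data  = c(1, 2, 3),
  total = sum(data)
)

make(plan)  # builds data, then total
make(plan)  # skips everything: all targets are up to date

plan2 <- drake_plan(
  data  = c(1, 2, 3, 4),  # the command for `data` changed
  total = sum(data)
)
make(plan2)  # reruns data, and reruns total because it depends on data
```

The second `make()` call is the key difference from `resume`: nothing is cloned or restarted, `drake` simply detects that nothing changed and does no work.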
Another difference is that `drake` is not designed for cloud storage. Even in the distributed computing scenarios for which it was designed, it works best when the data store is local. This is one way in which `drake` and its users could benefit from working in tandem with Metaflow somehow, if possible.
A couple more notes. Re https://github.com/ropensci/drake/issues/472#issuecomment-458410783, a TL;DR for my previous comment might be that Metaflow looks more like Airflow and `drake` looks more like Make.
Also, re S3 storage, I would like to handle this in `drake` itself via https://github.com/ropensci/drake/issues/1112.
Metaflow is a new workflow tool developed by a team at Netflix. From @bcgalvin's presentation at useR 2018, it looks promising, powerful, fast, elegant, and sophisticated. From what I gather, it is not open source yet, but it is likely to be down the road.
Some promising things I noticed about Metaflow that `drake` does not do natively:

- … `drake` can do this with the `future` backends, but it is not automatic.
- … `drake` + `plumber` can do this, but I do not expect it to be easy.
- The ability to specify unique computing resources for each target (related: #169). I do not believe this will be possible in `drake` unless we have a parallel backend with transient `clustermq` workers, expose the `template` argument to `clustermq` functions, and skip `clustermq`'s load balancing to manually assign targets to workers. Unfortunately, this is not likely to happen any time soon. (Edit: now possible, https://books.ropensci.org/drake/hpc.html#the-resources-column-for-transient-workers.)

I am eager to try out Metaflow, and I am looking forward to finding answers to some of my questions.

- … `step()`s with a variety of `r_function`s down to something manageable.

I am closing this issue until Metaflow is released to the public. When that happens, we will need to rethink `drake`'s place in the landscape of R-focused pipeline management.