One aspect of the graph that irritated me tremendously when I took part in @krlmlr's workshop was that functions and files are always on the left-hand side, not at the level where they come in. For example, the function `knit()` only comes in when generating `report.rmd` in the last step; I intuitively would have put it on a level just before that final target and after the targets before it. The current arrangement makes sense, but looking at functions which are used from one target to another, I would expect them to have:

- an arrow (or arrows) going in from the target they are receiving
- an arrow going out to the target they are creating

If a function is used multiple times, the function could be either repeated (risk of cluttering the graph) or just the arrows added (loss of clarity and information).
> One aspect of the graph that irritated me tremendously when I took part in @krlmlr's workshop was that functions and files are always on the left-hand side, not at the level where they come in. For example, the function `knit()` only comes in when generating `report.rmd` in the last step; I intuitively would have put it on a level just before that final target and after the targets before it.
The current positioning deliberately shows the general order in which `drake` processes things. For most parallel backends, the graph is divided into embarrassingly parallel stages (columns in the graph) that execute in sequence. When we adopt better scheduling algorithms for all backends (https://github.com/ropensci/drake/issues/227, https://github.com/ropensci/drake/issues/285), the execution order will be less deterministic, at which point the graph should perhaps no longer try to communicate it in such detail (except that all the imports will still be processed before any of the targets begin). So yes, we should rethink the horizontal arrangement of nodes to avoid those long distances.
> looking at functions which are used from one target to another, I would expect them to have:
>
> - an arrow (or arrows) going in from the target they are receiving
> - an arrow going out to the target they are creating
The main purpose of the arrows is to show dependency relationships. Yes, the `reg1()` function receives `small` as an argument, but `small` is not a dependency of `reg1()`. In other words, changes to `reg1()` should not trigger changes to `small`. Finding these dependency relationships and skipping up-to-date work are such crucial ideas for `drake` that I am extremely reluctant to change the connections or the directions of the arrows.
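A minimal sketch of that relationship, loosely based on the built-in mtcars example (the function bodies here are hypothetical stand-ins):

```r
library(drake)

# reg1() is an import: drake tracks changes to its body.
reg1 <- function(d) {
  lm(y ~ x, data = d)
}

# Hypothetical stand-in for the simulate() import in the mtcars example.
simulate <- function(n) {
  data.frame(x = rnorm(n), y = rnorm(n))
}

plan <- drake_plan(
  small = simulate(48),
  regression1_small = reg1(small)
)

# Both small and reg1() point to the *target* regression1_small,
# but small is not a dependency of reg1() itself.
```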
> If a function is used multiple times, the function could be either repeated (risk of cluttering the graph) or just the arrows added (loss of clarity and information).
If we duplicate nodes this way, each duplicate will no longer be connected to all of its dependencies or reverse dependencies. If you are trying to see all the connections of an imported function, you would need to track down all the duplicates, which I think would be cumbersome and tedious.
Alternatively, we do not need to cling to a single graphical arrangement all the time. Currently, the only graph we have is the dependency graph (same as the schedule graph until https://github.com/ropensci/drake/issues/283 is solved). We could optionally generate a "code graph" or a "call graph" with the relationships you described.
One idea to consider, if we stick with `igraph`, would be to turn down the opacity for nodes that are not immediately up/downstream when we click on a target. I haven't looked to see if this is actually possible, but it would be a nice way to identify the immediate thread of a target of interest.
Do you mean we should emphasize the extended neighborhood of a selected node instead of just thickening the edges of the order-1 neighborhood? (Kind of like `vis_drake_graph(from = target, mode = "all")` vs `vis_drake_graph(from = target, mode = "all", order = 1)`?) Absolutely.
By the way, `drake` uses `igraph` internally for speed but converts it to a `visNetwork` for visualization. Here, anything goes when it comes to graphing technology.
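As a side note, `visNetwork` ships a converter for exactly this handoff; a minimal sketch on a toy graph (the node names are placeholders, not `drake`'s actual internals):

```r
library(igraph)
library(visNetwork)

# Toy directed graph standing in for the internal igraph object.
ig <- graph_from_literal(small -+ regression1_small,
                         reg1  -+ regression1_small)

dat <- toVisNetworkData(ig)  # a list with $nodes and $edges data frames
visNetwork(nodes = dat$nodes, edges = dat$edges)
```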
> The main purpose of the arrows is to show dependency relationships

@wlandau, and I think it should stay that way, as it makes sense in the `make()` context. If I am not mistaken, you suggested offloading the visualization into an additional package, and I think that is the way to go: provide an interface so that visualizations can be created in an additional package and added without having to modify `drake`. The dependency graph should stay, to show which targets are outdated, but the others should be offloaded into a suggested package.
Glad we are on the same page. I think the visuals of the dependency graph could also be part of a separate package. Seems like there is a lot more space to develop and experiment that way.
The existing dependency graph is a valuable tool for identifying what is happening during `make()` and why and where things go wrong or targets become outdated. I would definitely keep it in `drake`. It is much easier for me to understand the dependencies if I see them than to just read them.
We can import and re-export any functionality we offload. Examples:

- `devtools` and `usethis`
- `dplyr` and `magrittr`
- `dplyr` and `tidyselect`
- `drake` and `tidyselect`
- `drake` and `workers` (not yet implemented)

I like this approach because it lightens the code base and makes things easier and faster to test and maintain.
Just realized I should elaborate. Let's take `magrittr` and `dplyr` as an example. The pipe operator is created and exported in the `magrittr` package. `dplyr` imports `%>%` from `magrittr` and then re-exports it. That way, `%>%` becomes available when you call `library(dplyr)`. You don't need to load `magrittr` too. Going forward, it would be great to do the same thing with `drake` when it comes to visualization and high-performance computing.
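For illustration, the usual roxygen2 pattern for this kind of re-export looks roughly like the following (the file name is hypothetical):

```r
# R/reexports.R in the importing package (e.g. dplyr)

#' Re-export the pipe so that library(dplyr) is enough to use %>%.
#' @importFrom magrittr %>%
#' @export
magrittr::`%>%`
```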
The big advantage that I see in the examples you listed is that:

1. the user only needs to make a single call to `library()`
2. the exposed namespace of the re-exported package is limited to one (or a few) functions, helping to avoid conflicts

This makes sense in the case of the examples you mentioned, because work on those packages is largely independent of and orthogonal to each other. However, it seems that `drake`'s graphing/visualization facilities aren't something that can easily import the defaults from other packages, and they need to be pretty tightly managed. Maybe there will be an expansion at some point where `drake` can produce a generic network, which can be passed to a user's network visualizer of choice?
`dataframes_graph()` supplies the generic network, and `vis_drake_graph()`/`render_drake_graph()` are much smaller functions by comparison. So perhaps you are right. I think the extended visualizers should be in `Suggests:` at the very least.
@rkrug I added alternative graphical arrangements as another project idea. For the call graph, I would think it permissible to repeat mentions of imported functions because the dependency graph is something else entirely. The only issue I see is clutter.
If I can add to the clustering/condensed graphs point: it would be nice for targets created by `evaluate_plan()` to optionally be condensed into one node on the graph. Maybe this is something that can be done more cleanly with the DSL (ropensci/drake#233), but there should be a way to hack it together under our current regime.

Given a plan object like
```r
library(drake)
library(magrittr)
library(tibble)

rules <- list(i__ = 1:10)

plan <- tribble(
  ~target, ~command,
  "x",     "rnorm(i__)",
  "y",     "exp(x_i__)"
) %>%
  evaluate_plan(rules = rules)
```
One should be able to pass the rules into the graphing function, or something like

```r
vis_drake_graph(config, rules)
```

Then the code would group all the `x` targets under one node called `x_i__` and all the `y` targets under a node called `y_i__`. These nodes don't necessarily have to be expandable clusters. Perhaps the code would just do some naive text manipulation with target names to achieve this result.
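As a rough sketch of that text manipulation (plain base R; the regular expression is just an assumption about how the expanded names look):

```r
# Collapse expanded target names back to their wildcard stems.
targets <- c("x_1", "x_2", "x_10", "y_1", "y_2")
stems <- sub("_[0-9]+$", "_i__", targets)  # "x_1" -> "x_i__", and so on

split(targets, stems)
#> $x_i__
#> [1] "x_1"  "x_2"  "x_10"
#>
#> $y_i__
#> [1] "y_1" "y_2"
```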
I think this is the most natural way to think about clusters of targets. Unfortunately, it may be out of scope for the unconference because we don't have a DSL yet, but I believe it is where we should aim.
Couldn't `evaluate_plan()` add the rules to the grouping column discussed in ropensci/drake#229?
I suppose it could, and it would make the nested clusters you suggested fall into place naturally. If we go forward with a `group` column for `evaluate_plan()` and friends, I think we should align on the future of the existing wildcard templating interface. @krlmlr expressed a preference to deprecate and remove it when we have the DSL. Personally, I would prefer to keep both interfaces because I think they can coexist without friction. But that is a discussion for another thread (perhaps https://github.com/ropensci/drake/issues/240).
@AlexAxthelm re: opacity. I assume this would be possible with the distances functionality in `igraph`, either to define a cluster around a node within a given number of links, or to highlight only things that are up/downstream of some selected node. And/or, if edges have properties, maybe a user could choose some to just turn off if the network is cluttered?

Apologies if this was already addressed elsewhere; I am new to getting caught up learning about how cool `drake` is!
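A minimal sketch of the `igraph::distances()` idea on a toy graph (the node names are just placeholders):

```r
library(igraph)

g <- graph_from_literal(small -+ regression1_small,
                        reg1  -+ regression1_small,
                        regression1_small -+ report)

# Shortest-path distances to "report", following edges backwards.
d <- distances(g, v = "report", mode = "in")

# Nodes within 1 link upstream (plus the node itself):
# keep these at full opacity and dim the rest.
colnames(d)[d <= 1]
#> [1] "regression1_small" "report"
```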
For what it's worth, `drake` has a brand new `deps_targets()` function that can list all the nodes immediately upstream or immediately downstream in the dependency graph.
```r
library(drake)
load_mtcars_example()
config <- drake_config(my_plan)

deps_targets(targets = c("small", "large"), config = config)
#> [1] "simulate"

deps_targets(targets = c("small", "large"), config = config, reverse = TRUE)
#> [1] "regression1_large" "regression1_small" "regression2_large"
#> [4] "regression2_small" "\"report.md\""
```
See the graph for that example here (from `vis_drake_graph(config)`).
I think the bigger challenge is to write the JavaScript for `visNetwork` to micromanage the click and/or hover events and actually render the opacity. Then again, I have not seriously programmed in JavaScript for 5 years, and I was never really that good at it.
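That said, `visNetwork`'s built-in `highlightNearest` option might already get most of the way there without custom JavaScript; a minimal sketch on a toy graph:

```r
library(magrittr)
library(visNetwork)

nodes <- data.frame(id = 1:3, label = c("small", "reg1", "regression1_small"))
edges <- data.frame(from = c(1, 2), to = c(3, 3))

# Clicking or hovering dims everything beyond `degree` links from the node.
visNetwork(nodes, edges) %>%
  visOptions(highlightNearest = list(enabled = TRUE, degree = 2, hover = TRUE))
```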
Anyway, I have been asked to close this thread. Unfortunately, I cannot physically be at the unconf, and most of the commenters on this thread are not attending either, so it would be difficult to make this project work on May 21-22. But let's talk more at https://github.com/ropensci/drake/issues/229 and especially https://github.com/ropensci/drake/issues/282.
Current capabilities
As with many similar reproducible pipeline toolkits, the `drake` package can display the dependency networks of declarative workflows. The `visNetwork` package powers the interactivity behind the scenes. Click here for the true, interactive version of the above screenshot. There, you can hover, click, drag, zoom, and pan to explore the graph.

Start fresh and customize!
Using the `dataframes_graph()` function, you can directly access the network data, including the nodes, edges, and relevant metadata. That means you can create your own custom visualizations without needing to develop `drake` itself. You can start from a clean slate and create your own fresh tool.
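A minimal sketch, assuming the built-in mtcars example and that `dataframes_graph()` returns visNetwork-ready nodes and edges:

```r
library(drake)

load_mtcars_example()
config <- drake_config(my_plan)

# Raw network data: nodes, edges, and metadata as data frames.
net <- dataframes_graph(config)
str(net$nodes)

# Hand the same data to the network visualizer of your choice, e.g.:
visNetwork::visNetwork(nodes = net$nodes, edges = net$edges)
```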
Unconf18 project ideas

Condensed graphs
Ref: https://github.com/ropensci/drake/issues/229. Network graphs of large workflows are cumbersome. Even with interactivity, graphs with hundreds of nodes are difficult to understand, and larger ones can max out a computer's memory and lag. Condensed graphs could potentially respond faster and more easily guide intuition. There are multiple approaches for simplifying, clustering, and downsizing.

EDIT: From https://github.com/ropensci/drake/issues/229#issuecomment-372308031, base `drake` is likely to support a rudimentary form of clustering. But a separate tool could account for nested groupings, and a `shiny` app could allow users to assign nodes to clusters interactively.

Static graphs
Ref: https://github.com/ropensci/drake/issues/279. To print a `visNetwork`, you can either take a screenshot or export a file from RStudio's viewer pane. Either way, you need to go through a point-and-click tool or one of the screenshot tools @maelle mentioned in #11. `Drake` cannot yet create static images on its own, and such images could be crisper than screenshots and would enhance reproducible examples.
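One possible route today, sketched under the assumption that the `webshot` package and PhantomJS are installed:

```r
library(drake)

load_mtcars_example()
config <- drake_config(my_plan)

# Save the interactive widget, then rasterize it to a static PNG.
widget <- vis_drake_graph(config)
visNetwork::visSave(widget, "graph.html")
webshot::webshot("graph.html", "graph.png")
```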
Workflow plan generation

In `drake`, the declarative outline of a workflow is a data frame of targets and commands. The `make()` function resolves the dependency network and builds the targets. Currently, users need to write code to construct workflow plans (see `drake_plan()`, wildcard templating, and https://github.com/ropensci/drake/issues/233). To begin a large project, I usually need to iterate between `drake_plan()` and `vis_drake_graph()` several times before all the nodes connect properly. A `shiny` app could interactively build an already-connected workflow graph and then generate a matching plan for `make()`.
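For context, a minimal sketch of that iterate-and-check loop (toy plan; the data file is a placeholder):

```r
library(drake)

plan <- drake_plan(
  raw = read.csv(file_in("data.csv")),  # hypothetical input file
  fit = lm(mpg ~ wt, data = raw)
)

config <- drake_config(plan)
vis_drake_graph(config)  # inspect: do all the nodes connect properly?
make(plan)               # if so, build the targets
```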
Alternative graphical arrangements (re: https://github.com/ropensci/unconf18/issues/12#issuecomment-372220250)
The default graphical arrangement in `drake` can be counter-intuitive. The dependency graph shows how the targets and imports depend on each other, which is super important, but it is not necessarily the order in which these objects are used chronologically. For example, in this network from `vis_drake_graph()`, the `reg1()` function appears upstream from `small` even though `reg1()` takes `small` as an argument to build `regression1_small`. An optional "code graph" or "call graph" could better demonstrate the flow of execution during `make()`.

Final (initial?) thoughts
`Drake` stands out from its many peers with its intense focus on R, and R stands out because of its strong community and visualization power. Collaboration on visuals will really help `drake` shine and hopefully improve reproducible research.

cc @krlmlr, @AlexAxthelm, @dapperjapper, @kendonB, @rkrug