ropensci / unconf18

http://unconf18.ropensci.org/

Improved visualization for drake #12

Closed wlandau closed 6 years ago

wlandau commented 6 years ago

Current capabilities

As with many similar reproducible pipeline toolkits, the drake package can display the dependency networks of declarative workflows.

devtools::install_github("ropensci/drake")
library(drake)
load_basic_example() # Call make(my_plan) to run the project.
config <- drake_config(my_plan)
vis_drake_graph(config)

[Screenshot: interactive dependency graph of the basic example, rendered by vis_drake_graph()]

The visNetwork package powers interactivity behind the scenes. Click here for the true, interactive version of the above screenshot. There, you can hover, click, drag, zoom, and pan to explore the graph.

Start fresh and customize!

Using the dataframes_graph() function, you can directly access the network data, including the nodes, edges, and relevant metadata. That means you can build your own custom visualizations from a clean slate, without needing to develop drake itself.
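For example, here is a minimal sketch of pulling out that raw network data (assuming dataframes_graph() returns a list with nodes and edges data frames; the exact column names may differ between drake versions):

library(drake)
load_basic_example()
config <- drake_config(my_plan)

net <- dataframes_graph(config)
str(net$nodes)  # one row per node: ids, labels, status, and other metadata
str(net$edges)  # one row per edge: from/to pairs describing the dependencies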

Unconf18 project ideas

Condensed graphs

Ref: https://github.com/ropensci/drake/issues/229. Network graphs of large workflows are cumbersome. Even with interactivity, graphs with hundreds of nodes are difficult to understand, and larger ones can max out a computer's memory and lag. Condensed graphs could potentially respond faster and more easily guide intuition. There are multiple approaches for simplifying, clustering, and downsizing. Examples:

EDIT: from https://github.com/ropensci/drake/issues/229#issuecomment-372308031, base drake is likely to support a rudimentary form of clustering. But a separate tool could account for nested groupings, and a shiny app could allow users to assign nodes to clusters interactively.
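For illustration, a minimal sketch of visNetwork's built-in group clustering on a toy network (the nodes, edges, and group labels below are made up and not drake's real graph); something along these lines could be one route to condensed graphs:

library(visNetwork)
library(magrittr)  # for %>%

nodes <- data.frame(
  id    = 1:4,
  label = c("small", "large", "regression1_small", "regression1_large"),
  group = c("data", "data", "regression1", "regression1")
)
edges <- data.frame(from = c(1, 2), to = c(3, 4))

visNetwork(nodes, edges) %>%
  visEdges(arrows = "to") %>%
  visClusteringByGroup(groups = "regression1")  # collapse the regression1 targets into one node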

Static graphs

Ref: https://github.com/ropensci/drake/issues/279. To print a visNetwork, you can either take a screenshot or export a file from RStudio's viewer pane. Either way, you need to go through a point-and-click tool or one of the screenshot tools @maelle mentioned in #11. Drake cannot yet create static images on its own, and such images could be crisper than screenshots and would enhance reproducible examples.
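In the meantime, a minimal sketch of one possible workaround (not a drake feature): write the widget to HTML with visNetwork::visSave() and rasterize it with the webshot package, which needs PhantomJS installed.

library(drake)
library(visNetwork)

load_basic_example()
config <- drake_config(my_plan)
graph <- vis_drake_graph(config)  # an interactive htmlwidget

visSave(graph, file = "drake_graph.html")                # self-contained HTML
# webshot::install_phantomjs()  # one-time setup if PhantomJS is missing
webshot::webshot("drake_graph.html", "drake_graph.png")  # static PNG snapshot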

Workflow plan generation

In drake, the declarative outline of a workflow is a data frame of targets and commands.

load_basic_example()
head(my_plan)

## # A tibble: 6 x 2
##   target            command                                                                      
##   <chr>             <chr>                                                                        
## 1 ""                "knit(knitr_in(\"report.Rmd\"), file_out(\"report.md\"), quiet = TRUE)"
## 2 small             simulate(48)                                                                 
## 3 large             simulate(64)                                                                 
## 4 regression1_small reg1(small)                                                                  
## 5 regression1_large reg1(large)                                                                  
## 6 regression2_small reg2(small)  

The make() function resolves the dependency network and builds the targets.

make(my_plan) 
## target large
## target small
## target regression1_large
## target regression1_small
## target regression2_large
## ...

Currently, users need to write code to construct workflow plans. (See drake_plan(), wildcard templating, and https://github.com/ropensci/drake/issues/233.) To begin a large project, I usually need to iterate between drake_plan() and vis_drake_graph() several times before all the nodes connect properly. A shiny app could interactively build an already-connected workflow graph and then generate a matching plan for make().
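For reference, a minimal sketch of what that hand-written iteration looks like today (simulate() and reg1() come from the basic example loaded above):

library(drake)
load_basic_example()  # provides simulate(), reg1(), reg2(), etc.

my_plan <- drake_plan(
  small = simulate(48),
  large = simulate(64),
  regression1_small = reg1(small),
  regression1_large = reg1(large)
)

# Check visually that all the nodes connect, then adjust the plan and repeat.
vis_drake_graph(drake_config(my_plan))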

Alternative graphical arrangements (re: https://github.com/ropensci/unconf18/issues/12#issuecomment-372220250)

The default graphical arrangement in drake can be counter-intuitive. The dependency graph shows how the targets and imports depend on each other, which is super important, but it is not necessarily the order in which these objects are used chronologically. For example, in this network from vis_drake_graph(), the reg1() function appears upstream from small even though reg1() takes small as an argument to build regression1_small. An optional "code graph" or "call graph" could better demonstrate the flow of execution during make().

Final (initial?) thoughts

Drake stands out from its many peers with its intense focus on R. R stands out because of its strong community and visualization power. Collaboration on visuals will really help drake shine and hopefully improve reproducible research.

cc @krlmlr, @AlexAxthelm, @dapperjapper, @kendonB, @rkrug

rkrug commented 6 years ago

One aspect of the graph that irritated me tremendously when I took part in @krlmlr's workshop was that functions and files are always on the left-hand side, and not at the level where they come in. For example, the function knit only comes in when generating `report.Rmd` in the last step - I intuitively would have put it on a level just before that final target and after the targets before it. The current arrangement makes sense, but when looking at functions which are used from one target to another, I would expect them with:

  • an arrow(s) going in from the target they are receiving
  • an arrow going out to the target they are creating

If a function is used multiple times, the function could be either repeated (risk of cluttering the graph) or just the arrows added (loss of clarity and information).

wlandau commented 6 years ago

One aspect of the graph that irritated me tremendously when I took part in @krlmlr's workshop was that functions and files are always on the left-hand side, and not at the level where they come in. For example, the function knit only comes in when generating `report.Rmd` in the last step - I intuitively would have put it on a level just before that final target and after the targets before it.

The current positioning deliberately shows the general order in which drake processes things. For most parallel backends, the graph is divided into embarrassingly parallel stages (columns in the graph) that execute in sequence. When we adopt better scheduling algorithms for all backends (https://github.com/ropensci/drake/issues/227, https://github.com/ropensci/drake/issues/285), the execution order will be less deterministic, at which point the graph should perhaps no longer try to communicate it in such detail (except maybe that all the imports will still be processed before any of the targets begin). So yes, we should rethink the horizontal arrangement of nodes to avoid those long distances.

when looking at functions which are used from one target to another, I would expect them with:

  • an arrow(s) going in from the target they are receiving
  • an arrow going out to the target they are creating

The main purpose of the arrows is to show dependency relationships. Yes, the reg1() function receives small as an argument, but small is not a dependency of reg1(). In other words, changes to reg1() should not trigger changes to small. Finding these dependency relationships and skipping up-to-date work are such crucial ideas for drake that I am extremely reluctant to change the connections or the directions of the arrows.

If a function is used multiple times, the function could be either repeated (risk of cluttering the graph) or just the arrows added (loss of clarity and information).

If we duplicate nodes this way, each duplicate will no longer be connected to all of its dependencies or reverse dependencies. If you are trying to see all the connections of an imported function, you would need to track down all the duplicates, which I think would be cumbersome and tedious.

wlandau commented 6 years ago

Alternatively, we do not need to cling to a single graphical arrangement all the time. Currently, the only graph we have is the dependency graph (same as the schedule graph until https://github.com/ropensci/drake/issues/283 is solved). We could optionally generate a "code graph" or a "call graph" with the relationships you described.

AlexAxthelm commented 6 years ago

One idea to consider, if we stick with igraph, would be to turn down the opacity for nodes that are not immediately up/downstream when we click on a target. I haven't looked to see if this is actually possible, but it would be a nice way to identify the immediate thread of a target of interest.

wlandau commented 6 years ago

Do you mean we should emphasize the extended neighborhood of a selected node instead of just thickening the edges of the order-1 neighborhood? (Kind of like vis_drake_graph(from = target, mode = "all") vs vis_drake_graph(from = target, mode = "all", order = 1)?) Absolutely.

By the way, drake uses igraph internally for speed but converts it to a visNetwork for visualization. Here, anything goes when it comes to graphing technology.

rkrug commented 6 years ago

The main purpose of the arrows is to show dependency relationships

@wlandau and I think it should stay that way, as it makes sense in the make() context. If I am not mistaken, you suggested offloading the visualization into an additional package, and I think that is the way to go - provide an interface so that visualizations can be created in an additional package and added without having to modify drake. The dependency graph should stay, to see which targets are outdated, but the others should be offloaded into a suggested package.

wlandau commented 6 years ago

Glad we are on the same page. I think the visuals of the dependency graph could also be part of a separate package. Seems like there is a lot more space to develop and experiment that way.

rkrug commented 6 years ago

The existing dependency graph is a valuable tool in identifying what is happening during make() and in identifying why and where things go wrong or targets are outdated. I would definitely keep it in drake. It is much easier for me to understand the dependencies if I see them than just read them.


wlandau commented 6 years ago

We can import and re-export any functionality we offload. Examples:

I like this approach because it lightens the code base and makes things easier and faster to test and maintain.

wlandau commented 6 years ago

Just realized I should elaborate. Let's take magrittr and dplyr as an example. The pipe operator is created and exported in the magrittr package. dplyr imports %>% from magrittr and then re-exports it. That way, %>% becomes available when you call library(dplyr). You don't need to load magrittr too. Going forward, it would be great to do the same thing with drake when it comes to visualization and high-performance computing.
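A minimal sketch of that pattern with roxygen2, the way dplyr's reexports.R handles the pipe (a hypothetical drake file would do the same with whatever visualization or HPC functions get offloaded):

# R/reexports.R in the downstream package
#' @importFrom magrittr %>%
#' @export
magrittr::`%>%`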

AlexAxthelm commented 6 years ago

The big advantages that I see to the examples that you listed are that:

  1. the user only needs to make a single call to library()
  2. the exposed namespace of the re-imported package is limited to one (or a few) functions, helping to avoid conflicts

This makes sense in the case of the examples you mentioned, because work on those packages is largely independent and orthogonal. However, it seems that drake's graphing/visualization facilities aren't something that can easily import the defaults from other packages, and they need to be pretty tightly managed. Maybe there will be an expansion at some point where drake can produce a generic network, which can be passed to a user's network visualizer of choice?

wlandau commented 6 years ago

dataframes_graph() supplies the generic network, and vis_drake_graph()/render_drake_graph() are much smaller functions by comparison. So perhaps you are right. I think the extended visualizers should be in Suggests: at the very least.

wlandau commented 6 years ago

@rkrug I added alternative graphical arrangements as another project idea. For the call graph, I would think it permissible to repeat mentions of imported functions because the dependency graph is something else entirely. The only issue I see is clutter.

violetcereza commented 6 years ago

If I can add to the clustering/condensed graphs point: it would be nice to have targets created by evaluate_plan() optionally be condensed into one node on the graph. Maybe this is something that can be done more cleanly with the DSL (ropensci/drake/issues/233), but there should be a way to hack it together with our current regime.

Given a plan object like

library(drake)
library(tibble)    # tribble()
library(magrittr)  # %>%

rules <- list(i__ = 1:10)
plan <- tribble(
  ~target, ~command,
  "x",     "rnorm(i__)",
  "y",     "exp(x_i__)"
) %>%
  evaluate_plan(rules = rules)

One should be able to pass the rules into the graphing function or something

vis_drake_graph(config, rules)

Then the code would group all x under one node called x_i__ and all y under a node y_i__. These nodes don't necessarily have to be expandable clusters. Perhaps the code would just do some naive text manipulation with target names to achieve this result.
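A minimal sketch of that naive text manipulation (the suffix pattern here is a guess based on how evaluate_plan() names expanded targets):

targets <- c("x_1", "x_2", "x_3", "y_1", "y_2", "y_3")
cluster <- sub("_[0-9]+$", "_i__", targets)  # "x_1" -> "x_i__", "y_3" -> "y_i__"
split(targets, cluster)                      # members of each condensed node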

wlandau commented 6 years ago

I think this is the most natural way to think about clusters of targets. Unfortunately, it may be out of scope for the unconference because we don't have a DSL yet, but I believe it is where we should aim.

AlexAxthelm commented 6 years ago

Couldn't evaluate_plan() add the rules to the grouping column discussed in ropensci/drake#229?

wlandau commented 6 years ago

I suppose it could, and it would make the nested clusters you suggested fall into place naturally. If we go forward with a group column for evaluate_plan() and friends, I think we should align on the future of the existing wildcard templating interface. @krlmlr expressed a preference to deprecate and remove it when we have the DSL. Personally, I would prefer to keep both interfaces because I think they can coexist without friction. But that is a discussion for another thread (perhaps https://github.com/ropensci/drake/issues/240).

khondula commented 6 years ago

@AlexAxthelm re: opacity, I assume this would be possible with the distances functionality in igraph, either to define a cluster around a node within a given number of links, or to highlight only things that are up/downstream of some selected node. And/or, if edges have properties, maybe a user could choose some to just turn off if the network is cluttered?

Apologies if this was already addressed elsewhere; I am new here and still getting caught up learning about how cool drake is!
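A minimal sketch with a toy igraph (not drake's real graph) of how distances() could pick out everything within k links of a node, leaving the rest as candidates for reduced opacity:

library(igraph)

g <- graph_from_literal(simulate --+ small, simulate --+ large,
                        reg1 --+ regression1_small, small --+ regression1_small)

k <- 1
d <- distances(g, v = "regression1_small", mode = "all")
faded <- colnames(d)[d["regression1_small", ] > k]  # nodes to dim
faded
#> [1] "simulate" "large"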

wlandau commented 6 years ago

For what it's worth, drake has a brand new deps_targets() function that can list all the nodes immediately upstream or immediately downstream in the dependency graph.

library(drake)
load_mtcars_example()
config <- drake_config(my_plan)
deps_targets(targets = c("small", "large"), config = config)
#> [1] "simulate"
deps_targets(targets = c("small", "large"), config = config, reverse = TRUE)
#> [1] "regression1_large" "regression1_small" "regression2_large"
#> [4] "regression2_small" "\"report.md\""

See the graph for that example here (from vis_drake_graph(config)).

I think the bigger challenge is to write the JavaScript for visNetwork to micromanage the click and/or hover events to actually render the opacity. Then again, I have not seriously programmed in JavaScript for 5 years, and I was never really that good at it.
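That said, visNetwork may already cover part of this without custom JavaScript: its visOptions(highlightNearest = ...) setting fades everything outside a clicked node's neighborhood. A minimal sketch on toy data (not wired into drake):

library(visNetwork)
library(magrittr)  # for %>%

nodes <- data.frame(id = 1:4, label = c("simulate", "small", "large", "reg1"))
edges <- data.frame(from = c(1, 1, 2), to = c(2, 3, 4))

visNetwork(nodes, edges) %>%
  visEdges(arrows = "to") %>%
  visOptions(highlightNearest = list(enabled = TRUE, degree = 2, hover = TRUE))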

wlandau commented 6 years ago

Anyway, I have been asked to close this thread. Unfortunately, I cannot physically be at the unconf, and most of the commenters on this thread are not attending either, so it would be difficult to make this project work on May 21-22. But let's talk more at https://github.com/ropensci/drake/issues/229 and especially https://github.com/ropensci/drake/issues/282.