ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

Group together related commands in the graph visualization #229

Closed krlmlr closed 6 years ago

krlmlr commented 6 years ago

Can visNetwork visually group related commands (manually specified by the user) in a subgraph-like setting?

From https://graphviz.org:

[Image: screenshot from 2018-02-04 07-44-16]

wlandau commented 6 years ago

How should we choose the groupings? My intuition tells me that graph theory has a straightforward answer somewhere.

wlandau commented 6 years ago

#233 might preserve the patterns that expanded commands came from, which would help with grouping related commands here.

wlandau commented 6 years ago

Could features like this one bud into their own drake-focused visualization package? I believe drake should natively support basic network visualizations, but the possibilities are endless, and the code base will likely be long, complicated and difficult to test.

krlmlr commented 6 years ago

I was looking only for manual grouping, perhaps with a new column in the plan data frame?

wlandau commented 6 years ago

That sounds much easier.

wlandau commented 6 years ago

On the other hand, I have tried and failed to micromanage the vertical ordering of the nodes. Maybe it's because of the directed/leveled positioning and default Sugiyama igraph layout in render_drake_graph(), but from what I recall from early development, I doubt this feature will turn out well as long as we are using visNetwork. Here is where I think ggraph could help us. I'm not exactly sure about the implementation, but it should be straightforward with the output of dataframes_graph(). With ggraph, we lose interactivity, but there is a lot to gain in return.

krlmlr commented 6 years ago

I saw that vis.js can do clustering, but I'm not sure if it helps. What does the ggraph output for the basic example look like?

wlandau commented 6 years ago

Not sure yet, but eager to finally try out ggraph!

wlandau commented 6 years ago

It looks like ggraph may not have clustering, but I will search harder. visNetwork has visClusteringByGroup, though I am having trouble making more than one cluster at a time.

library(drake)
library(visNetwork)
con <- load_basic_example()
df <- dataframes_graph(con)
df$nodes
df$nodes$group <- paste0(df$nodes$status, "_", df$nodes$type)
g <- render_drake_graph(df)
visGroups(g, groupname = "imported_function") %>%
  visGroups(groupname = "outdated_object") %>%
  visClusteringByGroup(
    groups = c("imported_function", "outdated_object"))

[Image: capture]

krlmlr commented 6 years ago

Works for me with a variant of the ?visGroups example from visNetwork:

library(visNetwork)
nodes <- data.frame(id = 1:10, label = paste("Label", 1:10),
  group = sample(c("A", "B"), 10, replace = TRUE))
edges <- data.frame(from = c(2, 5, 10), to = c(1, 2, 10))

visNetwork(nodes, edges) %>%
  visLegend() %>%
  visGroups(groupname = "A", color = "red", shape = "database") %>%
  visGroups(groupname = "B", color = "yellow", shape = "triangle") %>%
  visClusteringByGroup(c("A", "B"))

wlandau commented 6 years ago

Thanks, Kirill! Is this the kind of clustering you were imagining? Do you think it would be enough to list all the target names in the cluster, maybe with the label argument of visClusteringByGroup()?

krlmlr commented 6 years ago

For a collapsed cluster I'd rather only see its label and not the detailed target names in the cluster. I haven't thought about clustering in interactive visualizations, but this does look useful. For graphviz-based renderers we can use a similar logic for specifying the groups, even if the display will look different (see first post).

AlexAxthelm commented 6 years ago

As a side note, to consider (after seeing the unconf thread) it might be worth searching the global environment for anything that looks like a drake plan (tibble with correct colnames would be a good start), and use that as a rough basis for clustering. I know that I usually have something along the lines of

data_plan <- drake_plan({importing data})
cleaning_plan <- drake_plan({cleaning functions})
analysis_plan <- drake_plan({analysis_functions})
reporting_plan <- drake_plan({reporting functions})
master_plan <- bind_rows(data_plan, cleaning_plan, analysis_plan, reporting_plan)
make(master_plan)

If I could see a simplified graph with 4 target-ish objects, so that I can easily tell how long importing takes, or where in the plan the make failed, I would be happy. Maybe collapsing sub-plans that haven't been touched yet, or that built successfully? This could be a non-default behavior, but if I have 1000+ targets that are all at least similar, it would be nice (and improve render time) if I didn't have to see them all.

wlandau commented 6 years ago

I like the general idea. Subplans define the natural clustering that I see most people using. People typically combine their plans with bind_rows() or similar.

bind_rows(data_plan, cleaning_plan, analysis_plan, reporting_plan)

What about bind_plans()?

master_plan <- bind_plans(
  data = data_plan,
  cleaning = cleaning_plan,
  analysis = analysis_plan,
  reporting = reporting_plan
)

bind_plans() would add an extra cluster or subplan column, where the names of the clusters would respect the argument names you provide. Eventually, we could even designate different future resource types to different subplans (re: #169).

I hesitate to search the user's environment for (sub)plans because it seems a bit mysterious.
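
A minimal sketch of the named-argument idea, assuming plans are ordinary data frames with target and command columns. The function name bind_named_plans() and the cluster column are hypothetical, chosen only for illustration; this is not drake's implementation.

```r
# Hypothetical helper: row-bind named subplans and record each
# subplan's argument name in a new `cluster` column.
bind_named_plans <- function(...) {
  plans <- list(...)
  # Every subplan must be passed as a named argument.
  stopifnot(
    length(plans) > 0,
    !is.null(names(plans)),
    all(nzchar(names(plans)))
  )
  labelled <- Map(
    function(plan, name) {
      plan$cluster <- rep(name, nrow(plan))
      plan
    },
    plans,
    names(plans)
  )
  out <- do.call(rbind, unname(labelled))
  rownames(out) <- NULL
  out
}

# Toy subplans (made-up targets and commands).
data_plan <- data.frame(
  target = c("raw_x", "raw_y"),
  command = c("read.csv('x.csv')", "read.csv('y.csv')"),
  stringsAsFactors = FALSE
)
analysis_plan <- data.frame(
  target = "fit",
  command = "lm(y ~ x, data = merge(raw_x, raw_y))",
  stringsAsFactors = FALSE
)

master_plan <- bind_named_plans(data = data_plan, analysis = analysis_plan)
master_plan$cluster
```

Because the cluster names come only from the argument names, nothing in the user's environment has to be searched.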

AlexAxthelm commented 6 years ago

I like the idea to explicitly label subplans. Having that be a separate column would also open the door to multiple levels of grouping.

rkrug commented 6 years ago

Also, when using bind_plans() one could add the name of the subplan as a prefix into the target name which would make it much easier to deal with duplicate target names in different sub-plans. This could even be disabled via an argument if not wished.
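
A rough sketch of the prefixing idea, assuming plans are plain data frames. The helper prefix_targets() and its prefix and sep arguments are hypothetical names for illustration, not part of drake.

```r
# Hypothetical helper: prefix each target with its subplan name so that
# identically named targets in different subplans stay distinct.
# Setting prefix = FALSE leaves the plan untouched.
prefix_targets <- function(plan, name, prefix = TRUE, sep = "_") {
  if (prefix) {
    plan$target <- paste(name, plan$target, sep = sep)
  }
  plan
}

# Two toy subplans that both define a target called "x".
cleaning_plan <- data.frame(
  target = c("x", "y"),
  command = c("clean(raw_x)", "clean(raw_y)"),
  stringsAsFactors = FALSE
)
analysis_plan <- data.frame(
  target = c("x", "fit"),
  command = c("scale(y)", "lm(y ~ x)"),
  stringsAsFactors = FALSE
)

combined <- rbind(
  prefix_targets(cleaning_plan, "cleaning"),
  prefix_targets(analysis_plan, "analysis")
)
anyDuplicated(combined$target)
```

Note that this only renames the targets. Commands that refer to a renamed target would need rewriting as well, which the sketch does not attempt.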

AlexAxthelm commented 6 years ago

To expand on my idea from above, now that I'm in front of a real keyboard, the idea would be to have multiple levels of grouping, along the lines of:

library(tibble)   # tribble()
library(magrittr) # %>%
plan <- tribble(
  ~target,   ~command,        ~group,
  "x",       "seq(1, 10)",    "import",
  "y",       "seq(10, 1)",    "import",
  "x_clean", "as.numeric(x)", "cleaning",
  "y_clean", "as.numeric(y)", "cleaning",
  "z",       "y + 10",        "analysis",
  "y_lm",    "lm(x ~ y)",     c("analysis", "linear"),
  "z_lm",    "lm(x ~ z)",     c("analysis", "linear"),
  "y_glm",   "glm(x ~ y)",    c("analysis", "general"),
  "z_glm",   "glm(x ~ z)",    c("analysis", "general")
) %>% print()
# A tibble: 9 x 3
#  target  command       group    
#  <chr>   <chr>         <list>   
#1 x       seq(1, 10)    <chr [1]>
#2 y       seq(10, 1)    <chr [1]>
#3 x_clean as.numeric(x) <chr [1]>
#4 y_clean as.numeric(y) <chr [1]>
#5 z       y + 10        <chr [1]>
#6 y_lm    lm(x ~ y)     <chr [2]>
#7 z_lm    lm(x ~ z)     <chr [2]>
#8 y_glm   glm(x ~ y)    <chr [2]>
#9 z_glm   glm(x ~ z)    <chr [2]>

so that if, for example, z_glm failed to build, the build graph would show the "import", "cleaning", and "linear" groups as groups, but expand the "general" group, so that I could see the failed object.

The underlying assumption here is that plans contain targets that act similarly, so if I have many similar objects, I don't need to see the details about them unless something is wrong.

A loose sketch of what I'm thinking: [image]
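
The expand-on-failure behavior described above could be sketched roughly as follows, in base R. This assumes the innermost (last) entry of each group vector is the one displayed, and that the names of failed targets are available in a vector; the variable names are hypothetical.

```r
# The plan from the comment above, reduced to targets and groups.
plan <- data.frame(
  target = c("x", "y", "x_clean", "y_clean", "z",
             "y_lm", "z_lm", "y_glm", "z_glm"),
  stringsAsFactors = FALSE
)
plan$group <- list(
  "import", "import", "cleaning", "cleaning", "analysis",
  c("analysis", "linear"), c("analysis", "linear"),
  c("analysis", "general"), c("analysis", "general")
)

# Assumption: names of the targets that failed to build.
failed <- "z_glm"

# Use the innermost (last) group as each target's display group.
innermost <- vapply(plan$group, function(g) g[length(g)], character(1))

# Groups containing at least one failed target get expanded.
expanded <- unique(innermost[plan$target %in% failed])

# Nodes to draw: individual targets in expanded groups,
# one collapsed node per group otherwise.
labels <- unique(ifelse(innermost %in% expanded, plan$target, innermost))
labels
```

With z_glm marked as failed, only the "general" group is shown target by target; every other group stays collapsed into a single node.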

wlandau commented 6 years ago

Great ideas, Alex! It seems like we could implement them in drake itself even before #282 is implemented. If we do it cleanly, not much in dataframes_graph() or vis_drake_graph() would need to change. We could just take the clusters from the group column.

Permitting multiple groups (for example, c("analysis", "general") for z_glm) is the most complicated thing. Off the top of my head, I don't know if it makes sense for a pre-#282 implementation. I wonder if visNetwork supports clusters within clusters...

AlexAxthelm commented 6 years ago

It appears that clustering in visNetwork is still experimental. http://datastorm-open.github.io/visNetwork/more.html

I think trying the one-level clustering would be a good first step. My machine won’t boot right now, or I would play around with it myself.

wlandau commented 6 years ago

Sure, that sounds like a good plan for base drake. We can allow multiple groups in bind_plans() and then use the first group listed for each target. Separate tools can extend this to account for multiple groups.
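
Taking the first group listed for each target is a one-liner when the groups live in a list column. A small sketch with made-up values:

```r
# Toy list column: each element holds one or more group names.
groups <- list(
  "import",
  c("analysis", "linear"),
  c("analysis", "general")
)

# Keep only the first group for each target
# (an empty entry would become NA).
first_group <- vapply(groups, function(g) g[1], character(1))
first_group
```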

wlandau commented 6 years ago

Re: https://github.com/ropensci/unconf18/issues/12#issuecomment-372709116, clusters are related to expansions and subplans in the DSL. cc @dapperjapper.

wlandau commented 6 years ago

I plan to start work on this in a new drakevis package once I have time to work on it in earnest.

wlandau commented 6 years ago

The cleanest solution I know falls right out of https://github.com/ropensci/drake/issues/376#issuecomment-402835393. Keeping wildcard information after expansion/evaluation seems massively useful for https://github.com/ropensci/drake/issues/229#issuecomment-372308031.

wlandau commented 6 years ago

6edf81686be7ad684cf0847a79ece55a60da0287 exposes all columns from the plan in drake_graph_info()$nodes, which gives us flexibility: clusters can be subplans, wildcards, etc. visNetwork clustering may not work out (https://github.com/datastorm-open/visNetwork/issues/254) but manual clustering should be straightforward.
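
One way manual clustering might look, sketched in base R on toy data. The nodes and edges data frames below only imitate the general shape of graph data (an id column, from/to columns); the cluster column and the collapse_clusters() helper are hypothetical, and the sketch assumes no cluster name collides with a node id.

```r
# Toy node and edge data frames. NA in `cluster` means "leave as is".
nodes <- data.frame(
  id = c("x", "y", "x_clean", "y_clean", "fit"),
  cluster = c("import", "import", "cleaning", "cleaning", NA),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("x", "y", "x_clean", "y_clean"),
  to = c("x_clean", "y_clean", "fit", "fit"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: merge all nodes that share a cluster into one node
# and reroute the edges accordingly.
collapse_clusters <- function(nodes, edges, by = "cluster") {
  # Map every node id to its cluster name, or to itself if unclustered.
  key <- ifelse(is.na(nodes[[by]]), nodes$id, nodes[[by]])
  names(key) <- nodes$id
  new_nodes <- data.frame(id = unique(unname(key)), stringsAsFactors = FALSE)
  new_edges <- data.frame(
    from = unname(key[edges$from]),
    to = unname(key[edges$to]),
    stringsAsFactors = FALSE
  )
  # Drop edges that now start and end in the same cluster, then duplicates.
  new_edges <- new_edges[new_edges$from != new_edges$to, , drop = FALSE]
  new_edges <- unique(new_edges)
  rownames(new_edges) <- NULL
  list(nodes = new_nodes, edges = new_edges)
}

collapsed <- collapse_clusters(nodes, edges)
collapsed$nodes
collapsed$edges
```

The collapsed data frames could then be handed to a renderer, for example visNetwork(collapsed$nodes, collapsed$edges), without relying on visNetwork's own clustering.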