Closed krlmlr closed 6 years ago
How should we choose the groupings? My intuition tells me that graph theory has a straightforward answer somewhere.
Could features like this one bud into their own drake-focused visualization package? I believe drake should natively support basic network visualizations, but the possibilities are endless, and the code base will likely be long, complicated, and difficult to test.
I was looking only for manual grouping, perhaps with a new column in the plan data frame?
That sounds much easier.
On the other hand, I have tried and failed to micromanage the vertical ordering of the nodes. Maybe it's because of the directed/leveled positioning and default Sugiyama igraph layout in render_drake_graph(), but from what I recall from early development, I actually doubt this feature will turn out well as long as we are using visNetwork. Here is where I think ggraph could help us. I'm not exactly sure about the implementation, but it should be straightforward with the output of dataframes_graph(). With ggraph, we lose interactivity, but there is a lot to gain in return.
I saw that vis.js can do clustering, but I'm not sure if it helps. What does the ggraph output for the basic example look like?
Not sure yet, but eager to finally try out ggraph!
It looks like ggraph may not have clustering, but I will search harder. visNetwork has visClusteringByGroup, though I am having trouble making more than one cluster at a time.
library(drake)
library(visNetwork)
con <- load_basic_example()
df <- dataframes_graph(con)
df$nodes
df$nodes$group <- paste0(df$nodes$status, "_", df$nodes$type)
g <- render_drake_graph(df)
visGroups(g, groupname = "imported_function") %>%
  visGroups(groupname = "outdated_object") %>%
  visClusteringByGroup(
    groups = c("imported_function", "outdated_object"))
Works for me with a variant of the ?visGroups example from visNetwork:
library(visNetwork)
nodes <- data.frame(id = 1:10, label = paste("Label", 1:10),
                    group = sample(c("A", "B"), 10, replace = TRUE))
edges <- data.frame(from = c(2, 5, 10), to = c(1, 2, 10))
visNetwork(nodes, edges) %>%
  visLegend() %>%
  visGroups(groupname = "A", color = "red", shape = "database") %>%
  visGroups(groupname = "B", color = "yellow", shape = "triangle") %>%
  visClusteringByGroup(c("A", "B"))
Thanks, Kirill! Is this the kind of clustering you were imagining? Do you think it would be enough to list all the target names in the cluster, maybe with the label argument of visClusteringByGroup()?
For a collapsed cluster I'd rather only see its label and not the detailed target names in the cluster. I haven't thought about clustering in interactive visualizations, but this does look useful. For graphviz-based renderers we can use a similar logic for specifying the groups, even if the display will look different (see first post).
As a side note (after seeing the unconf thread), it might be worth searching the global environment for anything that looks like a drake plan (a tibble with the correct column names would be a good start) and using that as a rough basis for clustering. I know that I usually have something along the lines of
data_plan <- drake_plan({importing data})
cleaning_plan <- drake_plan({cleaning functions})
analysis_plan <- drake_plan({analysis_functions})
reporting_plan <- drake_plan({reporting functions})
master_plan <- bind_rows(data_plan, cleaning_plan, analysis_plan, reporting_plan)
make(master_plan)
If I could see a simplified graph with 4 target-ish objects, so that I can easily tell how long importing takes or where in the plan the make failed, I would be happy. Maybe expanding/collapsing sub-plans that haven't been touched yet, or that built successfully? This could be non-default behavior, but if I have 1000+ targets that are all at least similar, it would be nice (and would improve render time) if I didn't have to see them all.
I like the general idea. Subplans define the natural clustering that I see most people using. People typically combine their plans with bind_rows() or similar.
bind_rows(data_plan, cleaning_plan, analysis_plan, reporting_plan)
What about bind_plans()?
master_plan <- bind_plans(
  data = data_plan,
  cleaning = cleaning_plan,
  analysis = analysis_plan,
  reporting = reporting_plan
)
bind_plans() would add an extra cluster or subplan column, where the names of the clusters would respect the argument names you provide. Eventually, we could even designate different future resource types to different subplans (re: #169).
I hesitate to search the user's environment for (sub)plans because it seems a bit mysterious.
I like the idea to explicitly label subplans. Having that be a separate column would also open the door to multiple levels of grouping.
Also, when using bind_plans() one could add the name of the subplan as a prefix into the target name which would make it much easier to deal with duplicate target names in different sub-plans. This could even be disabled via an argument if not wished.
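A rough sketch of the prefixing idea, assuming plans are plain data frames with a target column. The helper prefix_targets() and its prefix argument are hypothetical names for illustration. Note that this sketch only renames targets; commands that refer to those targets would also need rewriting, which is not attempted here.

```r
# Hypothetical helper: prepend the subplan name to each target name.
# Setting prefix = FALSE disables the renaming.
prefix_targets <- function(plan, subplan, prefix = TRUE) {
  if (prefix) {
    plan$target <- paste(subplan, plan$target, sep = "_")
  }
  plan
}

# Two subplans that both define a target called "model".
linear_plan <- data.frame(
  target = "model", command = "lm(x ~ y)", stringsAsFactors = FALSE
)
general_plan <- data.frame(
  target = "model", command = "glm(x ~ y)", stringsAsFactors = FALSE
)

combined <- rbind(
  prefix_targets(linear_plan, "linear"),
  prefix_targets(general_plan, "general")
)
print(combined$target)

# With prefix = FALSE the target name is left untouched.
unprefixed <- prefix_targets(linear_plan, "linear", prefix = FALSE)
```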
To expand on my idea from above, now that I'm in front of a real keyboard, the idea would be to have multiple levels of grouping, along the lines of:
plan = tribble(
  ~target,   ~command,        ~group,
  "x",       "seq(1, 10)",    "import",
  "y",       "seq(10, 1)",    "import",
  "x_clean", "as.numeric(x)", "cleaning",
  "y_clean", "as.numeric(y)", "cleaning",
  "z",       "y + 10",        "analysis",
  "y_lm",    "lm(x ~ y)",     c("analysis", "linear"),
  "z_lm",    "lm(x ~ z)",     c("analysis", "linear"),
  "y_glm",   "glm(x ~ y)",    c("analysis", "general"),
  "z_glm",   "glm(x ~ z)",    c("analysis", "general")
) %>% print()
# A tibble: 9 x 3
#  target  command       group
#  <chr>   <chr>         <list>
#1 x       seq(1, 10)    <chr [1]>
#2 y       seq(10, 1)    <chr [1]>
#3 x_clean as.numeric(x) <chr [1]>
#4 y_clean as.numeric(y) <chr [1]>
#5 z       y + 10        <chr [1]>
#6 y_lm    lm(x ~ y)     <chr [2]>
#7 z_lm    lm(x ~ z)     <chr [2]>
#8 y_glm   glm(x ~ y)    <chr [2]>
#9 z_glm   glm(x ~ z)    <chr [2]>
so that if, for example, z_glm failed to build, the build graph would show the "import", "cleaning", and "linear" groups as groups, but expand the "general" group, so that I could see the failed object.
The underlying assumption here is that plans contain targets that act similarly, so if I have many similar objects, I don't need to see the details about them unless something is wrong.
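The collapse-unless-failed rule could be sketched roughly as follows. This is an assumption-laden illustration, not drake code: the status column and its "done"/"failed" values are made up for the example, and each target is displayed under its innermost (last listed) group.

```r
# Illustrative plan with a list-column of groups and a made-up status column.
plan <- data.frame(
  target = c("x", "y", "x_clean", "y_clean", "z",
             "y_lm", "z_lm", "y_glm", "z_glm"),
  stringsAsFactors = FALSE
)
plan$group <- list(
  "import", "import", "cleaning", "cleaning", "analysis",
  c("analysis", "linear"), c("analysis", "linear"),
  c("analysis", "general"), c("analysis", "general")
)
plan$status <- c(rep("done", 8), "failed")  # hypothetical: z_glm failed

# Innermost group of each target (the last entry of its group vector).
innermost <- vapply(plan$group, function(g) g[[length(g)]], character(1))

# Groups containing at least one failed target get expanded.
failed_groups <- unique(innermost[plan$status == "failed"])

# Node to display: the target itself if its group is expanded,
# otherwise the collapsed group label.
display <- ifelse(innermost %in% failed_groups, plan$target, innermost)
print(unique(display))
```

With z_glm marked as failed, the "general" group is expanded into y_glm and z_glm while every other group stays collapsed.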
A loose sketch of what I'm thinking:
Great ideas, Alex! It seems like we could implement them in drake itself even before #282 is implemented. If we do it cleanly, not much in dataframes_graph() or vis_drake_graph() would need to change. We could just take the clusters from the group.
Permitting multiple groups (for example, c("analysis", "general") for z_glm) is the most complicated thing. Off the top of my head, I don't know if it makes sense for a pre-#282 implementation. I wonder if visNetwork supports clusters within clusters...
It appears that clustering in visNetwork is still experimental. http://datastorm-open.github.io/visNetwork/more.html
I think trying the one-level clustering would be a good first step. My machine won’t boot right now, or I would play around with it myself.
Sure, that sounds like a good plan for base drake. We can allow multiple groups in bind_plans() and then use the first group listed for each target. Separate tools can extend this to account for multiple groups.
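Picking the first group per target is a one-liner on a list-column. A small sketch, assuming groups are stored as a list of character vectors (the variable names are illustrative only):

```r
# Groups stored as a list of character vectors, one entry per target.
groups <- list(
  "import",
  c("analysis", "linear"),
  c("analysis", "general"),
  character(0)  # a target with no group at all
)

# Take the first group listed; targets without a group get NA.
first_group <- vapply(
  groups,
  function(g) if (length(g) > 0) g[[1]] else NA_character_,
  character(1)
)
print(first_group)
```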
Re: https://github.com/ropensci/unconf18/issues/12#issuecomment-372709116, clusters are related to expansions and subplans in the DSL. cc @dapperjapper.
I plan to start work on this in a new drakevis package once I have time to work on it in earnest.
The cleanest solution I know falls right out of https://github.com/ropensci/drake/issues/376#issuecomment-402835393. Keeping wildcard information after expansion/evaluation seems massively useful for https://github.com/ropensci/drake/issues/229#issuecomment-372308031.
6edf81686be7ad684cf0847a79ece55a60da0287 exposes all columns from the plan in drake_graph_info()$nodes, which gives us flexibility: clusters can be subplans, wildcards, etc. visNetwork clustering may not work out (https://github.com/datastorm-open/visNetwork/issues/254) but manual clustering should be straightforward.
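For what manual clustering might involve, here is a rough sketch that collapses a node/edge pair of data frames by a cluster column before rendering. Everything in it is an assumption for illustration: collapse_clusters() is a hypothetical helper, the nodes are assumed to have character id and cluster columns, and cluster labels are assumed not to collide with node ids.

```r
# Hypothetical helper: merge all nodes sharing a cluster label into one node
# and rewire the edges accordingly. Nodes with NA cluster are left alone.
collapse_clusters <- function(nodes, edges, by = "cluster") {
  cluster <- nodes[[by]]
  new_id <- ifelse(is.na(cluster), nodes$id, cluster)
  lookup <- setNames(new_id, nodes$id)  # old node id -> collapsed node id

  new_nodes <- data.frame(id = unique(new_id), stringsAsFactors = FALSE)
  new_nodes$label <- new_nodes$id

  new_edges <- data.frame(
    from = unname(lookup[edges$from]),
    to = unname(lookup[edges$to]),
    stringsAsFactors = FALSE
  )
  # Drop edges that now start and end in the same cluster, then deduplicate.
  new_edges <- new_edges[new_edges$from != new_edges$to, , drop = FALSE]
  new_edges <- unique(new_edges)
  rownames(new_edges) <- NULL

  list(nodes = new_nodes, edges = new_edges)
}

nodes <- data.frame(
  id = c("x", "y", "x_clean", "y_clean", "report"),
  cluster = c("import", "import", "cleaning", "cleaning", NA),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("x", "y", "x_clean", "y_clean"),
  to = c("x_clean", "y_clean", "report", "report"),
  stringsAsFactors = FALSE
)

collapsed <- collapse_clusters(nodes, edges)
print(collapsed$nodes)
print(collapsed$edges)
```

The collapsed nodes and edges keep the id/label and from/to columns that visNetwork() expects, so they could be passed to it like any other graph, without relying on its experimental clustering.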
Can visNetwork visually group related commands (manually specified by the user) in a subgraph-like setting?
From https://graphviz.org: