martin-borkovec / ggparty

Consider supporting easy visualization of terminal node sizes #16

Open henningsway opened 5 years ago

henningsway commented 5 years ago

I hope you don't mind my little feature suggestions. :)

I really like the approach taken by https://github.com/parrt/dtreeviz which visualizes Decision trees in a very clear and pleasing way. I like the visualization of cutpoints, but also the possibility to easily glimpse the size of the terminal nodes.

Good luck with the project!

martin-borkovec commented 5 years ago

No, I don't mind at all. Thanks for your interest in this project!

Yes, that's an interesting idea. I am going to keep it in mind for further development.

henningsway commented 5 years ago

Would there already be a way to turn the node plot into a pie chart (angles mapped to the proportions, size mapped to terminal node size)?

Tried to pass

geom_nodeplot(gglist = list(geom_bar(aes(x = "", fill = sex),
                                     position = position_dodge()) +
                                     coord_polar("y")
                            ))

but this doesn't seem to be the right approach. ;-/

martin-borkovec commented 5 years ago

Do you mean like this? I had to add a new "nodesize" setting for width and height. Strictly speaking, both are mapped to the log of the node size, since the raw proportions are far too extreme.

Regarding your suggested code: be careful to separate the elements of the gglist argument with commas, not with +. It has to be an ordinary list. I know this may be a pitfall for new users.

library(MASS)
library("partykit")
#> Loading required package: grid
#> Loading required package: libcoin
#> Loading required package: mvtnorm
SexTest <- ctree(sex ~ ., data=Aids2)
library(ggparty)
#> Loading required package: ggplot2
ggparty(SexTest) +
  geom_edge() + 
  geom_edge_label() +
  geom_node_splitvar() +
  geom_nodeplot(gglist = list(geom_bar(aes(x = "", fill = sex),
                                       position = position_fill()),
                              coord_polar("y"),
                              theme_void()),
                width = "nodesize",
                height = "nodesize"
  )

Created on 2019-03-21 by the reprex package (v0.2.1)
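
The comma-versus-+ pitfall described above can be checked outside ggparty. The following is a minimal sketch, assuming only ggplot2 and MASS are installed; the names `layers`, `p` and `chained` are illustrative and not part of any package. It builds the same three components as an ordinary list and adds that list to a plain ggplot, which ggplot2 supports, so the pie layout can be previewed on the full data before it is passed to gglist.

```r
library(ggplot2)
library(MASS)  # provides the Aids2 data used in this thread

# An ordinary list: elements are separated by commas, never chained with `+`.
layers <- list(
  geom_bar(aes(x = "", fill = sex), position = position_fill()),
  coord_polar("y"),
  theme_void()
)

# ggplot2 accepts a list of components on the right-hand side of `+`,
# so the same list can be previewed on the whole data set.
p <- ggplot(Aids2) + layers

# Chaining the components with `+` leaves no ggplot object on the left-hand
# side. In the ggplot2 versions I am aware of this raises an error; tryCatch()
# keeps the sketch runnable either way and stores the message if there is one.
chained <- tryCatch(
  geom_bar(aes(x = "", fill = sex)) + coord_polar("y"),
  error = function(e) conditionMessage(e)
)
```

Printing `p` should show a single pie chart of sex over all observations, which is a quick way to confirm the list is well formed before using it per node.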

henningsway commented 5 years ago

This looks very good! I will try it very soon. :)

I think the dataset (and the resulting nodesizes) are quite imbalanced, which is why the choice of the log for the nodesize seems appropriate.

Leaving this transformation to the user is probably too verbose or difficult to implement (e.g. geom_nodeplot(width = log(nodesize)) or sth) I would think?

Maybe, instead of choosing both width and height, a single option (area or size) is what's needed in most use cases.

martin-borkovec commented 5 years ago

Leaving this transformation to the user is probably too verbose or difficult to implement (e.g. geom_nodeplot(width = log(nodesize)) or sth) I would think?

No, shouldn't be too troublesome to implement, I plan on doing this.

Maybe, instead of choosing both width and height, a single option (area or size) is what's needed in most use cases.

Not sure about that; separate width and height may also be very handy in many cases. But yes, adding another option that takes care of both at once is a good idea!

henningsway commented 5 years ago

I just took this for a test drive.

For a dataset with, say, 50k rows and about a dozen terminal nodes, the differences in node size (ranging from about 2000 to 5000) are currently barely visible. So a choice of transformation would be very useful in this case.
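
A small base-R calculation shows why the log scaling hides differences in this range. This is only a sketch: the node sizes are hypothetical values in the range mentioned above, and normalising by the largest node is an assumption made for illustration, not necessarily how ggparty scales the plots internally.

```r
# Hypothetical terminal node sizes in the range mentioned above.
nodesize <- c(2000, 3000, 4000, 5000)

# Relative plot size under each scaling, with the largest node set to 1.
raw_scale <- nodesize / max(nodesize)
log_scale <- log(nodesize) / max(log(nodesize))

round(raw_scale, 2)  # 0.4 0.6 0.8 1.0
round(log_scale, 2)  # 0.89 0.94 0.97 1.00
```

On the raw scale the smallest node is 40% of the largest, while on the log scale it is still about 89%, so the plots look nearly identical.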

PS: (unrelated) Is it possible to map the color of the edge_label to the variable selected?

martin-borkovec commented 5 years ago

For a dataset with, say, 50k rows and about a dozen terminal nodes, the differences in node size (ranging from about 2000 to 5000) are currently barely visible. So a choice of transformation would be very useful in this case.

Yes, I'd imagine.

PS: (unrelated) Is it possible to map the color of the edge_label to the variable selected?

What exactly do you mean? Like this?

library(MASS)
library("partykit")
#> Loading required package: grid
#> Loading required package: libcoin
#> Loading required package: mvtnorm
SexTest <- ctree(sex ~ ., data=Aids2)
library(ggparty)
#> Loading required package: ggplot2
ggparty(SexTest) +
  geom_edge(aes(col = splitvar), size = 1.5) + 
  scale_color_discrete(h.start = 100) +
  geom_edge_label() +
  geom_node_splitvar() +
  geom_nodeplot(gglist = list(geom_bar(aes(x = "", fill = sex),
                                       position = position_fill()),
                              coord_polar("y"),
                              theme_void()),
                width = "nodesize",
                height = "nodesize"
  )

Created on 2019-03-21 by the reprex package (v0.2.1)

henningsway commented 5 years ago

Awesome, I'll try this for the labels soon. Thank you!

martin-borkovec commented 5 years ago

Oh, sorry... I misread it. Here you go:

library(MASS)
library("partykit")
#> Loading required package: grid
#> Loading required package: libcoin
#> Loading required package: mvtnorm
SexTest <- ctree(sex ~ ., data=Aids2)
library(ggparty)
#> Loading required package: ggplot2
ggparty(SexTest) +
  geom_edge() + 
  scale_color_discrete(h.start = 100) +
  geom_edge_label(aes(col = splitvar)) +
  geom_node_splitvar() +
  geom_nodeplot(gglist = list(geom_bar(aes(x = "", fill = sex),
                                       position = position_fill()),
                              coord_polar("y"),
                              theme_void()),
                width = "nodesize",
                height = "nodesize"
  )

Created on 2019-03-21 by the reprex package (v0.2.1)

martin-borkovec commented 5 years ago

Update regarding node size: I removed the option of mapping width and height to node size and introduced a size argument instead, which modifies both values at once by the provided multiplier. It can be set to "nodesize" or "log(nodesize)".

General update: geom_nodeplot has been renamed to geom_node_plot.

library(MASS)
library("partykit")
#> Loading required package: grid
#> Loading required package: libcoin
#> Loading required package: mvtnorm
library(ggparty)
#> Loading required package: ggplot2
SexTest <- ctree(sex ~ ., data=Aids2)
ggparty(SexTest) +
  geom_edge() + 
  geom_edge_label() +
  geom_node_splitvar() +
  geom_node_plot(gglist = list(geom_bar(aes(x = "", fill = sex),
                                        position = position_fill()),
                               coord_polar("y"),
                               theme_void()),
                 size = "log(nodesize)"
  )

Created on 2019-03-26 by the reprex package (v0.2.1)
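
To decide between "nodesize" and "log(nodesize)" for a given tree, it can help to look at the terminal node sizes directly. The following is a minimal sketch, assuming partykit and MASS are installed; it uses partykit's predict() with type = "node" on the training data, and the names `node_id` and `sizes` are illustrative.

```r
library(MASS)      # provides the Aids2 data
library(partykit)

SexTest <- ctree(sex ~ ., data = Aids2)

# Terminal node id for every training observation.
node_id <- predict(SexTest, type = "node")

# Number of observations per terminal node, raw and on the log scale.
sizes <- table(node_id)
sizes
round(log(sizes), 2)
```

If the raw counts differ by orders of magnitude, the log option is likely the more readable choice; if they are within a small factor of each other, the raw option keeps the differences visible.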

henningsway commented 5 years ago

Let me test this soon and get back to you. :)

henningsway commented 5 years ago

Well, it works, and for me this issue would be solved! :)

Two additional thoughts: